* [PATCH 0/5]: soreuseport: Bind multiple sockets to the same port
@ 2013-01-14 20:00 Tom Herbert
  2013-01-14 20:29 ` David Miller
  2013-01-15  9:34 ` David Laight
  0 siblings, 2 replies; 13+ messages in thread
From: Tom Herbert @ 2013-01-14 20:00 UTC (permalink / raw)
  To: netdev, davem; +Cc: netdev, eric.dumazet


Rebasing the soreuseport patches to 3.8.  No material changes since
first posted.
---
These patches implement so_reuseport (the SO_REUSEPORT socket option)
for TCP and UDP.  For TCP, so_reuseport allows multiple listener
sockets to be bound to the same port.  In the case of UDP, so_reuseport
allows multiple sockets to bind to the same port.  To prevent port
hijacking, all sockets bound to the same port using so_reuseport must
have the same uid.  Received packets are distributed to the multiple
sockets bound to the same port using a 4-tuple hash.
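
For illustration, a minimal sketch of creating one such TCP listener;
the fallback define, port handling, backlog, and error handling are
assumptions, not part of the patches:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_REUSEPORT
#define SO_REUSEPORT 15		/* Linux value; libc headers may lag */
#endif

static int make_listener(unsigned short port)
{
	int one = 1;
	struct sockaddr_in addr;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	/* Must be set before bind() on every socket sharing the port. */
	if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0)
		goto fail;
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(port);
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		goto fail;
	if (listen(fd, 128) < 0)
		goto fail;
	return fd;
fail:
	close(fd);
	return -1;
}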

The motivating case for so_reuseport in TCP would be something like
a web server binding to port 80 and running with multiple threads, where
each thread might have its own listener socket.  This could be done
as an alternative to other models: 1) have one listener thread which
dispatches completed connections to workers; 2) accept on a single
listener socket from multiple threads.  In case #1 the listener thread
can easily become the bottleneck at high connection turnover rates.
In case #2, the proportion of connections accepted per thread tends
to be uneven under high connection load (assuming a simple event loop:
while (1) { accept(); process(); }), since wakeup does not promote fairness
among the sockets.  We have seen the disproportion be as high as a
3:1 ratio between the thread accepting the most connections and the
one accepting the fewest.  With so_reuseport the distribution is
uniform.

The TCP implementation has a problem in that the request sockets for a
listener are attached to that specific listener socket.  When a SYN is
received, a listener socket is chosen and a request structure is
created (SYN-RECV state).  If the subsequent ACK in the 3WHS is not
matched to the same listener by so_reuseport, the connection state is
not found (reset) and the request structure is orphaned.  This scenario
would occur when the number of listener sockets bound to a port changes
(new ones are added, or old ones closed).  We are looking for a
solution to this, maybe allowing multiple sockets to share the same
request table...

The motivating case for so_reuseport in UDP would be something like a
DNS server.  An alternative would be to recv on the same socket from
multiple threads.  As in the case of TCP, the load across these threads
tends to be disproportionate, and we also see a lot of contention on
the socket lock.  Note that SO_REUSEADDR already allows multiple UDP
sockets to bind to the same port; however, there is no provision to
prevent hijacking and nothing to distribute packets across all the
sockets sharing the same bound port.  This patch does not change the
semantics of SO_REUSEADDR, but provides a usable version of this
functionality for unicast.
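
For illustration again, a minimal sketch of the per-worker UDP socket
(same SO_REUSEPORT fallback define as in the TCP sketch above; names
and error handling are assumptions):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* One socket per worker thread, all bound to the same UDP port; the
 * kernel then spreads datagrams across them by 4-tuple hash, so the
 * packets of one client flow keep landing on the same worker. */
static int udp_worker_socket(unsigned short port)
{
	int one = 1;
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_addr.s_addr = htonl(INADDR_ANY),
		.sin_port = htons(port),
	};
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0 ||
	    bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}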


* Re: [PATCH 0/5]: soreuseport: Bind multiple sockets to the same port
  2013-01-14 20:00 [PATCH 0/5]: soreuseport: Bind multiple sockets to the same port Tom Herbert
@ 2013-01-14 20:29 ` David Miller
  2013-01-14 23:35   ` Vijay Subramanian
  2013-01-15  9:34 ` David Laight
  1 sibling, 1 reply; 13+ messages in thread
From: David Miller @ 2013-01-14 20:29 UTC (permalink / raw)
  To: therbert; +Cc: netdev, netdev, eric.dumazet

From: Tom Herbert <therbert@google.com>
Date: Mon, 14 Jan 2013 12:00:16 -0800 (PST)

> Rebasing the soreuseport patches to 3.8.  No material changes since
> first posted.

FWIW I'm fine with the basic premise of these changes, and will
happily apply them once all the details are worked out.

Thanks Tom.


* Re: [PATCH 0/5]: soreuseport: Bind multiple sockets to the same port
  2013-01-14 20:29 ` David Miller
@ 2013-01-14 23:35   ` Vijay Subramanian
  2013-01-15  1:33     ` Tom Herbert
  0 siblings, 1 reply; 13+ messages in thread
From: Vijay Subramanian @ 2013-01-14 23:35 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, netdev, netdev, eric.dumazet

On 14 January 2013 12:29, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <therbert@google.com>
> Date: Mon, 14 Jan 2013 12:00:16 -0800 (PST)
>
>> Rebasing the soreuseport patches to 3.8.  No material changes since
>> first posted.

Tom,
I am not sure if this series was just an RFC or if you are already
aware of this, but I got the following errors when compiling with
CONFIG_NETFILTER_TPROXY=m.

This is because the definitions of inet_lookup_listener and
inet6_lookup_listener have changed with your patch.

In file included from net/netfilter/nf_tproxy_core.c:19:
include/net/netfilter/nf_tproxy_core.h: In function ‘nf_tproxy_get_sock_v4’:
include/net/netfilter/nf_tproxy_core.h:86: error: too few arguments to
function ‘inet_lookup_listener’
include/net/netfilter/nf_tproxy_core.h: In function ‘nf_tproxy_get_sock_v6’:
include/net/netfilter/nf_tproxy_core.h:155: warning: passing argument
5 of ‘inet6_lookup_listener’ makes pointer from integer without a cast
include/net/inet6_hashtables.h:72: note: expected ‘const struct
in6_addr *’ but argument is of type ‘int’
include/net/netfilter/nf_tproxy_core.h:155: error: too few arguments
to function ‘inet6_lookup_listener’
make[1]: *** [net/netfilter/nf_tproxy_core.o] Error 1
make[1]: *** Waiting for unfinished jobs....
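
For context: the series extends the listener lookup to take the
packet's source address and port so the 4-tuple hash can select among
the listeners, and callers such as nf_tproxy_core.h have to be updated
to pass them.  Roughly, as the change later landed upstream (the exact
prototypes in this posting may differ):

/* include/net/inet_hashtables.h, before the series (3.7): */
extern struct sock *__inet_lookup_listener(struct net *net,
					   struct inet_hashinfo *hashinfo,
					   const __be32 daddr,
					   const unsigned short hnum,
					   const int dif);

/* After: source address/port feed the reuseport 4-tuple hash. */
extern struct sock *__inet_lookup_listener(struct net *net,
					   struct inet_hashinfo *hashinfo,
					   const __be32 saddr,
					   const __be16 sport,
					   const __be32 daddr,
					   const unsigned short hnum,
					   const int dif);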


Thanks,
Vijay


* Re: [PATCH 0/5]: soreuseport: Bind multiple sockets to the same port
  2013-01-14 23:35   ` Vijay Subramanian
@ 2013-01-15  1:33     ` Tom Herbert
  0 siblings, 0 replies; 13+ messages in thread
From: Tom Herbert @ 2013-01-15  1:33 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: David Miller, netdev, netdev, eric.dumazet

On Mon, Jan 14, 2013 at 3:35 PM, Vijay Subramanian
<subramanian.vijay@gmail.com> wrote:
> On 14 January 2013 12:29, David Miller <davem@davemloft.net> wrote:
>> From: Tom Herbert <therbert@google.com>
>> Date: Mon, 14 Jan 2013 12:00:16 -0800 (PST)
>>
>>> Rebasing the soreuseport patches to 3.8.  No material changes since
>>> first posted.
>
> Tom,
> I am not sure if this series was just an RFC or if you are already
> aware of this, but I got the following errors when compiling with
> CONFIG_NETFILTER_TPROXY=m.
>
Thanks,
I'll fix that.

Tom

> This is because the definitions of inet_lookup_listener and
> inet6_lookup_listener have changed with your patch.
>
> In file included from net/netfilter/nf_tproxy_core.c:19:
> include/net/netfilter/nf_tproxy_core.h: In function ‘nf_tproxy_get_sock_v4’:
> include/net/netfilter/nf_tproxy_core.h:86: error: too few arguments to
> function ‘inet_lookup_listener’
> include/net/netfilter/nf_tproxy_core.h: In function ‘nf_tproxy_get_sock_v6’:
> include/net/netfilter/nf_tproxy_core.h:155: warning: passing argument
> 5 of ‘inet6_lookup_listener’ makes pointer from integer without a cast
> include/net/inet6_hashtables.h:72: note: expected ‘const struct
> in6_addr *’ but argument is of type ‘int’
> include/net/netfilter/nf_tproxy_core.h:155: error: too few arguments
> to function ‘inet6_lookup_listener’
> make[1]: *** [net/netfilter/nf_tproxy_core.o] Error 1
> make[1]: *** Waiting for unfinished jobs....
>
>
> Thanks,
> Vijay


* RE: [PATCH 0/5]: soreuseport: Bind multiple sockets to the same port
  2013-01-14 20:00 [PATCH 0/5]: soreuseport: Bind multiple sockets to the same port Tom Herbert
  2013-01-14 20:29 ` David Miller
@ 2013-01-15  9:34 ` David Laight
  2013-01-16 18:22   ` Tom Herbert
  1 sibling, 1 reply; 13+ messages in thread
From: David Laight @ 2013-01-15  9:34 UTC (permalink / raw)
  To: Tom Herbert, netdev, davem; +Cc: netdev, eric.dumazet



> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf Of Tom Herbert
> Sent: 14 January 2013 20:00
> To: netdev@vger.kernel.org; davem@davemloft.net
> Cc: netdev@markandruth.co.uk; eric.dumazet@gmail.com
> Subject: [PATCH 0/5]: soreuseport: Bind multiple sockets to the same port
> 
> Rebasing the soreuseport patches to 3.8.  No material changes since
> first posted.
> ---
> These patches implement so_reuseport (the SO_REUSEPORT socket option) for
> TCP and UDP.  For TCP, so_reuseport allows multiple listener sockets
> to be bound to the same port.
>....
> 
> The motivating case for so_reuseport in TCP would be something like
> a web server binding to port 80 and running with multiple threads, ...
> 2) accept on a single
> listener socket from multiple threads
> In case #2, the proportion of connections accepted per thread tends
> to be uneven under high connection load (assuming a simple event loop:
> while (1) { accept(); process(); }), since wakeup does not promote fairness
> among the sockets.  We have seen the disproportion be as high as a
> 3:1 ratio between the thread accepting the most connections and the
> one accepting the fewest.  With so_reuseport the distribution is
> uniform.

Hmmm.... do you need that sort of fairness between the threads?

If one request takes longer than average to process, then you
don't want other requests to be delayed when there are other
idle worker processes.

Also, having the same thread normally collect a request would
make it more likely that the required code/data will be in the
cache of that cpu (assuming that the main reason for multiple
threads is to load balance over multiple cpus, with each thread
tied to a single cpu).

If there are a lot of processes sleeping in accept() (on the same
socket) it might be worth looking at which is actually woken
when a new connection arrives. If they are sleeping in poll/select
it is probably more difficult (but not impossible) to avoid waking
all the processes for every incoming connection.

	David


* Re: [PATCH 0/5]: soreuseport: Bind multiple sockets to the same port
  2013-01-15  9:34 ` David Laight
@ 2013-01-16 18:22   ` Tom Herbert
  2013-01-17  9:53     ` David Laight
  2013-01-21  7:23     ` Li Yu
  0 siblings, 2 replies; 13+ messages in thread
From: Tom Herbert @ 2013-01-16 18:22 UTC (permalink / raw)
  To: David Laight; +Cc: netdev, davem, netdev, eric.dumazet

> Hmmm.... do you need that sort of fairness between the threads?
>
Yes :-)

> If one request takes longer than average to process, then you
> don't want other requests to be delayed when there are other
> idle worker processes.
>
On a heavily loaded server processing thousands of requests/second,
the law of large numbers hopefully applies: each connection
represents approximately the same unit of work.

> Also, having the same thread normally collect a request would
> make it more likely that the required code/data will be in the
> cache of that cpu (assuming that the main reason for multiple
> threads is to load balance over multiple cpus, with each thread
> tied to a single cpu).
>
Right.  Multiple listener sockets also imply that the work on the
connected sockets will be in the same thread, or at least dispatched to
a thread that is close to the same CPU.  soreuseport moves the start of
siloed processing into the kernel.

> If there are a lot of processes sleeping in accept() (on the same
> socket) it might be worth looking at which is actually woken
> when a new connection arrives. If they are sleeping in poll/select
> it is probably more difficult (but not impossible) to avoid waking
> all the processes for every incoming connection.

We had considered solving this within accept.  The problem is that
there's no way to indicate how much work a thread should do via
accept.  For instance, an event loop usually would look like:

while (1) {
    fd = accept();
    process(fd);
}

With multiple threads, the number of sockets accepted by a particular
thread is non-deterministic.  It is even possible that one thread
could end up accepting all the connections while the others are starved
(they wake up but have no connection to process).  Since connections are
the unit of work, this creates imbalance among threads.  There was an
attempt to fix this in user space by having threads that already have a
disproportionate number of connections sleep for a while instead of
calling accept.  This was unpleasant-- it needed shared state in user
space and provided no granularity.
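
For illustration, a sketch of the per-listener variant of that loop;
make_listener() stands for SO_REUSEPORT socket setup and process() for
the application's work, both hypothetical names:

#include <sys/socket.h>

extern int make_listener(unsigned short port);	/* SO_REUSEPORT setup */
extern void process(int fd);			/* application-defined */

static void *worker(void *arg)
{
	/* Each thread owns a private listener bound to the shared port,
	 * so it only ever accepts connections the kernel hashed to it. */
	int lfd = make_listener(*(unsigned short *)arg);

	while (1) {
		int fd = accept(lfd, NULL, NULL);

		if (fd >= 0)
			process(fd);
	}
	return NULL;
}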

Tom


* RE: [PATCH 0/5]: soreuseport: Bind multiple sockets to the same port
  2013-01-16 18:22   ` Tom Herbert
@ 2013-01-17  9:53     ` David Laight
  2013-01-17 14:27       ` Eric Dumazet
  2013-01-21  7:23     ` Li Yu
  1 sibling, 1 reply; 13+ messages in thread
From: David Laight @ 2013-01-17  9:53 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev, davem, netdev, eric.dumazet

> We had considered solving this within accept.  The problem is that
> there's no way to indicate how much work a thread should do via
> accept.  For instance, an event loop usually would look like:
> 
> while (1) {
>     fd = accept();
>     process(fd);
> }
> 
> With multiple threads, the number of sockets accepted by a particular
> thread is non-deterministic...

If your loop looks like that then each thread is only processing
a single socket and won't call accept() again until it is idle.

OTOH if each thread is processing multiple requests using
poll/select (or similar) at the top of the loop then a single
thread is likely to pick up a large number of connections.

Given that both poll and select are inefficient with very large
numbers of fds (every call is usually O(n) [1]), the kernel will
support some kind of event mechanism; maybe tweaking that to
signal the waiters in turn would also work - and be more general.

It might also be possible to do something on the user side of
sockets to generate additional fds with their own queues?
(IMHO some of the SCTP stuff should have been done that way).

	David

[1] I've known systems where it was actually O(n^2) because
the relevant kernel used a linked list to map fds to kernel
file structures! By the time you got to 1000 fds this dominated.


* RE: [PATCH 0/5]: soreuseport: Bind multiple sockets to the same port
  2013-01-17  9:53     ` David Laight
@ 2013-01-17 14:27       ` Eric Dumazet
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2013-01-17 14:27 UTC (permalink / raw)
  To: David Laight; +Cc: Tom Herbert, netdev, davem, netdev

On Thu, 2013-01-17 at 09:53 +0000, David Laight wrote:
> > We had considered solving this within accept.  The problem is that
> > there's no way to indicate how much work a thread should do via
> > accept.  For instance, an event loop usually would look like:
> > 
> > while (1) {
> >     fd = accept();
> >     process(fd);
> > }
> > 
> > With multiple threads, the number of sockets accepted by a particular
> > thread is non-deterministic...
> 
> If your loop looks like that then each thread is only processing
> a single socket and won't call accept() again until it is idle.
> 
> OTOH if each thread is processing multiple requests using
> poll/select (or similar) at the top of the loop then a single
> thread is likely to pick up a large number of connections.
> 
> Given that both poll and select are inefficient with very large
> numbers of fds (every call is usually O(n) [1]), the kernel will
> support some kind of event mechanism; maybe tweaking that to
> signal the waiters in turn would also work - and be more general.
> 
> It might also be possible to do something on the user side of
> sockets to generate additional fds with their own queues?
> (IMHO some of the SCTP stuff should have been done that way).

I hope you don't really believe Tom was going to explain how
a typical server is built around the accept() thing.

Linux has the epoll() mechanism, so the O(n) behavior of
poll()/select() is not relevant for modern applications.
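
(A minimal sketch of such an epoll-based accept loop, assuming lfd is
an already-listening socket and process() is the application's work,
both illustrative:)

#include <sys/epoll.h>
#include <sys/socket.h>

extern void process(int fd);	/* application-defined */

static void accept_loop(int lfd)
{
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
	int epfd = epoll_create1(0);

	epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);
	while (1) {
		struct epoll_event out;

		/* Cost scales with ready events, not registered fds. */
		if (epoll_wait(epfd, &out, 1, -1) > 0) {
			int fd = accept(lfd, NULL, NULL);

			if (fd >= 0)
				process(fd);
		}
	}
}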


* Re: [PATCH 0/5]: soreuseport: Bind multiple sockets to the same port
  2013-01-16 18:22   ` Tom Herbert
  2013-01-17  9:53     ` David Laight
@ 2013-01-21  7:23     ` Li Yu
  2013-01-21  7:58       ` Li Yu
  1 sibling, 1 reply; 13+ messages in thread
From: Li Yu @ 2013-01-21  7:23 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Laight, netdev, davem, netdev, eric.dumazet

On 2013-01-17 02:22, Tom Herbert wrote:
>> Hmmm.... do you need that sort of fairness between the threads?
>>
> Yes :-)
>
>> If one request takes longer than average to process, then you
>> don't want other requests to be delayed when there are other
>> idle worker processes.
>>
> On a heavily loaded server processing thousands of requests/second,
> the law of large numbers hopefully applies: each connection
> represents approximately the same unit of work.
>

These observations seem reasonable for some scenarios.  We backported
an old version of the SO_REUSEPORT patch into the RHEL6 2.6.32-220.x
kernel on our CDN platform, and it resulted in better-balanced CPU
utilization among haproxy instances.

Also, we ran a performance benchmark of the old SO_REUSEPORT.  It
indeed brings a significant improvement for short-connection
performance in some cases, but it also shows a performance regression
in others.  I think the problem is the random selection policy: the
selected listener may trigger extra CPU cache misses.  I tried writing
a SO_BINDCPU patch to directly use the RPS/RSS hash result to select
the listen fd, and the performance regression disappeared.  But I have
not sent it here since I did not implement the load-balancing feature
yet...
I will send the benchmark results soon.

>> Also, having the same thread normally collect a request would
>> make it more likely that the required code/data will be in the
>> cache of that cpu (assuming that the main reason for multiple
>> threads is to load balance over multiple cpus, with each thread
>> tied to a single cpu).
>>
> Right.  Multiple listener sockets also imply that the work on the
> connected sockets will be in the same thread, or at least dispatched to
> a thread that is close to the same CPU.  soreuseport moves the start of
> siloed processing into the kernel.
>
>> If there are a lot of processes sleeping in accept() (on the same
>> socket) it might be worth looking at which is actually woken
>> when a new connection arrives. If they are sleeping in poll/select
>> it is probably more difficult (but not impossible) to avoid waking
>> all the processes for every incoming connection.
>
> We had considered solving this within accept.  The problem is that
> there's no way to indicate how much work a thread should do via
> accept.  For instance, an event loop usually would look like:
>
> while (1) {
>      fd = accept();
>      process(fd);
> }
>
> With multiple threads, the number of sockets accepted by a particular
> thread is non-deterministic.  It is even possible that one thread
> could end up accepting all the connections while the others are starved
> (they wake up but have no connection to process).  Since connections are
> the unit of work, this creates imbalance among threads.  There was an
> attempt to fix this in user space by having threads that already have a
> disproportionate number of connections sleep for a while instead of
> calling accept.  This was unpleasant-- it needed shared state in user
> space and provided no granularity.
>

I also have some thoughts on this imbalance problem ...

At last, I assumed that every accept-thread holds the same number of
listen sockets, so we can just do load balancing based on the length
of the accept queue.
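
A rough kernel-side sketch of the idea, just to make it concrete
(sk_ack_backlog is the per-listener count of queued, not-yet-accepted
connections; the function name is made up):

#include <net/sock.h>

/* Among the listeners sharing the port, prefer the one whose accept
 * queue is currently shortest instead of picking purely by hash. */
static struct sock *reuseport_least_loaded(struct sock *socks[], int num)
{
	struct sock *best = socks[0];
	int i;

	for (i = 1; i < num; i++)
		if (socks[i]->sk_ack_backlog < best->sk_ack_backlog)
			best = socks[i];
	return best;
}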

Thanks for the great SO_REUSEPORT work.

> Tom


* Re: [PATCH 0/5]: soreuseport: Bind multiple sockets to the same port
  2013-01-21  7:23     ` Li Yu
@ 2013-01-21  7:58       ` Li Yu
  0 siblings, 0 replies; 13+ messages in thread
From: Li Yu @ 2013-01-21  7:58 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Laight, netdev, davem, netdev, eric.dumazet

On 2013-01-21 15:23, Li Yu wrote:
> On 2013-01-17 02:22, Tom Herbert wrote:
>>> Hmmm.... do you need that sort of fairness between the threads?
>>>
>> Yes :-)
>>
>>> If one request takes longer than average to process, then you
>>> don't want other requests to be delayed when there are other
>>> idle worker processes.
>>>
>> On a heavily loaded server processing thousands of requests/second,
>> the law of large numbers hopefully applies: each connection
>> represents approximately the same unit of work.
>>
>
> These observations seem reasonable for some scenarios.  We backported
> an old version of the SO_REUSEPORT patch into the RHEL6 2.6.32-220.x
> kernel on our CDN platform, and it resulted in better-balanced CPU
> utilization among haproxy instances.
>
> Also, we ran a performance benchmark of the old SO_REUSEPORT.  It
> indeed brings a significant improvement for short-connection
> performance in some cases, but it also shows a performance regression
> in others.  I think the problem is the random selection policy: the
> selected listener may trigger extra CPU cache misses.  I tried writing
> a SO_BINDCPU patch to directly use the RPS/RSS hash result to select
> the listen fd, and the performance regression disappeared.  But I have
> not sent it here since I did not implement the load-balancing feature
> yet...
> I will send the benchmark results soon.
>

These are the results of the performance benchmark of the old SO_REUSEPORT:

HW of testbed:

Summary:	Dell R720, 2 x Xeon E5-2680 0 2.70GHz, 31.4GB / 32GB 1600MHz DDR3
System:		Dell PowerEdge R720 (Dell 02P51C)
Processors:	2 x Xeon E5-2680 0 2.70GHz 8000MHz FSB (2 sockets x 8 cores x 2 threads)
Memory:		31.4GB / 32GB 1600MHz DDR3 == 8 x 4GB, 16 x empty
Network:	Chelsio Communications T420-CR Unified Wire Ethernet Controller
OS:		RHEL Server 6.2 (Santiago) x86_64, 64-bit
BIOS:		Dell 1.0.4 02/21/2012

processes/mode - number of worker processes / listen mode

		4/8/16 :  number of worker processes; each process is
			  bound to an individual processor.

		listen mode:
		 -s: RHEL6 without any extra patch
		 -r: RHEL6 with SO_REUSEPORT
		 -R: RHEL6 with both SO_REUSEPORT and SO_BINDCPU

64B|1x		- This benchmark suite simulates a simple RPC workload.
		  The client sends an RPC request, the server replies
		  with an RPC response (I call such a pair of messages
		  an RPC transaction), then the client sends the next
		  RPC request to start a new RPC transaction.

		   64B/1024B : both RPC requests/responses are 64/1024
			       bytes in length.
		   1x/1024x : each TCP connection carries 1 or 1024 RPC
			      transactions.

The numbers in the table below are in units of 10000 transactions
per second.

=====================================================================
processes/mode	64B|1x	64B|1024x	1024B|1x	1024B|1024x
=====================================================================
4/-s		18	80		17		78
---------------------------------------------------------------------
4/-r		16	71		15		67
---------------------------------------------------------------------
4/-R		23	96		23		92
---------------------------------------------------------------------
8/-s		18	165		18		160
---------------------------------------------------------------------
8/-r		30	155		29		147
---------------------------------------------------------------------
8/-R		36	185		36		180
---------------------------------------------------------------------
16/-s		15	230		14		220
---------------------------------------------------------------------
16/-r		38	230		38		220
---------------------------------------------------------------------
16/-R		43	230		43		220
---------------------------------------------------------------------

The above data are against the RHEL6 2.6.32-279.xx kernel.  I also
tested the upstream 3.6.2 kernel with these patches; the results are
similar.

Thanks

Yu

>>> Also, having the same thread normally collect a request would
>>> make it more likely that the required code/data will be in the
>>> cache of that cpu (assuming that the main reason for multiple
>>> threads is to load balance over multiple cpus, with each thread
>>> tied to a single cpu).
>>>
>> Right.  Multiple listener sockets also imply that the work on the
>> connected sockets will be in the same thread, or at least dispatched to
>> a thread that is close to the same CPU.  soreuseport moves the start of
>> siloed processing into the kernel.
>>
>>> If there are a lot of processes sleeping in accept() (on the same
>>> socket) it might be worth looking at which is actually woken
>>> when a new connection arrives. If they are sleeping in poll/select
>>> it is probably more difficult (but not impossible) to avoid waking
>>> all the processes for every incoming connection.
>>
>> We had considered solving this within accept.  The problem is that
>> there's no way to indicate how much work a thread should do via
>> accept.  For instance, an event loop usually would look like:
>>
>> while (1) {
>>      fd = accept();
>>      process(fd);
>> }
>>
>> With multiple threads, the number of sockets accepted by a particular
>> thread is non-deterministic.  It is even possible that one thread
>> could end up accepting all the connections while the others are starved
>> (they wake up but have no connection to process).  Since connections are
>> the unit of work, this creates imbalance among threads.  There was an
>> attempt to fix this in user space by having threads that already have a
>> disproportionate number of connections sleep for a while instead of
>> calling accept.  This was unpleasant-- it needed shared state in user
>> space and provided no granularity.
>>
>
> I also have some thoughts on this imbalance problem ...
>
> At last, I assumed that every accept-thread holds the same number of
> listen sockets, so we can just do load balancing based on the length
> of the accept queue.
>
> Thanks for the great SO_REUSEPORT work.
>
>> Tom
>


* [PATCH 0/5]: soreuseport: Bind multiple sockets to the same port
@ 2013-01-22 19:49 Tom Herbert
  2013-01-22 20:28 ` David Miller
  2013-01-25  5:06 ` Nick Jones
  0 siblings, 2 replies; 13+ messages in thread
From: Tom Herbert @ 2013-01-22 19:49 UTC (permalink / raw)
  To: netdev, davem; +Cc: netdev, eric.dumazet


This series implements so_reuseport (the SO_REUSEPORT socket option)
for TCP and UDP.  For TCP, so_reuseport allows multiple listener
sockets to be bound to the same port.  In the case of UDP, so_reuseport
allows multiple sockets to bind to the same port.  To prevent port
hijacking, all sockets bound to the same port using so_reuseport must
have the same uid.  Received packets are distributed to the multiple
sockets bound to the same port using a 4-tuple hash.

The motivating case for so_reuseport in TCP would be something like
a web server binding to port 80 and running with multiple threads, where
each thread might have its own listener socket.  This could be done
as an alternative to other models: 1) have one listener thread which
dispatches completed connections to workers; 2) accept on a single
listener socket from multiple threads.  In case #1 the listener thread
can easily become the bottleneck at high connection turnover rates.
In case #2, the proportion of connections accepted per thread tends
to be uneven under high connection load (assuming a simple event loop:
while (1) { accept(); process(); }), since wakeup does not promote fairness
among the sockets.  We have seen the disproportion be as high as a
3:1 ratio between the thread accepting the most connections and the
one accepting the fewest.  With so_reuseport the distribution is
uniform.

The TCP implementation has a problem in that the request sockets for a
listener are attached to that specific listener socket.  When a SYN is
received, a listener socket is chosen and a request structure is
created (SYN-RECV state).  If the subsequent ACK in the 3WHS is not
matched to the same listener by so_reuseport, the connection state is
not found (reset) and the request structure is orphaned.  This scenario
would occur when the number of listener sockets bound to a port changes
(new ones are added, or old ones closed).  We are looking for a
solution to this, maybe allowing multiple sockets to share the same
request table...

The motivating case for so_reuseport in UDP would be something like a
DNS server.  An alternative would be to recv on the same socket from
multiple threads.  As in the case of TCP, the load across these threads
tends to be disproportionate, and we also see a lot of contention on
the socket lock.  Note that SO_REUSEADDR already allows multiple UDP
sockets to bind to the same port; however, there is no provision to
prevent hijacking and nothing to distribute packets across all the
sockets sharing the same bound port.  This patch does not change the
semantics of SO_REUSEADDR, but provides a usable version of this
functionality for unicast.


* Re: [PATCH 0/5]: soreuseport: Bind multiple sockets to the same port
  2013-01-22 19:49 Tom Herbert
@ 2013-01-22 20:28 ` David Miller
  2013-01-25  5:06 ` Nick Jones
  1 sibling, 0 replies; 13+ messages in thread
From: David Miller @ 2013-01-22 20:28 UTC (permalink / raw)
  To: therbert; +Cc: netdev, netdev, eric.dumazet

From: Tom Herbert <therbert@google.com>
Date: Tue, 22 Jan 2013 11:49:46 -0800 (PST)

> This series implements so_reuseport (the SO_REUSEPORT socket option) for
> TCP and UDP.

Series applied, thanks Tom.


* Re: [PATCH 0/5]: soreuseport: Bind multiple sockets to the same port
  2013-01-22 19:49 Tom Herbert
  2013-01-22 20:28 ` David Miller
@ 2013-01-25  5:06 ` Nick Jones
  1 sibling, 0 replies; 13+ messages in thread
From: Nick Jones @ 2013-01-25  5:06 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev, davem, netdev, eric.dumazet

On Wednesday, January 23, 2013 03:49 AM, Tom Herbert wrote:
...
>
> The motivating case for so_reuseport in TCP would be something like
> a web server binding to port 80 and running with multiple threads, where
> each thread might have its own listener socket.  This could be done
> as an alternative to other models: 1) have one listener thread which
> dispatches completed connections to workers; 2) accept on a single
> listener socket from multiple threads.  In case #1 the listener thread
> can easily become the bottleneck at high connection turnover rates.
> In case #2, the proportion of connections accepted per thread tends
> to be uneven under high connection load (assuming a simple event loop:
> while (1) { accept(); process(); }), since wakeup does not promote fairness
> among the sockets.  We have seen the disproportion be as high as a
> 3:1 ratio between the thread accepting the most connections and the
> one accepting the fewest.  With so_reuseport the distribution is
> uniform.
>

There is another model for accepting connections in a multi-threaded
application that I experimented with: dup the listener fd once for
each thread; each thread then registers its copy in its own epoll set
and listens and accepts independently.
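
A sketch of that arrangement, assuming lfd is the established
(nonblocking) listener, each worker runs thread_main(), and process()
is the application's work (all names illustrative):

#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

extern void process(int fd);	/* application-defined */

static void *thread_main(void *arg)
{
	/* dup in the using thread's context; all copies still share one
	 * open file description and thus one kernel accept queue. */
	int myfd = dup(*(int *)arg);
	int epfd = epoll_create1(0);
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = myfd };

	epoll_ctl(epfd, EPOLL_CTL_ADD, myfd, &ev);
	while (1) {
		struct epoll_event out;

		if (epoll_wait(epfd, &out, 1, -1) > 0) {
			/* Nonblocking listener: losing the race to another
			 * thread just returns EAGAIN here. */
			int fd = accept(myfd, NULL, NULL);

			if (fd >= 0)
				process(fd);
		}
	}
	return NULL;
}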

Has anyone had experience with this strategy?  I'm sure that the
SO_REUSEPORT feature will lead to much better performance; I'm just
asking from the point of view of someone who doesn't have that feature
available.  I wonder if this strategy is like a poor man's SO_REUSEPORT?

The advantages of this approach were not fully proved in practice (I
didn't produce any hard figures), but in theory it was appealing:
- no bottleneck from using a single thread for accepting and then
distributing connections (in addition to the latency of waiting for the
job-handling thread to receive the event and start its work)
- connections were used in the thread in which they were accepted, so
locality was maintained (am I exaggerating this benefit?)
- when a new connection was received, all threads woke up to activity on
their respective fd copies and started accepting, I assume from the
same connection queue.  This was a disadvantage at low load levels, as
threads would often wake and find nothing to do, but as loads got
higher, strace showed that fewer threads would wake to disappointment.
- this approach was able to handle stress tests from a hardware packet
generator, which showed no dropped or unhandled connections.

One disadvantage I imagined was that the single socket and all of its
duplicates may find themselves attached to one cpu core or hardware
queue on the network adapter.  I don't know enough about the core net
internals to say for sure, but as a precaution the dup was done in the
context of the thread that would use the copy.

Just sharing and seeking comments.


Thread overview: 13+ messages
2013-01-14 20:00 [PATCH 0/5]: soreuseport: Bind multiple sockets to the same port Tom Herbert
2013-01-14 20:29 ` David Miller
2013-01-14 23:35   ` Vijay Subramanian
2013-01-15  1:33     ` Tom Herbert
2013-01-15  9:34 ` David Laight
2013-01-16 18:22   ` Tom Herbert
2013-01-17  9:53     ` David Laight
2013-01-17 14:27       ` Eric Dumazet
2013-01-21  7:23     ` Li Yu
2013-01-21  7:58       ` Li Yu
  -- strict thread matches above, loose matches on Subject: below --
2013-01-22 19:49 Tom Herbert
2013-01-22 20:28 ` David Miller
2013-01-25  5:06 ` Nick Jones
