SO_REUSEPORT?

netdev.vger.kernel.org archive mirror

* SO_REUSEPORT?
From: Tom Herbert @ 2008-08-07 16:57 UTC
  To: netdev

Hello,

We are looking at ways to scale TCP listeners.  I think what we would
like is the ability to listen on a port from multiple threads (sockets
bound to the same port, INADDR_ANY, and no interface binding), which is
what SO_REUSEPORT would seem to allow.  Has this ever been implemented
for Linux, or is there a good reason not to have it?
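
[Editor's sketch] To make the request concrete, here is a minimal
illustration of "several sockets bound to the same port", assuming a
platform and Python build that expose SO_REUSEPORT (the question above
is precisely whether Linux offers it). The helper name
make_reuseport_listener is hypothetical and not from the thread.

```python
import socket


def make_reuseport_listener(port, host="0.0.0.0", backlog=128):
    """Create a TCP listener that may share (host, port) with other listeners.

    Assumption: the platform defines SO_REUSEPORT. Every socket that shares
    the port has to set the option before calling bind().
    """
    if not hasattr(socket, "SO_REUSEPORT"):
        raise OSError("SO_REUSEPORT is not available on this platform")
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind((host, port))
        sock.listen(backlog)
    except OSError:
        sock.close()
        raise
    return sock
```

Each thread would call the helper with the same port and then run its
own accept() loop on the socket it got back. How connections are split
between those sockets is left to the stack.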

Thanks,
Tom

* Re: SO_REUSEPORT?
From: Rémi Denis-Courmont @ 2008-08-07 17:09 UTC
  To: Tom Herbert; +Cc: netdev

On Thursday 7 August 2008 19:57:15, Tom Herbert wrote:
> Hello,
>
> We are looking at ways to scale TCP listeners.  I think what we would
> like is the ability to listen on a port from multiple threads (sockets
> bound to the same port, INADDR_ANY, and no interface binding), which is
> what SO_REUSEPORT would seem to allow.  Has this ever been implemented
> for Linux, or is there a good reason not to have it?

On Linux, SO_REUSEADDR provides most of what SO_REUSEPORT provides on BSD.

In any case, there is absolutely no point in creating multiple TCP listeners. 
Multiple threads can accept() on the same listener - at the same time.
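
[Editor's sketch] The shared-listener approach described above can be
illustrated as follows. This is a small loopback demo, not a server
design: the names accept_loop and run_demo, the 0.1 s timeout and the
client count are all assumptions made for the example.

```python
import socket
import threading
import time


def accept_loop(listener, index, counts, stop):
    """Accept on the shared listener until asked to stop."""
    while not stop.is_set():
        try:
            conn, _addr = listener.accept()
        except socket.timeout:
            continue  # nothing pending; check the stop flag again
        except OSError:
            break  # listener was closed
        counts[index] += 1  # each thread only writes its own slot
        conn.close()


def run_demo(n_threads=4, n_clients=20):
    """Return how many connections each thread accepted."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(128)
    listener.settimeout(0.1)
    port = listener.getsockname()[1]

    counts = [0] * n_threads
    stop = threading.Event()
    threads = [
        threading.Thread(target=accept_loop, args=(listener, i, counts, stop))
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for _ in range(n_clients):
        socket.create_connection(("127.0.0.1", port)).close()

    # Wait (bounded) until every client connection has been accepted.
    deadline = time.monotonic() + 5.0
    while sum(counts) < n_clients and time.monotonic() < deadline:
        time.sleep(0.01)
    stop.set()
    for t in threads:
        t.join()
    listener.close()
    return counts
```

All connections end up accepted by some thread, but nothing in this
scheme controls which thread gets which connection, so the per-thread
counts returned by run_demo need not be even.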

-- 
Rémi Denis-Courmont
http://www.remlab.net/

* Re: SO_REUSEPORT?
From: Tom Herbert @ 2008-08-07 17:58 UTC
  To: netdev

> On Linux, SO_REUSEADDR provides most of what SO_REUSEPORT provides on BSD.
>
> In any case, there is absolutely no point in creating multiple TCP listeners.
> Multiple threads can accept() on the same listener - at the same time.
>

We've been doing that, but then on wakeup it would seem that we're at
the mercy of scheduling: basically, whichever thread wakes up first
gets to process the accept queue first.  This seems to bias towards
threads running on the same CPU where the wakeup is called, so this
method doesn't give us the even distribution of new connections across
the threads that we'd like.

Tom

* Re: SO_REUSEPORT?
From: Rick Jones @ 2008-08-07 18:17 UTC
  To: Tom Herbert; +Cc: netdev

Tom Herbert wrote:
> We've been doing that, but then on wakeup it would seem that we're at
> the mercy of scheduling: basically, whichever thread wakes up first
> gets to process the accept queue first.  This seems to bias towards
> threads running on the same CPU where the wakeup is called, so this
> method doesn't give us the even distribution of new connections across
> the threads that we'd like.

How would the presence of multiple TCP LISTEN endpoints change that? 
You'd then be at the mercy of whatever "scheduling" there was inside the 
stack.

If you want to balance the threads, perhaps use a dispatch thread, or a
virtual one: each thread knows how many connections it is servicing, so
let each one know how many the other threads are servicing, and if a
thread has N more connections than the other threads, have it not go
into accept() that time around.  It might need some tweaking to handle
pathological starvation cases, such as all the other threads being
hung, I suppose, but the basic idea is there.
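
[Editor's sketch] One possible reading of the "virtual dispatch thread"
rule above, written as a pure decision function. The thread leaves
"N more connections than the other threads" open; this sketch assumes
the comparison is against the least-loaded other thread, and the
skips/max_skips starvation guard is an invented placeholder for the
"tweaking" mentioned above.

```python
def should_accept(my_index, counts, threshold, skips=0, max_skips=3):
    """Decide whether thread my_index should call accept() this round.

    counts[i] is the number of connections thread i is servicing.
    Assumption: a thread sits out when it is at least `threshold`
    connections ahead of the least-loaded other thread.
    """
    others = [c for i, c in enumerate(counts) if i != my_index]
    if not others:
        return True  # a lone thread always accepts
    if skips >= max_skips:
        # Starvation guard: after sitting out max_skips rounds in a row,
        # accept anyway in case the other threads are hung.
        return True
    return counts[my_index] - min(others) < threshold
```

A worker would call this before each accept(), resetting its skip
counter whenever it does accept. Sharing the counts between threads
(for example in a lock-protected list) is left out of the sketch.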

rick jones

* Re: SO_REUSEPORT?
From: Stephen Hemminger @ 2008-08-07 19:03 UTC
  To: Rick Jones; +Cc: Tom Herbert, netdev

On Thu, 07 Aug 2008 11:17:55 -0700
Rick Jones <rick.jones2@hp.com> wrote:

> If you want to balance the threads, perhaps use a dispatch thread, or a
> virtual one: each thread knows how many connections it is servicing, so
> let each one know how many the other threads are servicing, and if a
> thread has N more connections than the other threads, have it not go
> into accept() that time around.  It might need some tweaking to handle
> pathological starvation cases, such as all the other threads being
> hung, I suppose, but the basic idea is there.
> 
> rick jones

I suspect thread balancing would actually hurt performance!
You would be better off having a couple of "hot" threads that do
all the work and stay in cache. If you push the work around to all the
threads, you get worst-case cache behaviour.

* Re: SO_REUSEPORT?
From: Tom Herbert @ 2008-08-07 19:43 UTC
  To: Stephen Hemminger; +Cc: Rick Jones, netdev

On Thu, Aug 7, 2008 at 12:03 PM, Stephen Hemminger
<stephen.hemminger@vyatta.com> wrote:
> I suspect thread balancing would actually hurt performance!
> You would be better off having a couple of "hot" threads that do
> all the work and stay in cache. If you push the work around to all the
> threads, you get worst-case cache behaviour.
>

I'm not sure that's applicable for us, since the server application and
networking will max out all the CPUs on the host anyway; one way or
another we need to dispatch the work of incoming connections to
threads on different CPUs.  If we do this in user space and do all
accepts in one thread, the CPU of that thread becomes the bottleneck
(we're accepting about 40,000 connections per second).  If we have
multiple accept threads running on different CPUs, this helps some,
but the load is spread unevenly across the CPUs and we still can't get
the highest connection rate.  So it seems we're looking for a method
that distributes the incoming connection load across CPUs pretty
evenly.

Tom

But we need to spread the load across multiple threads on different CPUs.

* Re: SO_REUSEPORT?
From: Rick Jones @ 2008-08-07 20:14 UTC
  To: Tom Herbert; +Cc: Stephen Hemminger, netdev

> I'm not sure that's applicable for us, since the server application and
> networking will max out all the CPUs on the host anyway; one way or
> another we need to dispatch the work of incoming connections to
> threads on different CPUs.  If we do this in user space and do all
> accepts in one thread, the CPU of that thread becomes the bottleneck
> (we're accepting about 40,000 connections per second).  If we have
> multiple accept threads running on different CPUs, this helps some,
> but the load is spread unevenly across the CPUs and we still can't get
> the highest connection rate.  So it seems we're looking for a method
> that distributes the incoming connection load across CPUs pretty
> evenly.

Well, if you _really_ want the load spread, you may need to use a
multiqueue interface (at least inbound, if not also later outbound),
"know" how the NIC will hash, and then have N distinct port numbers,
each assigned to a LISTEN endpoint.  The old song and dance about
making an N-CPU system look as much as possible like N single-CPU
systems and all that...

Unless there are NICs you can "tell" where to send the interrupts, which
IMO is preferable: I have a preference for the application/scheduler
telling "networking" where to work rather than networking (or the NIC)
telling the scheduler where to run a thread.  The archives of either
here or netnews will probably pull up stuff where I've talked about
Inbound Packet Scheduling (IPS) vs Thread Optimized Packet Scheduling
(TOPS) and the limitations of simplistic address hashing to pick a
queue/processor/whatnot :)

rick jones

* Re: SO_REUSEPORT?
From: Tom Herbert @ 2008-08-07 23:05 UTC
  To: Rick Jones; +Cc: Stephen Hemminger, netdev

> Well, if you _really_ want the load spread, you may need to use a
> multiqueue interface (at least inbound, if not also later outbound),
> "know" how the NIC will hash, and then have N distinct port numbers,
> each assigned to a LISTEN endpoint.  The old song and dance about
> making an N-CPU system look as much as possible like N single-CPU
> systems and all that...
>

Yep, that's what I really want, except that I can only use a single
port for the server.  All flows could be nicely distributed by the NIC
multiqueue, but I still have the problem of ensuring that the accepting
thread for a connection runs on the same CPU where the interrupt and
SYN processing happened.

> Unless there are NICs you can "tell" where to send the interrupts, which
> IMO is preferable: I have a preference for the application/scheduler
> telling "networking" where to work rather than networking (or the NIC)
> telling the scheduler where to run a thread.  The archives of either
> here or netnews will probably pull up stuff where I've talked about
> Inbound Packet Scheduling (IPS) vs Thread Optimized Packet Scheduling
> (TOPS) and the limitations of simplistic address hashing to pick a
> queue/processor/whatnot :)
>

NICs are already doing steering based on tuple hash (RSS), and I think
some will allow specifying the CPU for interrupt based on RX flow.
Maybe this would address the issues of Inbound Packet Scheduling?

Thanks for the pointers on IPS and TOPS.  Out of curiosity has there
been an effort to do TOPS on Linux?  We are doing something very
similar in software RSS with a fair amount of success (I posted
patches for this a while back).

Tom

* Re: SO_REUSEPORT?
From: Rick Jones @ 2008-08-07 23:28 UTC

Tom Herbert wrote:
> Yep, that's what I really want, except that I can only use a single
> port for the server.  All flows could be nicely distributed by the NIC
> multiqueue, but I still have the problem of ensuring that the accepting
> thread for a connection runs on the same CPU where the interrupt and
> SYN processing happened.

That is where needing to know/control the NIC's hashing comes into play.

> NICs are already doing steering based on tuple hash (RSS), and I think
> some will allow specifying the CPU for interrupt based on RX flow.
> Maybe this would address the issues of Inbound Packet Scheduling?

All IPS in HP-UX 10.20 did was hash the IP/port numbers and queue based
on that, at the handoff between driver and netisr.  The problem was
that if you had a thread of execution servicing more than one
connection, you would start whipsawing across the processors based on
the remote addressing.
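
[Editor's sketch] The whipsawing problem described above follows from
picking a processor purely from the connection's addressing. In the
sketch below, crc32 is only a stand-in hash (it is not what HP-UX or
any NIC uses), and the function names are hypothetical.

```python
import zlib


def pick_cpu(src_ip, src_port, dst_ip, dst_port, n_cpus):
    """Map a connection's addressing to a CPU index in [0, n_cpus)."""
    key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode("ascii")
    # Assumption: crc32 stands in for whatever hash the stack applies.
    return zlib.crc32(key) % n_cpus


def cpus_for_thread(remote_peers, local_ip, local_port, n_cpus):
    """Set of CPUs chosen for one thread that services several peers."""
    return {
        pick_cpu(ip, port, local_ip, local_port, n_cpus)
        for ip, port in remote_peers
    }
```

Because the choice depends only on the remote addressing, a single
thread with several connections can have its packets queued on several
different CPUs. cpus_for_thread makes that visible by returning every
CPU its connections hash to.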

There are IIRC indeed some NICs where you can give them a finite number 
of tuples and say where each tuple should go.  I'm sure those vendors if 
watching can speak up :)  That sort of functionality can be useful and
would address the limitations of ISS/plain NIC header address hashing. 
At least for long-lived connections.  Or perhaps even long-lived LISTEN 
endpoints :)
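
[Editor's sketch] The "finite number of tuples" idea above can be
modelled as a small table of explicit rules with a hash fallback.
Everything here is an assumption for illustration: the SteeringTable
class, its capacity limit and the crc32 fallback do not describe any
particular NIC.

```python
import zlib


class SteeringTable:
    """Toy model of a NIC that can pin a limited number of flows to CPUs."""

    def __init__(self, capacity, n_cpus):
        self.capacity = capacity  # how many explicit rules fit
        self.n_cpus = n_cpus
        self.rules = {}  # flow tuple -> CPU index

    def pin(self, flow, cpu):
        """Install or update a rule; return False if the table is full."""
        if not 0 <= cpu < self.n_cpus:
            raise ValueError("cpu index out of range")
        if flow not in self.rules and len(self.rules) >= self.capacity:
            return False
        self.rules[flow] = cpu
        return True

    def lookup(self, flow):
        """Use the explicit rule if present, otherwise fall back to hashing."""
        if flow in self.rules:
            return self.rules[flow]
        return zlib.crc32(repr(flow).encode("ascii")) % self.n_cpus
```

With a table like this, a long-lived connection (or a LISTEN endpoint)
can be pinned to the CPU where its thread runs, while unpinned flows
still get spread by the hash.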

While you say you are constrained to a single port number, are you 
similarly constrained to a single IP address?

> Thanks for the pointers on IPS and TOPS.  Out of curiosity has there
> been an effort to do TOPS on Linux?  We are doing something very
> similar in software RSS with a fair amount of success (I posted
> patches for this a while back).

I'm not sure.  Anything is possible.  The nice thing about TOPS in UX 
11.X was/is the lookup was essentially free and didn't involve things 
going across I/O busses.  Start to have to update those tuple mappings 
on the NIC with any frequency and that's the end of that.

rick
