* SO_REUSEPORT?
From: Tom Herbert @ 2008-08-07 16:57 UTC
To: netdev
Hello,
We are looking at ways to scale TCP listeners. I think what we'd like
is the ability to listen on a port from multiple threads (sockets bound
to the same port, INADDR_ANY, and no interface binding), which is what
SO_REUSEPORT would seem to allow. Has this ever been implemented for
Linux, or is there a good reason not to have it?
Thanks,
Tom
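A minimal sketch of the usage being asked about, as it works on BSD (and as Linux itself later gained, in kernel 3.9): each thread creates its own socket, sets SO_REUSEPORT before bind(), and listens on the shared port. The helper name is illustrative, not from the thread.

/* Sketch: one listener per thread, all bound to the same port.  On the
 * Linux of this era the option below is not available for this purpose;
 * this shows the BSD-style usage being asked about. */
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int make_listener(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int one = 1;
    /* Must be set on every socket before bind() for the shared bind to work. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 128) < 0) {
        close(fd);
        return -1;
    }
    return fd;  /* each thread then accept()s on its own fd */
}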
* Re: SO_REUSEPORT?
From: Rémi Denis-Courmont @ 2008-08-07 17:09 UTC
To: Tom Herbert; +Cc: netdev
On Thursday 7 August 2008 at 19:57:15, Tom Herbert wrote:
> Hello,
>
> We are looking at ways to scale TCP listeners. I think what we'd like
> is the ability to listen on a port from multiple threads (sockets bound
> to the same port, INADDR_ANY, and no interface binding), which is what
> SO_REUSEPORT would seem to allow. Has this ever been implemented for
> Linux, or is there a good reason not to have it?
On Linux, SO_REUSEADDR provides most of what SO_REUSEPORT provides on BSD.
In any case, there is absolutely no point in creating multiple TCP listeners.
Multiple threads can accept() on the same listener - at the same time.
--
Rémi Denis-Courmont
http://www.remlab.net/
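A minimal sketch of the shared-listener pattern described above: one listening socket, with several threads blocking in accept() on it at the same time. The handler call is a placeholder.

/* Sketch: N worker threads all blocking in accept() on one shared listener. */
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

static void *accept_loop(void *arg)
{
    int listen_fd = *(int *)arg;
    for (;;) {
        int conn = accept(listen_fd, NULL, NULL);
        if (conn < 0)
            continue;
        /* handle_connection(conn);  -- application-specific */
        close(conn);
    }
    return NULL;
}

static void start_workers(int *listen_fd, int nthreads)
{
    for (int i = 0; i < nthreads; i++) {
        pthread_t t;
        pthread_create(&t, NULL, accept_loop, listen_fd);
        pthread_detach(t);
    }
}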
* Re: SO_REUSEPORT?
From: Tom Herbert @ 2008-08-07 17:58 UTC
To: netdev
> > We are looking at ways to scale TCP listeners. I think what we'd like
> > is the ability to listen on a port from multiple threads (sockets bound
> > to the same port, INADDR_ANY, and no interface binding), which is what
> > SO_REUSEPORT would seem to allow. Has this ever been implemented for
> > Linux, or is there a good reason not to have it?
>
> On Linux, SO_REUSEADDR provides most of what SO_REUSEPORT provides on BSD.
>
> In any case, there is absolutely no point in creating multiple TCP listeners.
> Multiple threads can accept() on the same listener - at the same time.
>
We've been doing that, but then on wakeup it would seem that we're at
the mercy of scheduling-- basically whichever thread wakes up first
gets to process the accept queue first. This seems to bias towards
threads running on the same CPU where the wakeup is issued, so this
method doesn't give us the even distribution of new connections across
threads that we'd like.
Tom
* Re: SO_REUSEPORT?
From: Rick Jones @ 2008-08-07 18:17 UTC
To: Tom Herbert; +Cc: netdev
Tom Herbert wrote:
>>>We are looking at ways to scale TCP listeners. I think what we'd like
>>>is the ability to listen on a port from multiple threads (sockets bound
>>>to the same port, INADDR_ANY, and no interface binding), which is what
>>>SO_REUSEPORT would seem to allow. Has this ever been implemented for
>>>Linux, or is there a good reason not to have it?
>>
>>On Linux, SO_REUSEADDR provides most of what SO_REUSEPORT provides on BSD.
>>
>>In any case, there is absolutely no point in creating multiple TCP listeners.
>>Multiple threads can accept() on the same listener - at the same time.
>>
>
>
> We've been doing that, but then on wakeup it would seem that we're at
> the mercy of scheduling-- basically whichever thread wakes up first
> gets to process the accept queue first. This seems to bias towards
> threads running on the same CPU where the wakeup is issued, so this
> method doesn't give us the even distribution of new connections across
> threads that we'd like.
How would the presence of multiple TCP LISTEN endpoints change that?
You'd then be at the mercy of whatever "scheduling" there was inside the
stack.
If you want to balance the threads, perhaps use a dispatch thread, or a
virtual one: each thread knows how many connections it is servicing,
lets the other threads know that count, and if it has N more connections
than the others it skips accept() that time around. That might need
some tweaking to handle pathological starvation cases, like all the
other threads being hung, but the basic idea is there.
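A rough sketch of that skip-accept idea, assuming shared per-thread connection counters; the threshold and names are illustrative, and real code would need atomic or locked updates of the counters:

/* Sketch: a thread that already carries SLACK more connections than the
 * lightest-loaded thread sits out one round instead of calling accept(). */
#include <limits.h>

#define MAX_THREADS 16
#define SLACK 4                               /* the "N" described above */

static int conns_per_thread[MAX_THREADS];     /* maintained by each thread */

static int should_accept(int self, int nthreads)
{
    int min = INT_MAX;
    for (int i = 0; i < nthreads; i++)
        if (conns_per_thread[i] < min)
            min = conns_per_thread[i];
    return conns_per_thread[self] - min < SLACK;
}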
rick jones
* Re: SO_REUSEPORT?
From: Stephen Hemminger @ 2008-08-07 19:03 UTC
To: Rick Jones; +Cc: Tom Herbert, netdev
On Thu, 07 Aug 2008 11:17:55 -0700
Rick Jones <rick.jones2@hp.com> wrote:
> Tom Herbert wrote:
> >>>We are looking at ways to scale TCP listeners. I think what we'd like
> >>>is the ability to listen on a port from multiple threads (sockets bound
> >>>to the same port, INADDR_ANY, and no interface binding), which is what
> >>>SO_REUSEPORT would seem to allow. Has this ever been implemented for
> >>>Linux, or is there a good reason not to have it?
> >>
> >>On Linux, SO_REUSEADDR provides most of what SO_REUSEPORT provides on BSD.
> >>
> >>In any case, there is absolutely no point in creating multiple TCP listeners.
> >>Multiple threads can accept() on the same listener - at the same time.
> >>
> >
> >
> > We've been doing that, but then on wakeup it would seem that we're at
> > the mercy of scheduling-- basically whichever thread wakes up first
> > gets to process the accept queue first. This seems to bias towards
> > threads running on the same CPU where the wakeup is issued, so this
> > method doesn't give us the even distribution of new connections across
> > threads that we'd like.
>
> How would the presence of multiple TCP LISTEN endpoints change that?
> You'd then be at the mercy of whatever "scheduling" there was inside the
> stack.
>
> If you want to balance the threads, perhaps use a dispatch thread, or a
> virtual one: each thread knows how many connections it is servicing,
> lets the other threads know that count, and if it has N more connections
> than the others it skips accept() that time around. That might need
> some tweaking to handle pathological starvation cases, like all the
> other threads being hung, but the basic idea is there.
>
> rick jones
I suspect thread balancing would actually hurt performance!
You would be better off to have a couple of "hot" threads that are doing
all the work and stay in cache. If you push the work around to all the
threads, you have worst case cache behaviour.
* Re: SO_REUSEPORT?
From: Tom Herbert @ 2008-08-07 19:43 UTC
To: Stephen Hemminger; +Cc: Rick Jones, netdev
On Thu, Aug 7, 2008 at 12:03 PM, Stephen Hemminger
<stephen.hemminger@vyatta.com> wrote:
> On Thu, 07 Aug 2008 11:17:55 -0700
> Rick Jones <rick.jones2@hp.com> wrote:
>
>> Tom Herbert wrote:
>> >>>We are looking at ways to scale TCP listeners. I think what we'd like
>> >>>is the ability to listen on a port from multiple threads (sockets bound
>> >>>to the same port, INADDR_ANY, and no interface binding), which is what
>> >>>SO_REUSEPORT would seem to allow. Has this ever been implemented for
>> >>>Linux, or is there a good reason not to have it?
>> >>
>> >>On Linux, SO_REUSEADDR provides most of what SO_REUSEPORT provides on BSD.
>> >>
>> >>In any case, there is absolutely no point in creating multiple TCP listeners.
>> >>Multiple threads can accept() on the same listener - at the same time.
>> >>
>> >
>> >
>> > We've been doing that, but then on wakeup it would seem that we're at
>> > the mercy of scheduling-- basically whichever thread wakes up first
>> > gets to process the accept queue first. This seems to bias towards
>> > threads running on the same CPU where the wakeup is issued, so this
>> > method doesn't give us the even distribution of new connections across
>> > threads that we'd like.
>>
>> How would the presence of multiple TCP LISTEN endpoints change that?
>> You'd then be at the mercy of whatever "scheduling" there was inside the
>> stack.
>>
>> If you want to balance the threads, perhaps use a dispatch thread, or a
>> virtual one: each thread knows how many connections it is servicing,
>> lets the other threads know that count, and if it has N more connections
>> than the others it skips accept() that time around. That might need
>> some tweaking to handle pathological starvation cases, like all the
>> other threads being hung, but the basic idea is there.
>>
>> rick jones
>
> I suspect thread balancing would actually hurt performance!
> You would be better off to have a couple of "hot" threads that are doing
> all the work and stay in cache. If you push the work around to all the
> threads, you have worst case cache behaviour.
>
I'm not sure that's applicable for us since the server application and
networking will max out all the CPUs on the host anyway; one way or
another we need to dispatch the work of incoming connections to
threads on different CPUs. If we do this in user space and do all
accepts in one thread, the CPU of that thread becomes the bottleneck
(we're accepting about 40,000 connections per second). If we have
multiple accept threads running on different CPUs, this helps some,
but the load is spread unevenly across the CPUs and we still can't get
the highest connection rate. So it seems we're looking for a method
that distributes the incoming connection load across CPUs pretty
evenly.
Tom
* Re: SO_REUSEPORT?
From: Rick Jones @ 2008-08-07 20:14 UTC
To: Tom Herbert; +Cc: Stephen Hemminger, netdev
> I'm not sure that's applicable for us since the server application and
> networking will max out all the CPUs on the host anyway; one way or
> another we need to dispatch the work of incoming connections to
> threads on different CPUs. If we do this in user space and do all
> accepts in one thread, the CPU of that thread becomes the bottleneck
> (we're accepting about 40,000 connections per second). If we have
> multiple accept threads running on different CPUs, this helps some,
> but the load is spread unevenly across the CPUs and we still can't get
> the highest connection rate. So it seems we're looking for a method
> that distributes the incoming connection load across CPUs pretty
> evenly.
Well, if you _really_ want the load spread, you may need to use a
multiqueue (at least inbound if not also later outbound) interface,
"know" how the NIC will hash and then have N distinct port numbers each
assigned to a LISTEN endpoint. The old song and dance about making an N
CPU system look as much like N single-CPU systems and all that...
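A rough sketch of that layout, assuming the NIC can be arranged so that traffic for port base_port + i lands on CPU i: one listener and one accept thread per CPU, with each thread pinned to its CPU. The names and the affinity call (glibc's pthread_setaffinity_np) are illustrative.

/* Sketch: per-CPU accept threads, each on its own port and pinned to its CPU. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <sys/socket.h>

struct worker {
    int cpu;            /* CPU this thread is pinned to */
    int listen_fd;      /* listener bound to base_port + cpu */
};

static void *per_cpu_accept(void *arg)
{
    struct worker *w = arg;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(w->cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (;;) {
        int conn = accept(w->listen_fd, NULL, NULL);
        if (conn >= 0) {
            /* handle_connection(conn);  -- application-specific */
        }
    }
    return NULL;
}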
Unless there are NICs you can "tell" where to send the interrupts, which
IMO is preferable - I have a preference for the application/scheduler
telling "networking" where to work rather than networking (or the NIC)
telling the scheduler where to run a thread - the archives of either
here or netnews will probably pull up stuff where I've talked about
Inbound Packet Scheduling (IPS) vs Thread Optimized Packet Scheduling
(TOPS) and limitations of simplistic address hashing to pick a
queue/processor/whatnot :)
rick jones
* Re: SO_REUSEPORT?
From: Tom Herbert @ 2008-08-07 23:05 UTC
To: Rick Jones; +Cc: Stephen Hemminger, netdev
> Well, if you _really_ want the load spread, you may need to use a multiqueue
> (at least inbound if not also later outbound) interface, "know" how the NIC
> will hash and then have N distinct port numbers each assigned to a LISTEN
> endpoint. The old song and dance about making an N CPU system look as much
> like N single-CPU systems and all that...
>
Yep, that's what I really want, except for the fact that I can only use
a single port for the server-- all flows could be nicely distributed
by the NIC multiqueue, but I still have the problem of how to ensure
that the accepting thread for a connection runs on the same CPU where
the interrupt and SYN processing happened.
> Unless there are NICs you can "tell" where to send the interrupts, which IMO
> is preferable - I have a preference for the application/scheduler telling
> "networking" where to work rather than networking (or the NIC) telling the
> scheduler where to run a thread - the archives of either here or netnews
> will probably pull up stuff where I've talked about Inbound Packet Scheduling
> (IPS) vs Thread Optimized Packet Scheduling (TOPS) and limitations of
> simplistic address hashing to pick a queue/processor/whatnot :)
>
NICs are already doing steering based on tuple hash (RSS), and I think
some will allow specifying the CPU for interrupt based on RX flow.
Maybe this would address the issues of Inbound Packet Scheduling?
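On Linux, steering a queue's interrupt to a chosen CPU can be done by writing a CPU bitmask to /proc/irq/<irq>/smp_affinity; a minimal sketch, where the irq number and mask are placeholders:

/* Sketch: pin a NIC RX queue's interrupt to a chosen CPU. */
#include <stdio.h>

static int set_irq_affinity(int irq, unsigned int cpu_mask)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%x\n", cpu_mask);   /* hex CPU bitmask, e.g. 0x4 for CPU 2 */
    fclose(f);
    return 0;
}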
Thanks for the pointers on IPS and TOPS. Out of curiosity has there
been an effort to do TOPS on Linux? We are doing something very
similar in software RSS with a fair amount of success (I posted
patches for this a while back).
Tom
* Re: SO_REUSEPORT?
From: Rick Jones @ 2008-08-07 23:28 UTC
To: Tom Herbert; +Cc: Stephen Hemminger, netdev
Tom Herbert wrote:
>>Well, if you _really_ want the load spread, you may need to use a multiqueue
>>(at least inbound if not also later outbound) interface, "know" how the NIC
>>will hash and then have N distinct port numbers each assigned to a LISTEN
>>endpoint. The old song and dance about making an N CPU system look as much
>>like N single-CPU systems and all that...
>>
>
>
> Yep, that's what I really want, except for the fact that I can only use
> a single port for the server-- all flows could be nicely distributed
> by the NIC multiqueue, but I still have the problem of how to ensure
> that the accepting thread for a connection runs on the same CPU where
> the interrupt and SYN processing happened.
That is where needing to know/control the NIC's hashing comes into play.
> NICs are already doing steering based on tuple hash (RSS), and I think
> some will allow specifying the CPU for interrupt based on RX flow.
> Maybe this would address the issues of Inbound Packet Scheduling?
All IPS in HP-UX 10.20 did was hash the IP/port numbers and queue based
on that, at the handoff between driver and netisr. The problem was that
if you had a thread of execution servicing more than one connection,
you would start whipsawing across the processors based on the remote
addressing.
There are IIRC indeed some NICs where you can give them a finite number
of tuples and say where each tuple should go. I'm sure those vendors, if
watching, can speak up :) That sort of functionality can be useful and
would address the limitations of IPS/plain NIC header address hashing.
At least for long-lived connections. Or perhaps even long-lived LISTEN
endpoints :)
While you say you are constrained to a single port number, are you
similarly constrained to a single IP address?
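If multiple local addresses are available, the same spreading trick works without extra ports: one listener per local IP, all on the same port, so the NIC's address hash differs per listener. A sketch, with placeholder addresses:

/* Sketch: one listener per distinct local IP address, same port everywhere. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int listener_on_addr(const char *ip, unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);   /* e.g. "192.0.2.10" */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 128) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}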
> Thanks for the pointers on IPS and TOPS. Out of curiosity has there
> been an effort to do TOPS on Linux? We are doing something very
> similar in software RSS with a fair amount of success (I posted
> patches for this a while back).
I'm not sure. Anything is possible. The nice thing about TOPS in HP-UX
11.X was/is that the lookup was essentially free and didn't involve
things going across I/O busses. Start having to update those tuple
mappings on the NIC with any frequency and that's the end of that.
rick