* Proposed linux kernel changes : scaling tcp/ip stack
@ 2010-06-03 8:16 Mitchell Erblich
2010-06-03 9:14 ` Eric Dumazet
0 siblings, 1 reply; 9+ messages in thread
From: Mitchell Erblich @ 2010-06-03 8:16 UTC (permalink / raw)
To: netdev
To whom it may concern,
First, my intent is to keep this discussion local to just a few TCP/IP
developers, to see if there is any consensus that the approach below is
sound. Please also forward this email to the owner(s) of this stack, if
any, to determine whether a case exists for the possible changes below.
I am not currently on the Linux kernel mailing list.
I have experience modifying the Linux TCP/IP stack; in the past I merged
such changes into a company's local tree and left any upstream integration
to others.
I have been approached by a number of companies about scaling the stack
under the assumption of many CPU cores. At present I have some extra time
on my hands and am considering looking into this area on my own.
The first assumption is that, if extra cores are available, a single
received homogeneous flow with a large number of packets/segments per
second (pps) can be split into unequal sub-flows. This split can in effect
allow a higher received pps rate at the same per-core load, while moving
other work, such as transmitting pure ACKs, onto other cores.
Simply put, keeping Amdahl's law in mind (and not trying to equalize the
load between cores), the idea is to create logical separation points so
that, on a many-core system, different cores run new kernel threads that
operate in parallel within the TCP/IP stack. The initial separation points
would be at the IP/TCP layer boundary and wherever a received sk/packet
generates some form of output.
The IP/TCP layers would be split in the style of the vintage AT&T STREAMS
framework, so some form of queuing and scheduling would be needed. In
addition, queuing and scheduling of other kernel threads would occur
within IP and TCP to separate out the I/O.
A possible validation test is to measure the maximum received pps rate
through the TCP/IP modules for a normal flow in TCP ESTABLISHED state,
with in-order, say 64-byte, non-fragmented segments, before and after each
incremental change; or the same rate achieved with fewer core/CPU cycles.
I am willing to maintain a private Linux.org git tree that collects the
proposed changes and, if there is willingness and a perceived need, to
work out how to merge them.
Mitchell Erblich
UNIX Kernel Engineer
* Re: Proposed linux kernel changes : scaling tcp/ip stack
2010-06-03 8:16 Proposed linux kernel changes : scaling tcp/ip stack Mitchell Erblich
@ 2010-06-03 9:14 ` Eric Dumazet
2010-06-16 3:11 ` Mitchell Erblich
0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2010-06-03 9:14 UTC (permalink / raw)
To: Mitchell Erblich; +Cc: netdev
On Thursday, 3 June 2010 at 01:16 -0700, Mitchell Erblich wrote:
> [Mitchell's original message quoted in full; trimmed]
Hi Mitchell
We work every day to improve the network stack, and the standard Linux
tree is already pretty scalable; you don't need to set up a separate git
tree for that.
Our beloved maintainer David S. Miller handles two trees, net-2.6 and
net-next-2.6, where we put all our changes.
http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
I suggest you read the recent patches (say... about 10,000 of them) to get
an idea of what we have done over the last few years.
Keywords: RCU, multiqueue, RPS, per-cpu data, lockless algorithms, cache
line placement...
It's nice to see another man joining the team!
Thanks
* Re: Proposed linux kernel changes : scaling tcp/ip stack
2010-06-03 9:14 ` Eric Dumazet
@ 2010-06-16 3:11 ` Mitchell Erblich
2010-06-16 3:30 ` Proposed linux kernel changes : scaling tcp/ip stack : 2nd part Mitchell Erblich
2010-06-16 9:10 ` Proposed linux kernel changes : scaling tcp/ip stack Andi Kleen
0 siblings, 2 replies; 9+ messages in thread
From: Mitchell Erblich @ 2010-06-16 3:11 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
On Jun 3, 2010, at 2:14 AM, Eric Dumazet wrote:
> [Eric's reply, including the original proposal, quoted in full; trimmed]
Let's start with a two-part Linux kernel change and a TCP input/output change:
2 parts (the 2nd part is TBD).
Summary: don't use the last free pages for TCP ACK sk_buff allocations
made with GFP_ATOMIC. This is a one-line change in tcp_output.c using a
new gfp.h flag, plus a change in the generic kernel (TBD).
This change should have no effect when kernel memory allocations succeed
normally.
Under memory pressure (i.e. waiting for clean memory), we should be
reserving our last pages for input sk_buffs, not for transmit allocations.
By delaying ACK sk_buff allocations when kernel memory is low, we also
slow the TCP flow as a side effect: in slow start (SS) the result is
nearly a delayed ACK (DELACK), while in congestion avoidance (CA) it
should/could reduce the number of in-flight ACKs, and the peer should
perform burst avoidance if our later ACK opens the window in a larger
chunk.
The last pages are then kept available to reduce the chance of dropping an
input packet or running out of receive descriptors under memory
back-pressure.
The change could check for some form of memory pressure before the
allocation, but letting the allocation itself fail should suffice. We
could also do an ECN-type check before the allocation.
Now the kicker: I want either a GFP_KERNEL that does not sleep, or a
GFP_ATOMIC that does NOT use the emergency pools and thus CAN fail, so
that there are no other secondary effects and only the one argument
changes.
code : net/ipv4/tcp_output.c : tcp_send_ack()
line : buff = alloc_skb(MAX_TCP_HEADER, GFP_KERNEL_NSLEEP); /* proposed no-sleep flag */
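To make the intent concrete, here is a minimal sketch of how the proposed
flag would sit in tcp_send_ack(). GFP_KERNEL_NSLEEP is hypothetical (it
does not exist in gfp.h); the assumed semantics are "atomic, no sleep, but
no access to the emergency reserves, so the allocation may fail". The
failure path shown is, from memory, roughly what the 2.6.3x code already
does, and it is exactly the ACK-delaying behaviour described above.

	/* Sketch only: GFP_KERNEL_NSLEEP is a proposed flag, not an existing one. */
	buff = alloc_skb(MAX_TCP_HEADER, GFP_KERNEL_NSLEEP);
	if (buff == NULL) {
		/* Could not allocate the pure ACK: fall back to (re)arming
		 * the delayed-ACK timer instead of dipping into reserves.
		 */
		inet_csk_schedule_ack(sk);
		inet_csk(sk)->icsk_ack.ato = TCP_ATO_MIN;
		inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
					  TCP_DELACK_MAX, TCP_RTO_MAX);
		return;
	}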
Suggestions, feedback??
Mitchell Erblich
* Re: Proposed linux kernel changes : scaling tcp/ip stack : 2nd part
2010-06-16 3:11 ` Mitchell Erblich
@ 2010-06-16 3:30 ` Mitchell Erblich
2010-06-16 6:09 ` Proposed linux kernel changes : scaling tcp/ip stack : 3rd part Mitchell Erblich
2010-06-16 9:10 ` Proposed linux kernel changes : scaling tcp/ip stack Andi Kleen
1 sibling, 1 reply; 9+ messages in thread
From: Mitchell Erblich @ 2010-06-16 3:30 UTC (permalink / raw)
To: Mitchell Erblich; +Cc: Eric Dumazet, netdev
On Jun 15, 2010, at 8:11 PM, Mitchell Erblich wrote:
> [earlier messages quoted in full; trimmed]
Sorry :),
2nd part: rather than a new flag, use the existing GFP_NOWAIT as the 2nd
argument to alloc_skb().
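A minimal sketch of the resulting one-line change, assuming the same
tcp_send_ack() call site as in the first part (GFP_NOWAIT does exist; in
the 2.6.3x gfp.h it is GFP_ATOMIC with __GFP_HIGH masked off, so it stays
atomic but will not raid the emergency reserves):

	/* net/ipv4/tcp_output.c, tcp_send_ack() -- proposed change */
	buff = alloc_skb(MAX_TCP_HEADER, GFP_NOWAIT);	/* was GFP_ATOMIC */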
Mitchell Erblich
* Re: Proposed linux kernel changes : scaling tcp/ip stack : 3rd part
2010-06-16 3:30 ` Proposed linux kernel changes : scaling tcp/ip stack : 2nd part Mitchell Erblich
@ 2010-06-16 6:09 ` Mitchell Erblich
2010-06-16 6:37 ` Eric Dumazet
0 siblings, 1 reply; 9+ messages in thread
From: Mitchell Erblich @ 2010-06-16 6:09 UTC (permalink / raw)
To: Mitchell Erblich; +Cc: Eric Dumazet, netdev
On Jun 15, 2010, at 8:30 PM, Mitchell Erblich wrote:
> [earlier messages quoted in full; trimmed]
Going in the same direction,
If tcp_out_of_resources() fires and the number of orphaned sockets is
above the configured limit (perhaps because of a DoS attack), SHOULD we
consume our last available resources? Doing so most likely affects the
sk_buffs we are not resetting, because NOW it is the receive-side socket
allocations that are failing.
Thus:
file tcp_timer.c : tcp_out_of_resources()
suggested change to the 2nd arg (currently GFP_ATOMIC): tcp_send_active_reset(sk, GFP_NOWAIT);
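For context, a sketch of the branch in question, paraphrased from memory
of the 2.6.3x net/ipv4/tcp_timer.c rather than copied verbatim; only the
gfp argument of the reset would change:

	/* tcp_out_of_resources(), "too many orphans" branch (paraphrased) */
	if (tcp_too_many_orphans(sk, orphans)) {
		if (net_ratelimit())
			printk(KERN_INFO "Out of socket memory\n");
		if (do_reset)
			tcp_send_active_reset(sk, GFP_NOWAIT);	/* was GFP_ATOMIC */
		tcp_done(sk);
		return 1;
	}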
Please note that even if we believe GFP_ATOMIC gives a higher probability
of getting the TCP packet/segment onto the wire, that still gives no
guarantee that the peer will receive or process it.
We COULD also do some form of ECN in this function to inform the peer that
our system is in distress, if tcp_send_active_reset() did not return void
and instead reported a memory allocation failure with GFP_NOWAIT.
Since that ECN would benefit our own node/system, the ECN send COULD be
argued to deserve a higher priority and therefore a GFP_ATOMIC memory
argument.
Suggestions, opinions...
Mitchell Erblich
* Re: Proposed linux kernel changes : scaling tcp/ip stack : 3rd part
2010-06-16 6:09 ` Proposed linux kernel changes : scaling tcp/ip stack : 3rd part Mitchell Erblich
@ 2010-06-16 6:37 ` Eric Dumazet
2010-06-16 7:46 ` Mitchell Erblich
0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2010-06-16 6:37 UTC (permalink / raw)
To: Mitchell Erblich; +Cc: netdev
On Tuesday, 15 June 2010 at 23:09 -0700, Mitchell Erblich wrote:
> [earlier messages quoted in full; trimmed]
1) ACKs are about the smallest chunks that are ever allocated in the
network stack.
2) Their lifetime is close to 0 us. They are not cloned (queued on a
socket queue), only handed to the device for transmit. Unless you play
with traffic shaping and insane queue lengths, ACKs should not use more
than 0.0001% of your RAM.
3) Under attack, adding complex algorithms to try to resist only slightly
delays the point where nothing can be done to stop the attack, clever or
not. Dropping packets is perfectly fine.
4) Maybe all the work you are thinking about is really the balance between
ATOMIC and non-ATOMIC (GFP_KERNEL) memory allocations? Some tuning, maybe?
The input path always uses ATOMIC allocations, since it runs from softirq
context and cannot wait.
* Re: Proposed linux kernel changes : scaling tcp/ip stack : 3rd part
2010-06-16 6:37 ` Eric Dumazet
@ 2010-06-16 7:46 ` Mitchell Erblich
0 siblings, 0 replies; 9+ messages in thread
From: Mitchell Erblich @ 2010-06-16 7:46 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
On Jun 15, 2010, at 11:37 PM, Eric Dumazet wrote:
> [earlier messages quoted in full; trimmed]
I am not suggesting pure GFP_KERNEL (sleeping) allocations on either side.
I suggested a no-sleep variant and then looked and found GFP_NOWAIT.
However, under memory pressure it might make sense to delay or slow the
opening of the send and receive TCP windows, to give memory pages time to
be cleaned and reused and so relieve the pressure.
If queues already exist, they are the first delay points to be reviewed
and used, though only under abnormal circumstances. Some short-lived flows
may also have expired by then, freeing resources that can be reused.
Eric & group,
I am starting simple, with a different look at what COULD/SHOULD be done.
The question I am asking is: if GFP_NOWAIT WOULD fail while GFP_ATOMIC
would succeed, then only the last few percent of kernel memory is left to
allocate from. Do we really want to spend that last few percent of memory
on ACKs/transmits?
Maybe, if we fail a few non-essential / delayable allocations, enough time
passes for pages to be cleaned and we avoid executing any OOM-like code.
The effect of generating a short DELACK SHOULD reduce memory pressure from
the peer within one RTT. Also, failing and taking the no-buffer code path
MAY ALSO slow the ramp-up that is driven by the TCP ACK clock.
The number of ACTIVE TCP flows (received at the NIC through to transmitted
at the NIC) may account for a non-negligible number of later allocations,
so slowing them should decrease memory pressure.
If I understand the difference between GFP_NOWAIT and GFP_ATOMIC, neither
sleeps; only GFP_NOWAIT refrains from grabbing the last available pages.
So both are atomic/no-sleep.
To conclude part 4: tcp_timer.c has two additional calls to
tcp_send_active_reset(); use GFP_NOWAIT there instead of GFP_ATOMIC.
Now, regarding your statement that the "input path always uses atomic":
SHOULD it? Say, under a DoS attack? Why not reject some packets when there
is no memory, via GFP_NOWAIT?
If a new flow is being started while we are under memory pressure, should
it not ALSO use GFP_NOWAIT? And if we ALSO use GFP_NOWAIT on ESTABLISHED
flows and (shoot me) drop a segment/packet, that should drop them to half
bandwidth, which is in effect TCP fairness. Thus, delaying ACKs is much
preferred.
Again, the NEW suggested changes only minimally affect flows when there is
memory pressure and we are about to do more aggressive things such as
resetting sockets/flows.
Since my suggested changes are NOT in the form of a patch, someone ELSE
needs to agree with me, and the maintainer must then be satisfied that the
changes do no harm and slowly move the code in the right direction.
Mitchell Erblich
* Re: Proposed linux kernel changes : scaling tcp/ip stack
2010-06-16 3:11 ` Mitchell Erblich
2010-06-16 3:30 ` Proposed linux kernel changes : scaling tcp/ip stack : 2nd part Mitchell Erblich
@ 2010-06-16 9:10 ` Andi Kleen
2010-06-16 19:39 ` Mitchell Erblich
1 sibling, 1 reply; 9+ messages in thread
From: Andi Kleen @ 2010-06-16 9:10 UTC (permalink / raw)
To: Mitchell Erblich; +Cc: Eric Dumazet, netdev
Mitchell Erblich <erblichs@earthlink.net> writes:
>
> Summary: Don't use last free pages for TCP ACKs with GFP_ATOMIC for our
> sk buf allocs. 1 line change in tcp_output.c with a new gfp.h arg, and a change
> in the generic kernel. TBD.
>
> This change should have no effect with normal available kernel mem allocs.
>
> Assuming memory pressure ( WAITING for clean memory) we should be allocating
> our last pages for input skbufs and not for xmit allocs.
How about you instrument a kernel and measure whether this really happens
frequently under reasonable loads? You can probably use the existing drop
counters in netstat that Stephen added some time ago.
Since softirqs cannot really wait, an exhausted GFP_ATOMIC would normally
lead to dropped packets. FWIW, I am not aware of any serious
dropped-packet problem under normal loads.
Running a kernel with nearly zero free memory is dangerous anyway --
pretty much any kernel service can fail arbitrarily -- and if this
happened frequently I suspect we would need a generic VM solution for it.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: Proposed linux kernel changes : scaling tcp/ip stack
2010-06-16 9:10 ` Proposed linux kernel changes : scaling tcp/ip stack Andi Kleen
@ 2010-06-16 19:39 ` Mitchell Erblich
0 siblings, 0 replies; 9+ messages in thread
From: Mitchell Erblich @ 2010-06-16 19:39 UTC (permalink / raw)
To: Andi Kleen; +Cc: Eric Dumazet, netdev
On Jun 16, 2010, at 2:10 AM, Andi Kleen wrote:
> [Andi's reply quoted in full; trimmed]
Andi Kleen and group,
I actually did instrument memory, years ago, in an older Linux kernel on a
multi-core system/server. I also made the OOM killer a last resort,
disabled via a /proc value when it was not needed. Those changes were for
a now defunct Linux OS company that built a high-end Linux NAS server.
In general, an increasingly large percentage of memory becomes cached and
fragmented over time, so buddy-allocator requests tend to fail if the
memory is held continuously, and over time smaller and smaller page-order
allocations fail.
The instrumentation was there so a condition could be reproduced, to
verify that the changes were mostly transparent and added only minimal
load while the system was experiencing a lull.
A problem found was that Linux tracks free pages and not dirty pages.
However, I am starting small and simply asking: can we agree that
GFP_NOWAIT is atomic, but just doesn't grab the last pages?
#define GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH)
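For reference, a sketch of the relevant gfp.h definitions as I recall them
from the 2.6.3x kernels (worth double-checking against the tree you
target); __GFP_HIGH is the bit that grants access to the emergency
reserves:

	/* include/linux/gfp.h (2.6.3x era, from memory) */
	#define GFP_ATOMIC	(__GFP_HIGH)
	#define GFP_NOWAIT	(GFP_ATOMIC & ~__GFP_HIGH)	/* atomic, but no reserves */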
Thus, in the general case, an atomic kernel memory consumer SHOULD use
GFP_NOWAIT, and GFP_ATOMIC should be reserved as a safety valve for paths
that clean or free kernel memory.
Later, I will suggest changes to clean kernel memory while I/O is low, so
that if memory later needs to be freed it can be freed quickly.
Later still, a /proc percentage variable could reserve a fraction of
memory that is marked/saved/set aside for rotating high-order page
allocations, for consumers that need them after the system has been up for
weeks or months. This work was initially done at another UNIX company,
based on an internal paper.
Mitchell Erblich