From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mitchell Erblich
Subject: Re: Proposed linux kernel changes : scaling tcp/ip stack : 3rd part
Date: Tue, 15 Jun 2010 23:09:49 -0700
Message-ID:
References: <1275556440.2456.19.camel@edumazet-laptop> <97746864-ED54-4A12-AFE7-752AA6E41CDD@earthlink.net>
Mime-Version: 1.0 (Apple Message framework v1078)
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Eric Dumazet, netdev@vger.kernel.org
To: Mitchell Erblich
Return-path:
Received: from elasmtp-masked.atl.sa.earthlink.net ([209.86.89.68]:38708 "EHLO elasmtp-masked.atl.sa.earthlink.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753418Ab0FPGJy convert rfc822-to-8bit (ORCPT ); Wed, 16 Jun 2010 02:09:54 -0400
In-Reply-To:
Sender: netdev-owner@vger.kernel.org
List-ID:

On Jun 15, 2010, at 8:30 PM, Mitchell Erblich wrote:

>
> On Jun 15, 2010, at 8:11 PM, Mitchell Erblich wrote:
>
>>
>> On Jun 3, 2010, at 2:14 AM, Eric Dumazet wrote:
>>
>>> On Thursday, 03 June 2010 at 01:16 -0700, Mitchell Erblich wrote:
>>>> To whom it may concern,
>>>>
>>>> First, my assumption is to keep this discussion local to just a few tcp/ip
>>>> developers to see if there is any consensus that the below is a logical
>>>> approach. Please also pass this email on if there is an "owner(s)" of this stack
>>>> to identify if a case exists for the below possible changes.
>>>>
>>>> I am not currently on the linux kernel mail group.
>>>>
>>>> I have experience with modifications of the Linux tcp/ip stack, and have
>>>> merged the changes into the company's local tree and left the possible
>>>> global integration to others.
>>>>
>>>> I have been approached by a number of companies about scaling the
>>>> stack with the assumption of a number of cpu cores. At present, I find extra
>>>> time on my hands and am considering looking into this area on my own.
>>>>
>>>> The first assumption is that if extra cores are available, a single
>>>> received homogeneous flow of a large number of packets/segments per
>>>> second (pps) can be split into non-equal flows. This split can in effect
>>>> allow a larger recv'd pps rate at the same core load while splitting off
>>>> other workloads, such as xmit'ing pure ACKs.
>>>>
>>>> Simply, again assuming Amdahl's law (and not looking to equalize the load
>>>> between cores), we create logical separations so that, in a many core
>>>> system, different cores could have new kernel threads that operate in
>>>> parallel within the tcp/ip stack. The initial separation points would be at
>>>> the ip/tcp layer boundary and where any recv'd sk/pkt would generate some
>>>> form of output.
>>>>
>>>> The ip/tcp layer would be split like the vintage AT&T STREAMS protocol,
>>>> and some form of queuing & scheduling would be needed. In addition,
>>>> the queuing/scheduling of other kernel threads would occur within ip & tcp
>>>> to separate the I/O.
>>>>
>>>> A possible validation test is to identify the max recv'd pps rate within the
>>>> tcp/ip modules within normal flow TCP established state with normal order
>>>> of say 64byte non fragmented segments, before and after each
>>>> incremental change. Or the same rate with fewer core/cpu cycles.
>>>>
>>>> I am willing to have a private git Linux.org tree that concentrates proposed
>>>> changes into this tree and, if there is willingness and a seen want/need, then identify
>>>> how to implement the merge.
>>>
>>> Hi Mitchell
>>>
>>> We work every day to improve the network stack, and the standard linux tree is
>>> pretty scalable, you don't need to set up a separate git tree for that.
>>>
>>> Our beloved maintainer David S. Miller handles two trees, net-2.6 and
>>> net-next-2.6 where we put all our changes.
>>>
>>> http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git
>>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
>>>
>>> I suggest you read the last patches (say .. about 10,000 of them), to
>>> have an idea of things we did during the last years.
>>>
>>> keywords : RCU, multiqueue, RPS, percpu data, lockless algos, cacheline
>>> placement...
>>>
>>> It's nice to see another man joining the team !
>>>
>>> Thanks
>>>
>>
>>
>> Let's start with a two part Linux kernel change and a tcp input/output change:
>>
>> 2 Parts: 2nd part TBD
>>
>> Summary: Don't use last free pages for TCP ACKs with GFP_ATOMIC for our
>> sk buf allocs. 1 line change in tcp_output.c with a new gfp.h arg, and a change
>> in the generic kernel. TBD.
>>
>> This change should have no effect with normal available kernel mem allocs.
>>
>> Assuming memory pressure (WAITING for clean memory) we should be allocating
>> our last pages for input skbufs and not for xmit allocs.
>>
>> By delaying skbuf allocations when we have low kmem, we secondarily slow down the
>> tcp flow : if in slow start (SS) we are almost doing a DELACK, else CA should/could
>> decrease the number of in-flight ACKs and the peer should do burst avoidance
>> if our later ack increases the window in a larger chunk.
>>
>> And use the last pages to decrease the chance of dropping an input pkt or
>> running out of recv descriptors, because of mem back pressure.
>>
>> The change could check for some form of mem pressure before the alloc,
>> but the alloc in itself should suffice. We could also do an ECN type check before
>> the alloc.
>>
>> Now the kicker. I want a GFP_KERNEL with NO_SLEEP OR a GFP_ATOMIC that does
>> NOT use emergency pools, thus CAN FAIL, to have 0 other secondary effects
>> and change just the 1 arg.
>>
>> code : tcp_output.c : tcp_send_ack()
>> line : buff = alloc_skb(MAX_TCP_HDR, GFP_KERNEL_NSLEEP);    /* with a NO SLEEP */
>>
>> Suggestions, feedback??
>>
>> Mitchell Erblich
>>
>
> Sorry :),
>
> 2nd part:
>
> use GFP_NOWAIT as 2nd arg to alloc_skb()
>
> Mitchell Erblich

Going in the same direction,

If we are in tcp_out_of_resources() and the number of orphaned sockets is above a
configured number (maybe because of a DoS attack), SHOULD we consume our last
available resources and most likely affect skbufs for connections that we aren't
resetting, because NOW the recv sk allocs are failing?

thus,
file tcp_timer.c : tcp_out_of_resources()
suggestion: change the 2nd arg from GFP_ATOMIC: tcp_send_active_reset(sk, GFP_NOWAIT);

Please note that even if we believed that GFP_ATOMIC would have a higher
probability of sending a TCP pkt/seg, that gives us no guarantee that the peer
will recv it or will process it.

We COULD also do some form of ECN in this function to inform the peer that our
system is in distress, if tcp_send_active_reset() did not return void and informed
us of a mem alloc failure with the GFP_NOWAIT.

Since the ECN would benefit our node/system, this ECN sending event COULD
be argued to have a higher priority, and so a mem argument of GFP_ATOMIC.

Suggestions, opinions...

Mitchell Erblich
>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
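
The memory argument in this message can be sketched as a small user-space toy model. This is not kernel code, and nothing below is an existing kernel API: toy_gfp, toy_pool, toy_alloc() and rx_buffers_left() are hypothetical names made up for illustration. The model assumes a single reserve watermark, that a GFP_ATOMIC style request may dip below it while a GFP_NOWAIT style request may not, and that receive buffers are always allocated in the GFP_ATOMIC style. The real page allocator is considerably more involved.

```c
/* User-space toy model only. Nothing here is kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's gfp flags. */
enum toy_gfp {
    TOY_GFP_NOWAIT,  /* never sleeps, must stay above the reserve */
    TOY_GFP_ATOMIC   /* never sleeps, may dip into the reserve */
};

struct toy_pool {
    int free_pages;  /* pages currently free */
    int reserve;     /* emergency reserve watermark (assumed single level) */
};

/* Try to take one page. Returns false instead of sleeping. */
bool toy_alloc(struct toy_pool *pool, enum toy_gfp flags)
{
    assert(pool != NULL);
    int floor = (flags == TOY_GFP_ATOMIC) ? 0 : pool->reserve;

    if (pool->free_pages <= floor)
        return false;
    pool->free_pages--;
    return true;
}

/*
 * Attempt `acks` transmit allocations with `ack_flags`, then count how many
 * receive buffers (always TOY_GFP_ATOMIC in this model) can still be had.
 */
int rx_buffers_left(int free_pages, int reserve, int acks,
                    enum toy_gfp ack_flags)
{
    struct toy_pool pool = { free_pages, reserve };
    int rx = 0;

    for (int i = 0; i < acks; i++)
        (void)toy_alloc(&pool, ack_flags);  /* a failed ACK alloc is skipped */
    while (toy_alloc(&pool, TOY_GFP_ATOMIC))
        rx++;
    return rx;
}
```

Under these assumptions, with 4 free pages, a reserve of 2 and 4 pending ACKs, rx_buffers_left(4, 2, 4, TOY_GFP_ATOMIC) returns 0 while rx_buffers_left(4, 2, 4, TOY_GFP_NOWAIT) returns 2. With 10 free pages both calls return 6, which matches the claim that the change has no effect without memory pressure.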