From: Changli Gao
Subject: Re: [PATCH] rfs: Receive Flow Steering
Date: Thu, 8 Apr 2010 09:37:28 +0800
To: Rick Jones
Cc: Tom Herbert, Eric Dumazet, davem@davemloft.net, netdev@vger.kernel.org
Message-ID:
In-Reply-To: <4BB6367D.9090600@hp.com>
References: <1270193393.1936.52.camel@edumazet-laptop> <4BB622F6.10606@hp.com> <4BB6367D.9090600@hp.com>

On Sat, Apr 3, 2010 at 2:25 AM, Rick Jones wrote:
> Tom Herbert wrote:
>>    The progression in HP-UX was IPS (10.20) (aka RPS) then TOPS (11.0)
>>    (aka RFS). We found that IPS was great for
>>    single-flow-per-thread-of-execution stuff and that TOPS was better
>>    for multiple-flow-per-thread-of-execution stuff.  It was long enough
>>    ago now that I can safely say for one system-level benchmark not
>>    known to be a "networking" benchmark, and without a massive kernel
>>    component, TOPS was a 10% win.  Not too shabby.
>>
>>    It wasn't that IPS wasn't good in its context - just that TOPS was
>>    even better.
>>
>> I would assume that with IPS threads would migrate to where packets
>> were being delivered, thus giving the same sort of locality TOPS was
>> providing?  That would work great without any other constraints
>> (multiple flows per thread, thread CPU bindings, etc.).
>
> Well... that depended - at the time, and still, we were and are also
> encouraging users and app designers to make copious use of
> processor/locality affinity (SMP and NUMA going back far longer in the
> RISC et al space than the x86 space).  So, it was and is entirely
> possible that the application thread of execution is hard-bound to a
> specific core/locality.  Also, I do not recall if HP-UX was as
> aggressive about waking a process/thread on the processor from which
> the wake-up came vs on the processor on which it last ran.
>

Maybe RPS should work against the process, not the processor. For packet
forwarding, the process is the net_rx softirq. The toy sketch below
illustrates the difference between the two policies.
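Just to make the distinction concrete, here is a toy user-space
illustration (not the kernel code; the function names, the allowed-CPU
mask and the "last ran" input are all made up for the example).
RPS/IPS-style steering picks a CPU purely from the flow hash and a fixed
mask, while RFS/TOPS-style steering prefers the CPU where the consuming
thread last ran:

/*
 * Toy user-space illustration only, NOT the kernel implementation:
 * function names and inputs are invented for the example.
 */
#include <stdio.h>
#include <stdint.h>

/* RPS/IPS style: pick a CPU purely from the flow hash and a fixed mask. */
static int rps_pick_cpu(uint32_t flow_hash, const int *allowed, int nr_allowed)
{
	return allowed[flow_hash % nr_allowed];
}

/*
 * RFS/TOPS style: prefer the CPU on which the consuming thread last ran,
 * and fall back to the hash-based choice when that is unknown.
 */
static int rfs_pick_cpu(uint32_t flow_hash, int last_ran_cpu,
			const int *allowed, int nr_allowed)
{
	if (last_ran_cpu >= 0)
		return last_ran_cpu;
	return rps_pick_cpu(flow_hash, allowed, nr_allowed);
}

int main(void)
{
	int allowed[] = { 0, 1, 2, 3 };
	uint32_t hash = 0xdeadbeef;	/* pretend rxhash of one flow */

	printf("RPS steers to CPU %d\n", rps_pick_cpu(hash, allowed, 4));
	printf("RFS steers to CPU %d (where the thread last ran)\n",
	       rfs_pick_cpu(hash, 6, allowed, 4));
	return 0;
}

The first choice stays fixed for the life of the flow; the second follows
the thread wherever the scheduler puts it.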
>>    We also preferred the concept of the scheduler giving networking
>>    clues as to where to process an application's packets rather than
>>    networking trying to tell the scheduler.  There was some discussion
>>    of out of order worries, but we were willing to trust to the basic
>>    soundness of the scheduler - if it was moving threads around willy
>>    nilly at a rate able to cause big packet reordering it had
>>    fundamental problems that would have to be addressed anyway.
>>
>> I also think scheduler leading networking, like in RPS, is generally
>> more scalable.  As for OOO packets, I've spent way too much time trying
>> to convince the bean-counters that a small number of them aren't
>> problematic :-), in the end it's just easier to not introduce new
>> mechanisms that will cause them!
>
> So long as it doesn't drive you to produce new mechanisms heavier than
> they would have otherwise been.
>
> The irony in the case of HP-UX IPS was that it was put in place in
> response to the severe out of order packet problems in HP-UX in 10.X
> before 10.20 - there were multiple netisr processes and only one netisr
> queue.  The other little tweak that came along in 10.20 with IPS was
> that, in addition to having a per processor (well, per core in today's
> parlance) netisr queue, the netisr would grab the entire queue under the
> one spinlock and work off of that.  That was nice because the code path
> became more efficient under load - more packets processed per
> spinlock/unlock pair.
>

RPS dispatches packets fairly among all the permitted CPUs in order to
take full advantage of the available CPU power. The assumption is that
every CPU can give the same number of cycles to packet processing, but
that isn't always true once the scheduler is mixed in. In that case,
letting the scheduler lead networking is the better choice.

Maybe we should make softirq threaded and put it under the control of
the scheduler, with the number of softirq threads configurable by the
user. By default the number of softirq threads would equal the number of
CPUs, and each thread would be bound to a specific CPU, to keep the
current behavior. If the other tasks aren't dispatched evenly among the
CPUs, the system administrator could increase the number of softirq
threads and drop the CPU binding, so that there are enough schedulable
softirq threads for the scheduler to work with. Oh, maybe then there is
no need for weighted packet dispatching in RPS at all. A user-space
sketch of this knob follows after my signature.

--
Regards,
Changli Gao(xiaosuo@gmail.com)
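P.S. A rough user-space analogue of that knob (pthreads, not kernel
code; the thread-count and bind arguments are just illustrative
assumptions): with the defaults you get one worker pinned per CPU, which
mimics today's per-CPU softirq, while a larger count with binding
disabled leaves placement entirely to the scheduler.

/*
 * User-space analogue of the idea above (pthreads, NOT kernel code).
 * The knobs are assumptions for illustration: argv[1] = number of
 * worker threads (default: number of online CPUs), argv[2] = bind
 * flag (default 1, one worker pinned per CPU; 0 = unbound workers).
 *
 * Build: gcc -O2 -pthread sketch.c
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *worker(void *arg)
{
	long id = (long)arg;

	/* A real softirq thread would pull packets off a queue here. */
	printf("worker %ld running on CPU %d\n", id, sched_getcpu());
	return NULL;
}

int main(int argc, char **argv)
{
	long nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
	long nr_threads = (argc > 1) ? atol(argv[1]) : nr_cpus;
	int bind = (argc > 2) ? atoi(argv[2]) : 1;
	pthread_t *tids;
	long i;

	if (nr_threads <= 0)
		nr_threads = nr_cpus;
	tids = calloc(nr_threads, sizeof(*tids));

	for (i = 0; i < nr_threads; i++) {
		pthread_attr_t attr;

		pthread_attr_init(&attr);
		if (bind) {
			cpu_set_t set;

			/* Pin worker i to CPU i (wrapping if threads > CPUs). */
			CPU_ZERO(&set);
			CPU_SET(i % nr_cpus, &set);
			pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
		}
		pthread_create(&tids[i], &attr, worker, (void *)i);
		pthread_attr_destroy(&attr);
	}
	for (i = 0; i < nr_threads; i++)
		pthread_join(tids[i], NULL);
	free(tids);
	return 0;
}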