From: Eric Dumazet
Subject: Re: [net-next-2.6 PATCH][be2net] remove napi in the tx path and do tx completion processing in interrupt context
Date: Wed, 20 May 2009 11:25:44 +0200
Message-ID: <4A13CC98.60506@cosmosbay.com>
In-Reply-To: <20090519.151334.87370735.davem@davemloft.net>
To: David Miller
Cc: ajitk@serverengines.com, netdev@vger.kernel.org

David Miller wrote:
> From: Ajit Khaparde
> Date: Tue, 19 May 2009 17:40:58 +0530
>
>> This patch will remove napi in the tx path and do Tx completion
>> processing in interrupt context. This makes Tx completion
>> processing simpler without loss of performance.
>>
>> Signed-off-by: Ajit Khaparde
>
> This is different from how every other NAPI driver does this.
>
> You should have a single NAPI context that handles both TX and RX
> processing. Except that for TX processing, no work budget
> adjustments are made. You simply unconditionally process all pending
> TX work without accounting it into the POLL call budget.
>
> I have no idea why this driver tried to split the RX and TX
> work like this; it accomplishes nothing but add overhead.
> Simply add the TX completion code to the RX poll handler
> and that's all you need to do. Also, make sure to run TX
> polling work before RX polling work; this makes fresh SKBs
> available for responses generated by RX packet processing.
>
> I bet this is why you really saw performance problems, rather than
> something to do with running it directly in interrupt context. There
> should be zero gain from that if you do the TX poll work properly in
> the RX poll handler. When you free TX packets in hardware interrupt
> context using dev_kfree_skb_any(), that just schedules a software
> interrupt to do the actual SKB free, which adds just more overhead for
> TX processing work. You aren't avoiding software IRQ work by doing TX
> processing in the hardware interrupt handler; in fact you are
> theoretically doing more.
>
> So the only conclusion I can come to is that what is important is
> doing the TX completion work before the RX packets get processed in
> the NAPI poll handler, and you accomplish that more efficiently and
> more properly by simply moving the TX completion work to the top of
> the RX poll handler code.

Thanks David for this analysis.

I would like to point out a scalability problem we currently have with
non-multiqueue devices and multi-core hosts under the scheme you
described/advocated. (This has nothing to do with the be2net patch;
please forgive me for jumping in.)

When a lot of network traffic is handled by one device, we enter a
ksoftirqd/NAPI mode, where one cpu is almost dedicated to handling both
TX completions and RX completions, while the other cpus run application
code (and some parts of the TCP/UDP stack). That is really expensive
because of the many cache line ping-pongs that occur.
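
For reference, the single-context scheme David describes (TX completion
work drained at the top of the poll handler, not charged to the budget,
then RX processed against the budget) might look roughly like the sketch
below. The mydrv_* names and the adapter structure are made up for
illustration; this is not be2net code and is untested.

#include <linux/netdevice.h>

/* Hypothetical per-device state; only the field used here is shown. */
struct mydrv_adapter {
	struct napi_struct napi;
	/* rings, registers, stats, ... */
};

/* Hypothetical helpers, implemented elsewhere in the driver. */
void mydrv_clean_tx_ring(struct mydrv_adapter *adapter);
int mydrv_clean_rx_ring(struct mydrv_adapter *adapter, int budget);
void mydrv_enable_intr(struct mydrv_adapter *adapter);

/* Single NAPI context handling both TX and RX completions. */
static int mydrv_poll(struct napi_struct *napi, int budget)
{
	struct mydrv_adapter *adapter =
		container_of(napi, struct mydrv_adapter, napi);
	int rx_done;

	/*
	 * TX completion work first, so freshly freed skbs are available
	 * for responses generated by RX processing.  All pending TX work
	 * is drained unconditionally and is not charged to the budget.
	 */
	mydrv_clean_tx_ring(adapter);

	/* RX work is bounded by the NAPI budget. */
	rx_done = mydrv_clean_rx_ring(adapter, budget);

	if (rx_done < budget) {
		/* All RX work done: exit polling and re-enable the irq. */
		napi_complete(napi);
		mydrv_enable_intr(adapter);
	}

	return rx_done;
}

Note that with this shape, all the skb freeing still runs on the cpu
doing the poll loop, which is exactly the cache line ping-pong issue
described above.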

In that situation, it would make sense to transfer most of the TX
completion work to the other cpus (the cpus that queue the xmits,
actually): skb freeing of course, and sock_wfree() callbacks...

So maybe some NIC device drivers could let their ndo_start_xmit() do
some cleanup work for previously sent skbs. If done correctly, we could
lower the number of cache line ping-pongs.

This would give some breathing room to the cpu that would then only take
care of RX completions, and probably give better throughput. Some
machines out there want to transmit a lot of frames while receiving few.

There is also a minor latency problem with the current scheme: taking
care of TX completion takes some time and delays RX handling, increasing
latencies for incoming traffic.
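
To make that ndo_start_xmit() idea concrete, here is a minimal, untested
sketch that reclaims a small batch of already-completed TX skbs on the
transmitting cpu before queueing the new one. The mydrv_* helpers, the
MYDRV_TX_CLEAN_BATCH bound and the mydrv_adapter structure (reused from
the sketch above) are hypothetical, and the sketch assumes TX
completions are consumed only from this path, serialized by the normal
xmit lock, rather than from the NAPI handler.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

#define MYDRV_TX_CLEAN_BATCH	16	/* arbitrary bound per xmit call */

/* Hypothetical helpers, implemented elsewhere in the driver. */
struct sk_buff *mydrv_get_completed_tx_skb(struct mydrv_adapter *adapter);
bool mydrv_tx_ring_full(const struct mydrv_adapter *adapter);
void mydrv_queue_skb(struct mydrv_adapter *adapter, struct sk_buff *skb);
void mydrv_ring_doorbell(struct mydrv_adapter *adapter);

static netdev_tx_t mydrv_start_xmit(struct sk_buff *skb,
				    struct net_device *netdev)
{
	struct mydrv_adapter *adapter = netdev_priv(netdev);
	struct sk_buff *done_skb;
	int reclaimed = 0;

	/*
	 * Opportunistic cleanup on the transmitting cpu: free a few skbs
	 * whose transmission the hardware has already completed, so that
	 * skb freeing and sock_wfree() run here rather than on the cpu
	 * doing the NAPI poll.  Bounded, to keep xmit latency predictable.
	 */
	while (reclaimed < MYDRV_TX_CLEAN_BATCH &&
	       (done_skb = mydrv_get_completed_tx_skb(adapter)) != NULL) {
		dev_kfree_skb(done_skb);	/* BH context: plain free is fine */
		reclaimed++;
	}

	if (mydrv_tx_ring_full(adapter)) {
		netif_stop_queue(netdev);
		return NETDEV_TX_BUSY;
	}

	mydrv_queue_skb(adapter, skb);	/* map buffers, fill descriptors */
	mydrv_ring_doorbell(adapter);	/* notify the NIC of new work */

	return NETDEV_TX_OK;
}

The bound keeps the extra cost per xmit small and predictable; the
interesting property is that the skb destructors, and thus sock_wfree(),
now run on the cpu that queued the packet, so the socket write-space
accounting no longer bounces cache lines to the cpu running the RX poll
loop.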