From: Eric Dumazet
Subject: Re: [net-next-2.6 PATCH][be2net] remove napi in the tx path and do tx completion processing in interrupt context
Date: Wed, 20 May 2009 11:25:44 +0200
Message-ID: <4A13CC98.60506@cosmosbay.com>
In-Reply-To: <20090519.151334.87370735.davem@davemloft.net>
To: David Miller
Cc: ajitk@serverengines.com, netdev@vger.kernel.org

David Miller wrote:
> From: Ajit Khaparde
> Date: Tue, 19 May 2009 17:40:58 +0530
>
>> This patch will remove napi in the tx path and do Tx completion
>> processing in interrupt context. This makes Tx completion
>> processing simpler without loss of performance.
>>
>> Signed-off-by: Ajit Khaparde
>
> This is different from how every other NAPI driver does this.
>
> You should have a single NAPI context that handles both TX and RX
> processing. Except that for TX processing, no work budget
> adjustments are made. You simply unconditionally process all pending
> TX work without accounting it into the POLL call budget.
>
> I have no idea why this driver tried to split the RX and TX
> work like this; it accomplishes nothing but add overhead.
> Simply add the TX completion code to the RX poll handler
> and that's all you need to do. Also, make sure to run TX
> polling work before RX polling work; this makes fresh SKBs
> available for responses generated by RX packet processing.
>
> I bet this is why you really saw performance problems, rather than
> something to do with running it directly in interrupt context. There
> should be zero gain from that if you do the TX poll work properly in
> the RX poll handler. When you free TX packets in hardware interrupt
> context using dev_kfree_skb_any(), that just schedules a software
> interrupt to do the actual SKB free, which adds just more overhead for
> TX processing work. You aren't avoiding software IRQ work by doing TX
> processing in the hardware interrupt handler; in fact you are
> theoretically doing more.
>
> So the only conclusion I can come to is that what is important is
> doing the TX completion work before the RX packets get processed in
> the NAPI poll handler, and you accomplish that more efficiently and
> more properly by simply moving the TX completion work to the top of
> the RX poll handler code.

Thanks David for this analysis.

I would like to point out a scalability problem we currently have with
non-multiqueue devices and multi-core hosts under the scheme you
described/advocated. (This has nothing to do with the be2net patch;
please forgive me for jumping in.)

When a lot of network traffic is handled by one device, we enter a
ksoftirqd/NAPI mode, where one cpu is almost dedicated to handling both
TX completions and RX completions, while the other cpus run application
code (and some parts of the TCP/UDP stack). That is really expensive
because of the many cache line ping-pongs that occur.
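
For reference, the single-context scheme David describes (TX completion
work drained at the top of the poll handler, not charged to the budget,
then RX processed against the budget) might look roughly like the sketch
below. The mydrv_* names and the adapter structure are made up for
illustration; this is not be2net code and is untested.

#include <linux/netdevice.h>

/* Hypothetical per-device state; only the field used here is shown. */
struct mydrv_adapter {
	struct napi_struct napi;
	/* rings, registers, stats, ... */
};

/* Hypothetical helpers, implemented elsewhere in the driver. */
void mydrv_clean_tx_ring(struct mydrv_adapter *adapter);
int mydrv_clean_rx_ring(struct mydrv_adapter *adapter, int budget);
void mydrv_enable_intr(struct mydrv_adapter *adapter);

/* Single NAPI context handling both TX and RX completions. */
static int mydrv_poll(struct napi_struct *napi, int budget)
{
	struct mydrv_adapter *adapter =
		container_of(napi, struct mydrv_adapter, napi);
	int rx_done;

	/*
	 * TX completion work first, so freshly freed skbs are available
	 * for responses generated by RX processing.  All pending TX work
	 * is drained unconditionally and is not charged to the budget.
	 */
	mydrv_clean_tx_ring(adapter);

	/* RX work is bounded by the NAPI budget. */
	rx_done = mydrv_clean_rx_ring(adapter, budget);

	if (rx_done < budget) {
		/* All RX work done: exit polling and re-enable the irq. */
		napi_complete(napi);
		mydrv_enable_intr(adapter);
	}

	return rx_done;
}

Note that with this shape, all the skb freeing still runs on the cpu
doing the poll loop, which is exactly the cache line ping-pong issue
described above.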

In that situation, it would make sense to transfer most of the TX
completion work to the other cpus (the cpus that queue the xmits,
actually): skb freeing of course, and sock_wfree() callbacks...

So maybe some NIC device drivers could let their ndo_start_xmit() do
some cleanup work for previously sent skbs. If done correctly, we could
lower the number of cache line ping-pongs.

This would give some breathing room to the cpu that would then only take
care of RX completions, and probably give better throughput. Some
machines out there want to transmit a lot of frames while receiving few.

There is also a minor latency problem with the current scheme: taking
care of TX completion takes some time and delays RX handling, increasing
latencies for incoming traffic.
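
To make that ndo_start_xmit() idea concrete, here is a minimal, untested
sketch that reclaims a small batch of already-completed TX skbs on the
transmitting cpu before queueing the new one. The mydrv_* helpers, the
MYDRV_TX_CLEAN_BATCH bound and the mydrv_adapter structure (reused from
the sketch above) are hypothetical, and the sketch assumes TX
completions are consumed only from this path, serialized by the normal
xmit lock, rather than from the NAPI handler.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

#define MYDRV_TX_CLEAN_BATCH	16	/* arbitrary bound per xmit call */

/* Hypothetical helpers, implemented elsewhere in the driver. */
struct sk_buff *mydrv_get_completed_tx_skb(struct mydrv_adapter *adapter);
bool mydrv_tx_ring_full(const struct mydrv_adapter *adapter);
void mydrv_queue_skb(struct mydrv_adapter *adapter, struct sk_buff *skb);
void mydrv_ring_doorbell(struct mydrv_adapter *adapter);

static netdev_tx_t mydrv_start_xmit(struct sk_buff *skb,
				    struct net_device *netdev)
{
	struct mydrv_adapter *adapter = netdev_priv(netdev);
	struct sk_buff *done_skb;
	int reclaimed = 0;

	/*
	 * Opportunistic cleanup on the transmitting cpu: free a few skbs
	 * whose transmission the hardware has already completed, so that
	 * skb freeing and sock_wfree() run here rather than on the cpu
	 * doing the NAPI poll.  Bounded, to keep xmit latency predictable.
	 */
	while (reclaimed < MYDRV_TX_CLEAN_BATCH &&
	       (done_skb = mydrv_get_completed_tx_skb(adapter)) != NULL) {
		dev_kfree_skb(done_skb);	/* BH context: plain free is fine */
		reclaimed++;
	}

	if (mydrv_tx_ring_full(adapter)) {
		netif_stop_queue(netdev);
		return NETDEV_TX_BUSY;
	}

	mydrv_queue_skb(adapter, skb);	/* map buffers, fill descriptors */
	mydrv_ring_doorbell(adapter);	/* notify the NIC of new work */

	return NETDEV_TX_OK;
}

The bound keeps the extra cost per xmit small and predictable; the
interesting property is that the skb destructors, and thus sock_wfree(),
now run on the cpu that queued the packet, so the socket write-space
accounting no longer bounces cache lines to the cpu running the RX poll
loop.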