From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Miller
Subject: Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
Date: Wed, 10 Oct 2007 02:25:50 -0700 (PDT)
Message-ID: <20071010.022550.21928751.davem@davemloft.net>
References: <20071010003716.GB552@one.firstfloor.org>
	<20071009.175025.59469417.davem@davemloft.net>
	<20071010091644.GA9807@one.firstfloor.org>
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Cc: hadi@cyberus.ca, shemminger@linux-foundation.org, jeff@garzik.org,
	johnpol@2ka.mipt.ru, herbert@gondor.apana.org.au, gaagaan@gmail.com,
	Robert.Olsson@data.slu.se, netdev@vger.kernel.org, rdreier@cisco.com,
	peter.p.waskiewicz.jr@intel.com, mcarlson@broadcom.com,
	jagana@us.ibm.com, general@lists.openfabrics.org, mchan@broadcom.com,
	tgraf@suug.ch, randy.dunlap@oracle.com, sri@us.ibm.com, kaber@trash.net
To: andi@firstfloor.org
Return-path:
Received: from 74-93-104-97-Washington.hfc.comcastbusiness.net
	([74.93.104.97]:49052 "EHLO sunset.davemloft.net"
	rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP
	id S1752137AbXJJJZs (ORCPT ); Wed, 10 Oct 2007 05:25:48 -0400
In-Reply-To: <20071010091644.GA9807@one.firstfloor.org>
Sender: netdev-owner@vger.kernel.org
List-Id: netdev.vger.kernel.org

From: Andi Kleen
Date: Wed, 10 Oct 2007 11:16:44 +0200

> > A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you
>
> With TSO really?

Yes.

> > increase the size much more performance starts to go down due to L2
> > cache thrashing.
>
> Another possibility would be to consider using cache avoidance
> instructions while updating the TX ring (e.g. write combining
> on x86)

The chip I was working with at the time (UltraSPARC-IIi) compressed
all the linear stores into 64-byte full cacheline transactions via
the store buffer.

It's true that it would allocate in the L2 cache on a miss, which is
different from your suggestion.
In fact, such a thing might not pan out well: most of the time you
write only one or two descriptors, and that isn't a full cacheline,
which means a read/modify/write is the only coherent way to get such
a write out to RAM.

Sure, you could batch, but I'd rather give the chip work to do unless
I unequivocally knew I had enough pending to fill a cacheline's worth
of descriptors.  And since you suggest we shouldn't queue in
software... :-)