From mboxrd@z Thu Jan 1 00:00:00 1970
From: Neil Horman
Subject: Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
Date: Fri, 1 Nov 2013 22:07:13 -0400
Message-ID: <20131102020713.GA16290@hmsreliant.think-freely.org>
References: <20131030110214.GA10220@localhost.localdomain> <52710B09.6090302@redhat.com> <20131031183003.GC25894@hmsreliant.think-freely.org> <1383320566.1737.0.camel@bwh-desktop.uk.level5networks.com> <20131101160802.GB8467@hmsreliant.think-freely.org> <20131101173701.GC8467@hmsreliant.think-freely.org> <1383335129.3042.10.camel@joe-AO722> <20131101195850.GD8467@hmsreliant.think-freely.org> <1383337612.3042.21.camel@joe-AO722>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: David Laight , Ben Hutchings , Doug Ledford , Ingo Molnar , Eric Dumazet , linux-kernel@vger.kernel.org, netdev@vger.kernel.org
To: Joe Perches
Return-path:
Received: from charlotte.tuxdriver.com ([70.61.120.58]:59031 "EHLO smtp.tuxdriver.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752103Ab3KBCHc (ORCPT ); Fri, 1 Nov 2013 22:07:32 -0400
Content-Disposition: inline
In-Reply-To: <1383337612.3042.21.camel@joe-AO722>
Sender: netdev-owner@vger.kernel.org
List-ID:

On Fri, Nov 01, 2013 at 01:26:52PM -0700, Joe Perches wrote:
> On Fri, 2013-11-01 at 15:58 -0400, Neil Horman wrote:
> > On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote:
> > > On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
> > >
> > > > I think it would be better if we just did the prefetch here
> > > > and re-addressed this area when AVX (or adcx/adox) instructions were available
> > > > for testing on hardware.
> > >
> > > Could there be a difference if only a single software
> > > prefetch was done at the beginning of transfer before
> > > the while loop and hardware prefetches did the rest?
> > >
> > I wouldn't think so.  If hardware was going to do any prefetching based on
> > memory access patterns it will do so regardless of the leading prefetch, and
> > that first prefetch isn't helpful because we still wind up stalling on the adds
> > while it's completing
>
> I imagine one benefit to be helping prevent
> prefetching beyond the actual data required.
>
> Maybe some hardware optimizes prefetch stride
> better than 5*64.
>
> I wonder also if using
>
> 	if (count > some_length)
> 		prefetch
> 	while (...)
>
> helps small lengths more than the test/jump cost.
>
We've already done this, and it is in fact the best performing variant.  I'll
be posting that patch, along with Ingo's request to add do_csum to the perf
bench code, when I have that done.

Best
Neil
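
For readers following the thread, here is a rough userspace sketch of the "test the length, prefetch once, then loop" shape Joe describes above. It is not the do_csum patch. The function name csum_sketch and the cutoff PREFETCH_MIN_LEN are made up for illustration (the thread only says "some_length"), the 5*64 stride is the figure mentioned above, and __builtin_prefetch is the GCC/Clang builtin standing in for the kernel's prefetch().

```c
#include <stddef.h>
#include <stdint.h>

/* Stride mentioned in the thread (5 cache lines of 64 bytes). */
#define PREFETCH_STRIDE  (5 * 64)
/* Hypothetical cutoff; the thread only calls it "some_length". */
#define PREFETCH_MIN_LEN 512

/*
 * Illustrative 16-bit ones' complement sum over a byte buffer.
 * Words are read little-endian; a trailing odd byte is added as-is.
 */
uint16_t csum_sketch(const uint8_t *buf, size_t len)
{
    uint64_t sum = 0;
    size_t i = 0;

    /* Single leading prefetch, skipped for short buffers so that
     * small lengths only pay for the compare and branch. */
    if (len > PREFETCH_MIN_LEN)
        __builtin_prefetch(buf + PREFETCH_STRIDE, 0, 3);

    /* Main loop: accumulate 16-bit words into a wide sum. */
    while (i + 1 < len) {
        sum += (uint16_t)(buf[i] | (buf[i + 1] << 8));
        i += 2;
    }
    if (i < len)
        sum += buf[i];

    /* Fold the carries back in (end-around carry). */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)sum;
}
```

Whether the guard pays off depends on the cutoff and the hardware, which is the kind of thing the perf bench addition mentioned above would measure; nothing in this sketch says anything about actual timings.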