From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 28 Oct 2016 15:29:29 +1100
From: Dave Chinner
Subject: Re: [rfc] larger batches for crc32c
Message-ID: <20161028042929.GH22126@dastard>
References: <20161028031747.68472ac7@roar.ozlabs.ibm.com>
 <20161027214244.GO14023@dastard>
 <20161028131234.24a5cb6f@roar.ozlabs.ibm.com>
In-Reply-To: <20161028131234.24a5cb6f@roar.ozlabs.ibm.com>
List-Id: xfs
To: Nicholas Piggin
Cc: linux-xfs@vger.kernel.org, Christoph Hellwig, Dave Chinner,
 "Darrick J. Wong"

On Fri, Oct 28, 2016 at 01:12:34PM +1100, Nicholas Piggin wrote:
> On Fri, 28 Oct 2016 08:42:44 +1100
> Dave Chinner wrote:
>
> > > As a rule, it helps the crc implementation if it can operate on as
> > > large a chunk as possible (alignment, startup overhead, etc). So I
> > > did a quick hack at getting XFS checksumming to feed crc32c() with
> > > larger chunks, by setting the existing crc to 0 before running over
> > > the entire buffer. Together with some small work on the powerpc crc
> > > implementation, crc drops below 0.1%.
> >
> > I wouldn't have expected reducing the number of calls and small
> > alignment changes to make that amount of difference given the amount
> > of data we are actually checksumming. How much of that difference was
> > due to the improved CRC implementation?
>
> Sorry, I should have been clearer about what was happening. Not enough
> sleep. The larger sizes allow the vectorized crc implementation to kick
> in.

Ah, OK. So it never gets out of the slow, branchy lead-in/lead-out code
for the smaller chunks. Fair enough. For the verify side it probably
doesn't matter that much - the latency of the initial memory fetches on
the data to be verified is likely to be the dominant factor for
performance...

> > FWIW, can you provide some additional context by grabbing the log
> > stats that tell us the load on the log that is generating this
> > profile? A sample over a minute of a typical workload (with a
> > corresponding CPU profile) would probably be sufficient. You can get
> > them simply by zeroing the xfs stats via
> > /proc/sys/fs/xfs/stats_clear at the start of the sample period and
> > then dumping /proc/fs/xfs/stat at the end.
>
> Yeah, I'll get some better information for you.
>
> > > I don't know if something like this would be acceptable? It's not
> > > pretty, but I didn't see an easier way.
> >
> > ISTR we made the choice not to do that to avoid potential problems
> > with race conditions and bugs (i.e. don't modify anything in objects
> > on read access), but I can't point you at anything specific...
>
> Sounds pretty reasonable, especially for the verifiers. For the paths
> that create/update the checksums (including this log checksum), it
> seems like it should be less controversial.

Yup. For the paths that update the checksum we have to hold an exclusive
lock on the object (and always will), so I can't see a problem with
changing the update code to use this method.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
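
[For reference, a minimal userspace sketch of the batching trick under
discussion: zero the on-disk CRC field in place and feed crc32c() the
whole buffer in one call, instead of three small segments (bytes before
the field, a substituted zero word, bytes after it), so a vectorized
crc32c implementation sees one large contiguous chunk. This is an
illustration, not the actual XFS patch: the bitwise crc32c() below is a
plain stand-in for the kernel's crc32c(), and cksum_split()/cksum_whole()
are made-up names, not the real xfs_cksum.h helpers. The seed mirrors
XFS's use of ~0 as its CRC seed.]

/* Sketch: whole-buffer CRC batching vs. the split three-call form. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CRC_SEED 0xffffffffU	/* XFS seeds its CRCs with ~0 */

/* Simple bitwise CRC32C (Castagnoli); stand-in for kernel crc32c(). */
static uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82f63b78U & -(crc & 1));
	}
	return crc;
}

/* Split form: three calls, each too small for a vector implementation
 * to get out of its lead-in/lead-out code. */
static uint32_t cksum_split(const char *buf, size_t len, size_t crc_off)
{
	uint32_t zero = 0;
	uint32_t crc;

	crc = crc32c(CRC_SEED, buf, crc_off);		/* up to the field */
	crc = crc32c(crc, &zero, sizeof(zero));		/* skip the field */
	return crc32c(crc, buf + crc_off + sizeof(uint32_t),
		      len - (crc_off + sizeof(uint32_t)));
}

/* Batched form: zero the CRC field in place, then one call over the
 * whole buffer. Safe on the update side, where the caller holds the
 * object exclusively. */
static uint32_t cksum_whole(char *buf, size_t len, size_t crc_off)
{
	memset(buf + crc_off, 0, sizeof(uint32_t));
	return crc32c(CRC_SEED, buf, len);
}

int main(void)
{
	char buf[512] = "example metadata block";
	size_t crc_off = 64;	/* illustrative field offset */

	/* Both forms checksum the same byte stream, so they agree. */
	uint32_t a = cksum_split(buf, sizeof(buf), crc_off);
	uint32_t b = cksum_whole(buf, sizeof(buf), crc_off);

	printf("split=%08x whole=%08x\n", a, b);
	return a != b;
}

[The verify side is the contentious half of this: doing the same trick
there means writing to an object you only have read access to, which is
exactly the race concern raised above.]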