From: Ric Wheeler <ric@emc.com>
Subject: Re: background on the ext3 batching performance issue
Date: Fri, 29 Feb 2008 09:52:56 -0500
Message-ID: <47C81C48.1030706@emc.com>
In-Reply-To: <20080228175422.GU155259@sgi.com>
References: <47C6A46D.8020700@emc.com> <200802281005.13068.jbacik@redhat.com> <200802281041.01411.jbacik@redhat.com> <47C6B2A5.4030609@emc.com> <20080228175422.GU155259@sgi.com>
To: David Chinner
Cc: Josef Bacik, "Theodore Ts'o", adilger@sun.com, jack@ucw.cz, "Feld, Andy", linux-fsdevel@vger.kernel.org

David Chinner wrote:
> On Thu, Feb 28, 2008 at 08:09:57AM -0500, Ric Wheeler wrote:
>> One more thought - what we really want here is a sense of the
>> latency of the device. In the S-ATA disk case, the batching
>> optimization works well: we "spend" at most an extra 4ms for the
>> chance of combining multiple slow 18ms operations.
>>
>> With the CLARiiON box we tested, the optimization fails badly: each
>> operation costs only 1.3ms, so we "optimize" by waiting 3-4 times
>> longer than it would take to do the operation immediately.
>>
>> This looks to me like the same problem IO schedulers face with
>> plugging - we want to figure out dynamically when to plug and
>> unplug, without hard-coding device-specific tunings.
>>
>> If we bypass the batching snippet for multi-threaded writers, we
>> would probably slow down this workload on normal S-ATA/ATA drives
>> (or even higher performance non-RAID disks).
>
> It's the self-tuning aspect of this problem that makes it hard. In
> the case of XFS, the tuning is done by looking at the state of the
> previous log I/O buffer to check whether it is still syncing to
> disk. If it is, we go to sleep waiting for that log buffer I/O to
> complete. This holds the current buffer open to aggregate more
> transactions before syncing it to disk, and hence allows parallel
> fsyncs to be issued in the one log write. The fact that it waits
> for the previous log I/O to complete means it self-tunes to the
> latency of the underlying storage medium.....
>
> Cheers,
>
> Dave.

This sounds like a really clean way to self-tune without any
hard-coded assumptions (like ext3's current 1HZ wait)...
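To check my understanding, here is a minimal userspace sketch of that
mechanism - not the actual XFS log code, and every identifier in it
(log_state, log_force, do_log_io) is invented for illustration. The
point is that a thread forcing the log sleeps on the *previous* log
buffer's I/O completion rather than for a fixed interval, so the wait
scales with device latency and every thread that arrives in the
meantime is covered by the next single log write:

/*
 * Sketch (NOT the real XFS code) of latency-self-tuning fsync
 * batching: sleep on the previous log write's completion instead of
 * for a fixed interval.  Build with: cc -pthread sketch.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

struct log_state {
	pthread_mutex_t lock;
	pthread_cond_t  io_done;   /* signalled when a log write finishes */
	bool            in_flight; /* previous log buffer still syncing? */
	int             batched;   /* transactions in the current buffer */
};

static struct log_state lg = {
	.lock    = PTHREAD_MUTEX_INITIALIZER,
	.io_done = PTHREAD_COND_INITIALIZER,
};

/* Stand-in for the device write; its latency is what tunes the wait. */
static void do_log_io(int ntrans)
{
	usleep(10 * 1000);         /* pretend the log write takes 10ms */
	printf("log write covered %d transaction(s)\n", ntrans);
}

/* Called by each thread that wants its transaction on stable storage. */
static void log_force(void)
{
	pthread_mutex_lock(&lg.lock);
	lg.batched++;              /* join the current log buffer */

	/*
	 * The self-tuning step: while the previous buffer is still
	 * syncing, sleep until that I/O completes.  Everyone who
	 * arrives in the meantime lands in the same buffer, so one
	 * log write will cover all of them.
	 */
	while (lg.in_flight)
		pthread_cond_wait(&lg.io_done, &lg.lock);

	if (lg.batched == 0) {
		/* Another thread's write already covered us. */
		pthread_mutex_unlock(&lg.lock);
		return;
	}

	/* We issue the write for everything batched so far. */
	int ntrans = lg.batched;
	lg.batched = 0;
	lg.in_flight = true;
	pthread_mutex_unlock(&lg.lock);

	do_log_io(ntrans);

	pthread_mutex_lock(&lg.lock);
	lg.in_flight = false;
	pthread_cond_broadcast(&lg.io_done);
	pthread_mutex_unlock(&lg.lock);
}

static void *fsync_worker(void *arg)
{
	(void)arg;
	log_force();
	return NULL;
}

int main(void)
{
	pthread_t threads[8];

	for (int i = 0; i < 8; i++)
		pthread_create(&threads[i], NULL, fsync_worker, NULL);
	for (int i = 0; i < 8; i++)
		pthread_join(threads[i], NULL);
	return 0;
}

With 8 threads and a 10ms "device" this issues 2 log writes instead
of 8, and on a 1.3ms device the wait would shrink to ~1.3ms - exactly
the behavior we were failing to get with a hard-coded timeout.

ric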