From: Ric Wheeler
Subject: Re: background on the ext3 batching performance issue
Date: Thu, 28 Feb 2008 14:48:50 -0500
Message-ID: <47C71022.5070608@emc.com>
In-Reply-To: <20080228175422.GU155259@sgi.com>
To: David Chinner
Cc: Josef Bacik, "Theodore Ts'o", adilger@sun.com, jack@ucw.cz, "Feld, Andy", linux-fsdevel@vger.kernel.org

David Chinner wrote:
> On Thu, Feb 28, 2008 at 08:09:57AM -0500, Ric Wheeler wrote:
>> One more thought - what we really want here is to have a sense of the
>> latency of the device. In the S-ATA disk case, this optimization works
>> well for batching since we "spend" an extra 4ms worst case in the chance
>> of combining multiple, slow 18ms operations.
>>
>> With the clariion box we tested, the optimization fails badly since the
>> cost is only 1.3 ms so we optimize by waiting 3-4 times longer than it
>> would take to do the operation immediately.
>>
>> This problem has also seemed to me to be the same problem that IO
>> schedulers do with plugging - we want to dynamically figure out when to
>> plug and unplug here without hard coding in device specific tunings.
>>
>> If we bypass the snippet for multi-threaded writers, we would probably
>> slow down this workload on normal S-ATA/ATA drives (or even higher
>> performance non-RAID disks).
>
> It's the self-tuning aspect of this problem that makes it hard.
> In the case of XFS, the way this tuning is done is that we look at the
> state of the previous log I/O buffer to check if it is still syncing
> to disk. If it is still syncing to disk, we go to sleep waiting for that
> log buffer I/O to complete. This holds the current buffer open to
> aggregate more transactions before syncing it to disk and hence
> allows parallel fsyncs to be issued in the one log write. The fact
> that it waits for the previous log I/O to complete means it
> self-tunes to the latency of the underlying storage medium.....
>
> Cheers,
>
> Dave.

With the experiments we ran before, the heuristic did eventually start
helping when we hit really high numbers of concurrent writing threads on
the Clariion box. I forget how many, but it was at least 12 or so.

ric
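
For anyone who wants to play with the numbers in this thread, here is a
rough back-of-the-envelope model of the trade-off. It is only a sketch
under simplifying assumptions: every thread issues one fsync per round,
unbatched commits are written back to back, and a batched round pays the
fixed wait once followed by a single write. The function names are made up
for illustration and the 4 ms, 18 ms and 1.3 ms figures are the ones quoted
above. The model ignores array caching, queueing and CPU cost, so it puts
the break-even thread count on the Clariion lower than the "at least 12 or
so" we actually observed; treat it as an optimistic lower bound.

```python
import math


def per_fsync_cost_ms(device_ms, wait_ms, threads, batching):
    """Average time per fsync in one round where each thread commits once."""
    if batching:
        # All threads share one delayed write: one fixed wait, then one I/O.
        return (wait_ms + device_ms) / threads
    # No batching: commits go out back to back, one device I/O each.
    return device_ms


def break_even_threads(device_ms, wait_ms):
    """Smallest thread count at which batching beats immediate writes."""
    # Batching wins when (wait + device) / n < device, i.e. n > 1 + wait/device.
    return math.floor(1 + wait_ms / device_ms) + 1


WAIT_MS = 4.0  # worst-case batching wait quoted in the thread

# Device latencies quoted in the thread (labels are for illustration only).
for name, device_ms in (("S-ATA disk", 18.0), ("Clariion", 1.3)):
    single = per_fsync_cost_ms(device_ms, WAIT_MS, 1, batching=True)
    print(f"{name}: single writer pays {single:.1f} ms with the wait "
          f"vs {device_ms:.1f} ms without; "
          f"model break-even at {break_even_threads(device_ms, WAIT_MS)} threads")
```

The point of the toy model is only that the break-even depends on the ratio
of the wait to the device latency, which is why a fixed wait cannot suit
both kinds of storage and why waiting on the previous log I/O, as XFS does,
adapts on its own.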