From: Ric Wheeler <ric@emc.com>
Subject: Re: background on the ext3 batching performance issue
Date: Fri, 29 Feb 2008 09:52:56 -0500
Message-ID: <47C81C48.1030706@emc.com>
In-Reply-To: <20080228175422.GU155259@sgi.com>
References: <47C6A46D.8020700@emc.com> <200802281005.13068.jbacik@redhat.com> <200802281041.01411.jbacik@redhat.com> <47C6B2A5.4030609@emc.com> <20080228175422.GU155259@sgi.com>
To: David Chinner
Cc: Josef Bacik, "Theodore Ts'o", adilger@sun.com, jack@ucw.cz, "Feld, Andy", linux-fsdevel@vger.kernel.org

David Chinner wrote:
> On Thu, Feb 28, 2008 at 08:09:57AM -0500, Ric Wheeler wrote:
>> One more thought - what we really want here is a sense of the
>> latency of the device. In the S-ATA disk case, the batching
>> optimization works well: we "spend" at most an extra 4ms for the
>> chance of combining multiple slow 18ms operations.
>>
>> With the CLARiiON box we tested, the optimization fails badly: each
>> operation costs only 1.3ms, so we "optimize" by waiting 3-4 times
>> longer than it would take to do the operation immediately.
>>
>> This looks to me like the same problem IO schedulers face with
>> plugging - we want to figure out dynamically when to plug and
>> unplug, without hard-coding device-specific tunings.
>>
>> If we bypass the batching snippet for multi-threaded writers, we
>> would probably slow down this workload on normal S-ATA/ATA drives
>> (or even higher performance non-RAID disks).
>
> It's the self-tuning aspect of this problem that makes it hard. In
> the case of XFS, the tuning is done by looking at the state of the
> previous log I/O buffer to check whether it is still syncing to
> disk. If it is, we go to sleep waiting for that log buffer I/O to
> complete. This holds the current buffer open to aggregate more
> transactions before syncing it to disk, and hence allows parallel
> fsyncs to be issued in the one log write. The fact that it waits
> for the previous log I/O to complete means it self-tunes to the
> latency of the underlying storage medium.....
>
> Cheers,
>
> Dave.

This sounds like a really clean way to self-tune without any
hard-coded assumptions (like ext3's current 1HZ wait)...
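To check my understanding, here is a minimal userspace sketch of that
mechanism - not the actual XFS log code, and every identifier in it
(log_state, log_force, do_log_io) is invented for illustration. The
point is that a thread forcing the log sleeps on the *previous* log
buffer's I/O completion rather than for a fixed interval, so the wait
scales with device latency and every thread that arrives in the
meantime is covered by the next single log write:

/*
 * Sketch (NOT the real XFS code) of latency-self-tuning fsync
 * batching: sleep on the previous log write's completion instead of
 * for a fixed interval.  Build with: cc -pthread sketch.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

struct log_state {
	pthread_mutex_t lock;
	pthread_cond_t  io_done;   /* signalled when a log write finishes */
	bool            in_flight; /* previous log buffer still syncing? */
	int             batched;   /* transactions in the current buffer */
};

static struct log_state lg = {
	.lock    = PTHREAD_MUTEX_INITIALIZER,
	.io_done = PTHREAD_COND_INITIALIZER,
};

/* Stand-in for the device write; its latency is what tunes the wait. */
static void do_log_io(int ntrans)
{
	usleep(10 * 1000);         /* pretend the log write takes 10ms */
	printf("log write covered %d transaction(s)\n", ntrans);
}

/* Called by each thread that wants its transaction on stable storage. */
static void log_force(void)
{
	pthread_mutex_lock(&lg.lock);
	lg.batched++;              /* join the current log buffer */

	/*
	 * The self-tuning step: while the previous buffer is still
	 * syncing, sleep until that I/O completes.  Everyone who
	 * arrives in the meantime lands in the same buffer, so one
	 * log write will cover all of them.
	 */
	while (lg.in_flight)
		pthread_cond_wait(&lg.io_done, &lg.lock);

	if (lg.batched == 0) {
		/* Another thread's write already covered us. */
		pthread_mutex_unlock(&lg.lock);
		return;
	}

	/* We issue the write for everything batched so far. */
	int ntrans = lg.batched;
	lg.batched = 0;
	lg.in_flight = true;
	pthread_mutex_unlock(&lg.lock);

	do_log_io(ntrans);

	pthread_mutex_lock(&lg.lock);
	lg.in_flight = false;
	pthread_cond_broadcast(&lg.io_done);
	pthread_mutex_unlock(&lg.lock);
}

static void *fsync_worker(void *arg)
{
	(void)arg;
	log_force();
	return NULL;
}

int main(void)
{
	pthread_t threads[8];

	for (int i = 0; i < 8; i++)
		pthread_create(&threads[i], NULL, fsync_worker, NULL);
	for (int i = 0; i < 8; i++)
		pthread_join(threads[i], NULL);
	return 0;
}

With 8 threads and a 10ms "device" this issues 2 log writes instead
of 8, and on a 1.3ms device the wait would shrink to ~1.3ms - exactly
the behavior we were failing to get with a hard-coded timeout.

ric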