From: Ric Wheeler
Subject: Re: [JBD] change batching logic to improve O_SYNC performance
Date: Thu, 15 Dec 2005 16:39:12 -0500
To: Andrew Morton
Cc: Benjamin LaHaise, sct@redhat.com, linux-fsdevel@vger.kernel.org
In-Reply-To: <20051215155552.1f71a16e.akpm@osdl.org>

Andrew Morton wrote:
> Benjamin LaHaise wrote:
>
>>Hello folks,
>>
>>When writing files out using O_SYNC, jbd's 1 jiffy delay results in a
>>significant drop in throughput as the disk sits idle. The patch below
>>results in a 4-5x performance improvement (from 6.5MB/s to ~24-30MB/s on
>>my IDE test box) when writing out files using O_SYNC.
>
> That's really sad. Thanks for working that out.
>
>>Instead of always delaying for 1 jiffy when trying to batch, merely do a
>>yield() to allow other processes to execute and potentially batch
>>requests.
>
> Yeah, 2.4 has yield(). The O(1) yield semantics resulted in a performance
> catastrophe in ext3 when the system was busy, so the batching code got
> changed to a one-jiffy-sleep. I don't think we can go back to yield().
>
> Worst-case we should just dump the batching code: single-threaded
> O_SYNC/fsync is probably a commoner case than multi-threaded, dunno.

I think that assumption may hold for a single-threaded O_SYNC process, but
it is not normally true for fsync()-heavy workloads. We run a
multi-threaded write workload precisely because it boosts files/sec to
about 4-5x the single-threaded write rate. With a properly configured
write barrier (highly recommended if you care about your data ;-)), an
fsync() call is quite expensive, so batching is a huge win.

I think NFS servers and other multi-threaded applications (mail servers?)
have a similar profile. In these cases you clearly benefit from combining
multiple fsync() requests into one disk operation.

> But surely we can do better than that.
>
> How's about something simple like just saying "if the last process which
> did a synchronous write is not this process, do the batching thing".

Despite some obvious complexity, I still think that adjusting the delay
based on the rate of synchronous requests would be the best approach. For
example, even in the O_SYNC write case, if a single thread is writing to
disk in rapid succession, any delay is probably a waste.

Another way to attack this is to expose some of the transaction mechanisms
to applications, so they can exercise explicit control over the commit
phase; that could be used to build a batched fsync(), etc.
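
To make that a bit more concrete, here is the kind of decision I have in
mind: skip the wait entirely when a single task is issuing back-to-back
synchronous writes, otherwise wait roughly as long as commits have been
taking so other threads can pile onto the same transaction. This is just a
toy userspace sketch, not against the real jbd code; all of the names and
numbers below are made up.

/* Hypothetical sketch of the batching heuristic discussed above.
 * Illustrative only; types and names do not match the jbd sources.
 */
#include <stdio.h>
#include <sys/types.h>

struct journal_batch_state {
	pid_t last_sync_pid;     /* pid of the last task that forced a commit */
	long  avg_commit_usecs;  /* running estimate of time between sync commits */
};

/* Return how long (in usecs) the caller should wait for other threads to
 * join the transaction before forcing a commit. */
static long batch_wait_usecs(struct journal_batch_state *js, pid_t pid)
{
	/* A single stream of O_SYNC writes from one task: waiting only
	 * adds latency, so don't. */
	if (js->last_sync_pid == pid)
		return 0;

	js->last_sync_pid = pid;

	/* Several tasks are fsync()ing: wait roughly as long as commits
	 * have been taking, so their updates can share one disk write. */
	return js->avg_commit_usecs;
}

int main(void)
{
	struct journal_batch_state js = {
		.last_sync_pid = 0,
		.avg_commit_usecs = 2000,	/* made-up starting estimate */
	};

	printf("task 100: wait %ld usecs\n", batch_wait_usecs(&js, 100));
	printf("task 100: wait %ld usecs\n", batch_wait_usecs(&js, 100));
	printf("task 200: wait %ld usecs\n", batch_wait_usecs(&js, 200));
	return 0;
}

The real version would also have to decay the estimate and cope with pid
reuse, but it shows why the single-threaded writer never pays the one-jiffy
penalty while the multi-threaded fsync() case still gets batched.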