From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ric Wheeler <ric@emc.com>
Subject: Re: background on the ext3 batching performance issue
Date: Thu, 28 Feb 2008 08:03:42 -0500
Message-ID: <47C6B12E.8020306@emc.com>
References: <47C6A46D.8020700@emc.com> <200802281005.13068.jbacik@redhat.com> <200802281041.01411.jbacik@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "Theodore Ts'o" <tytso@mit.edu>, adilger@sun.com,
	David Chinner <dgc@sgi.com>, jack@ucw.cz,
	"Feld, Andy" <Feld_Andy@emc.com>, linux-fsdevel@vger.kernel.org
To: Josef Bacik <jbacik@redhat.com>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mexforward.lss.emc.com ([128.222.32.20]:37870 "EHLO
	mexforward.lss.emc.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1754625AbYB1QFm (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Thu, 28 Feb 2008 11:05:42 -0500
In-Reply-To: <200802281041.01411.jbacik@redhat.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

Josef Bacik wrote:
> On Thursday 28 February 2008 10:05:11 am Josef Bacik wrote:
>> On Thursday 28 February 2008 7:09:17 am Ric Wheeler wrote:
>>> At the LSF workshop, I mentioned that we have tripped across an
>>> embarrassing performance issue in the jbd transaction code which is
>>> clearly not tuned for low latency devices.
>>>
>>> The short summary is that we can do, say, 800 10k files/sec in a
>>> write/fsync/close loop with a single thread, but drop to under 250
>>> files/sec with 2 or more threads.
>>>
>>> This is pretty easy to reproduce with any small file write synchronous
>>> workload (i.e., fsync() each file before close).  We used my fs_mark
>>> tool to reproduce.
>>>
>>> The core of the issue is the call in the jbd transaction code to
>>> schedule_timeout_uninterruptible(1), which causes us to sleep for 4ms:
>>>
>>>         pid = current->pid;
>>>         if (handle->h_sync && journal->j_last_sync_writer != pid) {
>>>                 journal->j_last_sync_writer = pid;
>>>                 do {
>>>                         old_handle_count = transaction->t_handle_count;
>>>                         schedule_timeout_uninterruptible(1);
>>>                 } while (old_handle_count != transaction->t_handle_count);
>>>         }
>>>
>>> This is quite topical to the concern we had with low latency devices in
>>> general, and specifically with things like SSDs.
>> Your testcase does in fact show a weakness in this optimization, but look
>> at the more likely case, where you have multiple writers on the same
>> filesystem rather than one guy doing write/fsync.  If we wait we could
>> potentially add quite a few more buffers to this transaction before
>> flushing it, rather than flushing a buffer or two at a time.  What would
>> you propose as a solution?
>>
> 
> Forgive me, I said that badly.  Now that I've had my morning coffee, let me
> try again.  You are ping-ponging j_last_sync_writer back and forth between
> the two threads, so you don't get the speedup you would get with one thread,
> where we bypass the next sleep because we know a single thread is doing
> write/sync.  That raises the question: should we try to detect the case
> where multiple threads are doing write/sync and therefore hitting the
> weakness in this optimization, and if so, how would we do it properly?  The
> only thing I can think of is to track sync writers on a transaction and, if
> there is more than one, bypass this little snippet.  In fact, I think I'll
> go ahead and do that and see what fs_mark comes up with.
> Thank you,
> 
> Josef
> 

Even worse, we go 4 times slower with 2 threads than we do with a single 
thread!

This code has tried several things in the past - reiserfs used to do a 
yield() at one point.

I am traveling until the weekend, but will be able to help with this 
when I get back into my lab on Monday...

ric