From mboxrd@z Thu Jan  1 00:00:00 1970
From: Josef Bacik <jbacik@redhat.com>
Subject: Re: background on the ext3 batching performance issue
Date: Thu, 28 Feb 2008 10:05:11 -0500
Message-ID: <200802281005.13068.jbacik@redhat.com>
References: <47C6A46D.8020700@emc.com>
Mime-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Cc: "Theodore Ts'o" <tytso@mit.edu>, adilger@sun.com,
	David Chinner <dgc@sgi.com>, jack@ucw.cz,
	"Feld, Andy" <Feld_Andy@emc.com>, linux-fsdevel@vger.kernel.org
To: Ric Wheeler <ric@emc.com>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mx1.redhat.com ([66.187.233.31]:38706 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752447AbYB1PQz (ORCPT <rfc822;linux-fsdevel@vger.kernel.org>);
	Thu, 28 Feb 2008 10:16:55 -0500
In-Reply-To: <47C6A46D.8020700@emc.com>
Content-Disposition: inline
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Thursday 28 February 2008 7:09:17 am Ric Wheeler wrote:
> At the LSF workshop, I mentioned that we have tripped across an
> embarrassing performance issue in the jbd transaction code which is
> clearly not tuned for low latency devices.
>
> The short summary is that we can do say 800 10k files/sec in a
> write/fsync/close loop with a single thread, but drop down to under 250
> files/sec with 2 or more threads.
>
> This is pretty easy to reproduce with any small file write synchronous
> workload (i.e., fsync() each file before close).  We used my fs_mark
> tool to reproduce.
>
> The core of the issue is the call in the jbd transaction code to
> schedule_timeout_uninterruptible(1), which causes us to sleep for 4ms:
>
>         pid = current->pid;
>         if (handle->h_sync && journal->j_last_sync_writer != pid) {
>                 journal->j_last_sync_writer = pid;
>                 do {
>                         old_handle_count = transaction->t_handle_count;
>                         schedule_timeout_uninterruptible(1);
>                 } while (old_handle_count != transaction->t_handle_count);
>         }
>
> This is quite topical to the concern we had with low latency devices in
> general, but specifically things like SSD's.
>

Your test case does in fact expose a weakness in this optimization, but consider
the more likely case, where you have multiple writers on the same filesystem
rather than one guy doing write/fsync.  If we wait, the other writers can
potentially add quite a few more buffers to this transaction before we flush
it, rather than us flushing a buffer or two at a time.  What would you propose
as a solution?
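
To put rough numbers on the tradeoff, here is a back-of-the-envelope sketch in
C.  It is a toy model, not jbd code and not measured data.  The assumptions:
a 4ms sleep for one jiffy implies HZ=250; 800 files/sec single threaded
implies roughly 1.25ms of work per file; commits do not overlap.  The helper
names and the io_ms and batch parameters are made up for illustration.

```c
#include <assert.h>

/* Length of one timer tick (jiffy) in milliseconds for a given HZ. */
static double jiffy_ms(int hz)
{
        assert(hz > 0);
        return 1000.0 / hz;
}

/*
 * Toy model: each commit sleeps for sleep_jiffies ticks, then spends
 * io_ms writing, and carries `batch` files.  Commits are assumed to be
 * strictly serialized, which is an assumption of this sketch.
 * Returns the resulting files per second.
 */
static double files_per_sec(int hz, int sleep_jiffies, double io_ms, int batch)
{
        double commit_ms = jiffy_ms(hz) * sleep_jiffies + io_ms;

        assert(commit_ms > 0.0);
        return batch * 1000.0 / commit_ms;
}
```

With these assumptions, files_per_sec(250, 0, 1.25, 1) gives the 800 files/sec
single-writer figure, and files_per_sec(250, 1, 1.25, 1) drops to about 190,
which is in line with "under 250".  The same sleep with, say, 8 files batched
into one commit, files_per_sec(250, 1, 1.25, 8), comes out above 1500, which
is the case the optimization is aiming for.  Whether real workloads batch that
well is exactly the open question.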

Josef