From mboxrd@z Thu Jan 1 00:00:00 1970
From: Josef Bacik
Subject: Re: background on the ext3 batching performance issue
Date: Thu, 28 Feb 2008 10:05:11 -0500
Message-ID: <200802281005.13068.jbacik@redhat.com>
References: <47C6A46D.8020700@emc.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Cc: "Theodore Ts'o", adilger@sun.com, David Chinner, jack@ucw.cz, "Feld, Andy", linux-fsdevel@vger.kernel.org
To: Ric Wheeler
Return-path:
Received: from mx1.redhat.com ([66.187.233.31]:38706 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752447AbYB1PQz (ORCPT ); Thu, 28 Feb 2008 10:16:55 -0500
In-Reply-To: <47C6A46D.8020700@emc.com>
Content-Disposition: inline
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On Thursday 28 February 2008 7:09:17 am Ric Wheeler wrote:
> At the LSF workshop, I mentioned that we have tripped across an
> embarrassing performance issue in the jbd transaction code which is
> clearly not tuned for low latency devices.
>
> The short summary is that we can do say 800 10k files/sec in a
> write/fsync/close loop with a single thread, but drop down to under 250
> files/sec with 2 or more threads.
>
> This is pretty easy to reproduce with any small file write synchronous
> workload (i.e., fsync() each file before close).  We used my fs_mark
> tool to reproduce.
>
> The core of the issue is the call in the jbd transaction code call out
> to schedule_timeout_uninterruptible(1) which causes us to sleep for 4ms:
>
>         pid = current->pid;
>         if (handle->h_sync && journal->j_last_sync_writer != pid) {
>                 journal->j_last_sync_writer = pid;
>                 do {
>                         old_handle_count = transaction->t_handle_count;
>                         schedule_timeout_uninterruptible(1);
>                 } while (old_handle_count != transaction->t_handle_count);
>         }
>
> This is quite topical to the concern we had with low latency devices in
> general, but specifically things like SSD's.
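
The 4ms sleep and the "under 250 files/sec" number are consistent with
each other, which a bit of arithmetic makes visible.  A minimal sketch,
assuming HZ=250 (inferred from the 4ms-per-tick figure above, not stated
in the thread), treating each schedule_timeout_uninterruptible(1) as one
full tick, and assuming sync commits do not overlap.  The function names
are made up for illustration and are not kernel APIs:

```c
#include <assert.h>

/* Length of one timer tick (jiffy) in microseconds for a given HZ. */
long jiffy_usec(long hz)
{
        assert(hz > 0);
        return 1000000L / hz;
}

/*
 * Rough ceiling on sync commits per second if every commit sleeps for
 * `ticks` full jiffies first and commits are strictly serialized.
 * This ignores the actual I/O time, so the real rate will be lower.
 */
long max_sync_commits_per_sec(long hz, long ticks)
{
        assert(ticks > 0);
        return 1000000L / (jiffy_usec(hz) * ticks);
}
```

Under those assumptions jiffy_usec(250) is 4000 and
max_sync_commits_per_sec(250, 1) is 250, so a single tick of waiting per
sync handle already caps the multi-threaded case at about the rate
reported above.  On a HZ=1000 kernel the same call would give a bound of
about 1000 per second.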

Your testcase does in fact show a weakness in this optimization, but look at
the more likely case, where you have multiple writers on the same filesystem
rather than one guy doing write/fsync.  If we wait, we could potentially add
quite a few more buffers to this transaction before flushing it, rather than
flushing a buffer or two at a time.  What would you propose as a solution?

Josef