From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ric Wheeler
Subject: Re: background on the ext3 batching performance issue
Date: Thu, 28 Feb 2008 08:03:42 -0500
Message-ID: <47C6B12E.8020306@emc.com>
References: <47C6A46D.8020700@emc.com> <200802281005.13068.jbacik@redhat.com> <200802281041.01411.jbacik@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "Theodore Ts'o", adilger@sun.com, David Chinner, jack@ucw.cz, "Feld, Andy", linux-fsdevel@vger.kernel.org
To: Josef Bacik
Return-path:
Received: from mexforward.lss.emc.com ([128.222.32.20]:37870 "EHLO mexforward.lss.emc.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754625AbYB1QFm (ORCPT ); Thu, 28 Feb 2008 11:05:42 -0500
In-Reply-To: <200802281041.01411.jbacik@redhat.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

Josef Bacik wrote:
> On Thursday 28 February 2008 10:05:11 am Josef Bacik wrote:
>> On Thursday 28 February 2008 7:09:17 am Ric Wheeler wrote:
>>> At the LSF workshop, I mentioned that we have tripped across an
>>> embarrassing performance issue in the jbd transaction code which is
>>> clearly not tuned for low latency devices.
>>>
>>> The short summary is that we can do say 800 10k files/sec in a
>>> write/fsync/close loop with a single thread, but drop down to under 250
>>> files/sec with 2 or more threads.
>>>
>>> This is pretty easy to reproduce with any small file write synchronous
>>> workload (i.e., fsync() each file before close). We used my fs_mark
>>> tool to reproduce.
>>>
>>> The core of the issue is the call in the jbd transaction code call out
>>> to schedule_timeout_uninterruptible(1) which causes us to sleep for 4ms:
>>>
>>> 	pid = current->pid;
>>> 	if (handle->h_sync && journal->j_last_sync_writer != pid) {
>>> 		journal->j_last_sync_writer = pid;
>>> 		do {
>>> 			old_handle_count = transaction->t_handle_count;
>>> 			schedule_timeout_uninterruptible(1);
>>> 		} while (old_handle_count !=
>>> 				transaction->t_handle_count);
>>> 	}
>>>
>>> This is quite topical to the concern we had with low latency devices in
>>> general, but specifically things like SSD's.
>> Your testcase does in fact show a weakness in this optimization, but look
>> at the more likely case, where you have multiple writers on the same
>> filesystem rather than one guy doing write/fsync. If we wait we could
>> potentially add quite a few more buffers to this transaction before
>> flushing it, rather than flushing a buffer or two at a time. What would
>> you propose as a solution?
>>
>
> Forgive me, I said that badly, now that I've had my morning coffee let me try
> again. You are ping-ponging the j_last_sync_writer back and forth between the
> two threads, so you don't get the speedup you would get with one thread where
> we would just bypass the next sleep since we know we've got one thread doing
> write/sync. So this brings up the question, should we try and figure out if we
> have the situation where we have multiple threads doing write/sync and
> therefore exploiting the weakness in this optimization, and if we should, how
> would we do this properly? The only thing I can think to do is to track sync
> writers on a transaction, and if its more than one bypass this little snippet.
> In fact I think I'll go ahead and do that and see what fs_mark comes up with.
> Thank you,
>
> Josef
>

Even worse, we go 4 times slower with 2 threads than we do with a single thread!

This code has tried several things in the past - reiserfs used to do a yield() at one point.
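
To make the ping-pong concrete, here is a rough userspace model of the check in the quoted snippet. It is only a sketch, not kernel code: count_batching_sleeps() and sleep_cost_ms() are names invented for illustration, the model assumes t_handle_count stays put so each trip through the branch sleeps exactly once, and it assumes HZ=250 with each sleep lasting a full jiffy, which is what the 4ms figure implies.

```c
#include <stddef.h>

/*
 * Userspace model of the j_last_sync_writer check (hypothetical helper,
 * not part of jbd). sync_pids[] is the order in which threads start a
 * sync handle. Returns how many times the sleep branch would be taken,
 * assuming the do/while loop runs exactly once per trip.
 */
size_t count_batching_sleeps(const int *sync_pids, size_t n)
{
	int last_sync_writer = 0;	/* stands in for journal->j_last_sync_writer */
	size_t sleeps = 0;

	for (size_t i = 0; i < n; i++) {
		if (last_sync_writer != sync_pids[i]) {
			last_sync_writer = sync_pids[i];
			sleeps++;	/* schedule_timeout_uninterruptible(1) would run here */
		}
	}
	return sleeps;
}

/*
 * Time spent sleeping, in milliseconds, if every sleep costs one full
 * jiffy. hz is an assumption; 250 matches the 4ms quoted above.
 */
double sleep_cost_ms(size_t sleeps, int hz)
{
	return (double)sleeps * 1000.0 / (double)hz;
}
```

Feeding it six fsyncs from one pid (101, 101, 101, ...) gives 1 sleep, while six fsyncs alternating between two pids (101, 102, 101, ...) gives 6 sleeps, or 24ms at HZ=250. Under these assumptions every alternating fsync pays the full 4ms, which would put a ceiling of about 250 files/sec on the run.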
I am traveling until the weekend, but will be able to help with this when I get back in to my lab on Monday...

ric
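
For anyone who wants to see the shape of the workload without fs_mark, the write/fsync/close loop from the quoted summary can be sketched in a few lines of C. This is a rough stand-in and not fs_mark itself: the function names, the file naming scheme and the 4096-byte write buffer are all made up for illustration, and short writes are simply retried until the requested size is reached.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Write `size` bytes to `path`, fsync() the file, then close it.
 * Returns 0 on success, -1 on any error. (Hypothetical helper.)
 */
int write_fsync_close(const char *path, size_t size)
{
	char buf[4096];
	size_t left = size;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	memset(buf, 'x', sizeof(buf));
	while (left > 0) {
		size_t chunk = left < sizeof(buf) ? left : sizeof(buf);
		ssize_t n = write(fd, buf, chunk);

		if (n <= 0) {
			close(fd);
			return -1;
		}
		left -= (size_t)n;
	}
	/* The fsync() before close is what makes each file a sync handle. */
	if (fsync(fd) != 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}

/*
 * Create `nfiles` files of `size` bytes named <dir>/<tag>-<i>.dat.
 * Returns the number of files written and synced before the first error.
 */
int sync_write_loop(const char *dir, const char *tag, int nfiles, size_t size)
{
	char path[4096];
	int done = 0;

	for (int i = 0; i < nfiles; i++) {
		snprintf(path, sizeof(path), "%s/%s-%d.dat", dir, tag, i);
		if (write_fsync_close(path, size) != 0)
			break;
		done++;
	}
	return done;
}
```

Calling sync_write_loop() with a size of 10240 from a single thread approximates the one-writer case. Running it at the same time from a second thread or process, with a different tag, on the same filesystem should approximate the multi-writer case, since current->pid differs between the callers. Timing the loop and dividing the file count by the elapsed seconds gives a files/sec number to compare.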