From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ric Wheeler
Subject: Re: some hard numbers on ext3 & batching performance issue
Date: Fri, 07 Mar 2008 15:45:58 -0500
Message-ID: <47D1A986.8010307@emc.com>
References: <47C6A46D.8020700@emc.com> <200803051520.09931.jbacik@redhat.com> <47D1A0C0.8010908@emc.com> <200803071540.13958.jbacik@redhat.com>
Reply-To: ric@emc.com
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mexforward.lss.emc.com ([128.222.32.20]:33450 "EHLO
    mexforward.lss.emc.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
    with ESMTP id S1761407AbYCGUtb (ORCPT );
    Fri, 7 Mar 2008 15:49:31 -0500
In-Reply-To: <200803071540.13958.jbacik@redhat.com>
Sender: linux-scsi-owner@vger.kernel.org
List-Id: linux-scsi@vger.kernel.org
To: Josef Bacik
Cc: David Chinner , Theodore Ts'o , adilger@sun.com, jack@ucw.cz, "Feld, Andy" , linux-fsdevel@vger.kernel.org, linux-scsi

Josef Bacik wrote:
> On Friday 07 March 2008 3:08:32 pm Ric Wheeler wrote:
>> Josef Bacik wrote:
>>> On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
>>>> After the IO/FS workshop last week, I posted some details on the slow
>>>> down we see with ext3 when we have a low latency back end instead of a
>>>> normal local disk (SCSI/S-ATA/etc).
>> ...
>> ...
>> ...
>>
>>>> It would be really interesting to rerun some of these tests on xfs which
>>>> Dave explained in the thread last week has a more self tuning way to
>>>> batch up transactions....
>>>>
>>>> Note that all of those poor users who have a synchronous write workload
>>>> today are in the "1" row for each of the above tables.
>>> Mind giving this a whirl? The fastest thing I've got here is an Apple X
>>> RAID and its being used for something else atm, so I've only tested this
>>> on local disk to make sure it didn't make local performance suck (which
>>> it doesn't btw). This should be equivalent with what David says XFS does.
>>>
>>> Thanks much,
>>>
>>> Josef
>>>
>>> diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
>>> index c6cbb6c..4596e1c 100644
>>> --- a/fs/jbd/transaction.c
>>> +++ b/fs/jbd/transaction.c
>>> @@ -1333,8 +1333,7 @@ int journal_stop(handle_t *handle)
>>>  {
>>>          transaction_t *transaction = handle->h_transaction;
>>>          journal_t *journal = transaction->t_journal;
>>> -        int old_handle_count, err;
>>> -        pid_t pid;
>>> +        int err;
>>>
>>>          J_ASSERT(journal_current_handle() == handle);
>>>
>>> @@ -1353,32 +1352,22 @@ int journal_stop(handle_t *handle)
>>>
>>>          jbd_debug(4, "Handle %p going down\n", handle);
>>>
>>> -        /*
>>> -         * Implement synchronous transaction batching. If the handle
>>> -         * was synchronous, don't force a commit immediately. Let's
>>> -         * yield and let another thread piggyback onto this transaction.
>>> -         * Keep doing that while new threads continue to arrive.
>>> -         * It doesn't cost much - we're about to run a commit and sleep
>>> -         * on IO anyway. Speeds up many-threaded, many-dir operations
>>> -         * by 30x or more...
>>> -         *
>>> -         * But don't do this if this process was the most recent one to
>>> -         * perform a synchronous write. We do this to detect the case where a
>>> -         * single process is doing a stream of sync writes. No point in waiting
>>> -         * for joiners in that case.
>>> -         */
>>> -        pid = current->pid;
>>> -        if (handle->h_sync && journal->j_last_sync_writer != pid) {
>>> -                journal->j_last_sync_writer = pid;
>>> -                do {
>>> -                        old_handle_count = transaction->t_handle_count;
>>> -                        schedule_timeout_uninterruptible(1);
>>> -                } while (old_handle_count != transaction->t_handle_count);
>>> -        }
>>> -
>>>          current->journal_info = NULL;
>>>          spin_lock(&journal->j_state_lock);
>>>          spin_lock(&transaction->t_handle_lock);
>>> +
>>> +        if (journal->j_committing_transaction && handle->h_sync) {
>>> +                tid_t tid = journal->j_committing_transaction->t_tid;
>>> +
>>> +                spin_unlock(&transaction->t_handle_lock);
>>> +                spin_unlock(&journal->j_state_lock);
>>> +
>>> +                err = log_wait_commit(journal, tid);
>>> +
>>> +                spin_lock(&journal->j_state_lock);
>>> +                spin_lock(&transaction->t_handle_lock);
>>> +        }
>>> +
>>>          transaction->t_outstanding_credits -= handle->h_buffer_credits;
>>>          transaction->t_updates--;
>>>          if (!transaction->t_updates) {
>> Running with Josef's patch, I was able to see a clear improvement for
>> batching these synchronous operations on ext3 with the RAM disk and
>> array. It is not too often that you get to do a simple change and see a
>> 27 times improvement ;-)
>>
>> On the bad side, the local disk case took as much as a 30% drop in
>> performance. The specific disk is not one that I have a lot of
>> experience with, I would like to retry on a disk that has been qualified
>> by our group (i.e., we have reasonable confidence that there are no
>> firmware issues, etc).
>>
>> Now for the actual results.
>>
>> The results are the average value of 5 runs for each number of threads.
>>
>> Type         Threads   Baseline      Josef   Speedup (Josef/Baseline)
>> array              1      320.5      325.4    1.01
>> array              2      174.9      351.9    2.01
>> array              4      382.7      593.5    1.55
>> array              8      644.1      963.0    1.49
>> array             10      842.9     1038.7    1.23
>> array             20     1319.6     1432.3    1.08
>>
>> RAM disk           1     5621.4     5595.1    0.99
>> RAM disk           2      281.5     7613.3   27.04
>> RAM disk           4      579.9     9111.5   15.71
>> RAM disk           8      891.1     9357.3   10.50
>> RAM disk          10     1116.3     9873.6    8.84
>> RAM disk          20     1952.0    10703.6    5.48
>>
>> S-ATA disk         1       19.0       15.1    0.79
>> S-ATA disk         2       19.9       14.4    0.72
>> S-ATA disk         4       41.0       27.9    0.68
>> S-ATA disk         8       60.4       43.2    0.71
>> S-ATA disk        10       67.1       48.7    0.72
>> S-ATA disk        20      102.7       74.0    0.72
>>
>> Background on the tests:
>>
>> All of this is measured on three devices - a relatively old & slow
>> array, the local (slow!) 2.5" S-ATA disk in the box and a RAM disk.
>>
>> These numbers are used fs_mark to write 4096 byte files with the
>> following commands:
>>
>> fs_mark -d /home/test/t -s 4096 -n 40000 -N 50 -D 64 -t 1
>> ...
>> fs_mark -d /home/test/t -s 4096 -n 20000 -N 50 -D 64 -t 2
>> ...
>> fs_mark -d /home/test/t -s 4096 -n 10000 -N 50 -D 64 -t 4
>> ...
>> fs_mark -d /home/test/t -s 4096 -n 5000 -N 50 -D 64 -t 8
>> ...
>> fs_mark -d /home/test/t -s 4096 -n 4000 -N 50 -D 64 -t 10
>> ...
>> fs_mark -d /home/test/t -s 4096 -n 2000 -N 50 -D 64 -t 20
>> ...
>>
>> Note that this spreads the files across 64 subdirectories, each thread
>> writes 50 files and then moves on to the next in a round robin.
>>
>
> I'm starting to wonder about the disks I have, because my files/second is
> spanking yours, and its just a local samsung 3gb/s sata drive. With those
> commands I'm consistently getting over 700 files/sec. I'm seeing about a 1-5%
> increase in speed locally with my patch. I guess I'll start looking around for
> some other hardware and check on there in case this box is more badass than I
> think it is.
>
> Thanks much,
>
> Josef
>

Sounds like you might be running with write cache on & barriers off ;-)

Make sure you have write cache & barriers enabled on the drive. With a
good S-ATA drive, you should be seeing about 35-50 files/sec with a
single-threaded writer.

The local disk that I tested on is a relatively slow S-ATA disk that is
more laptop quality/performance than server.

One thought I had about the results is that we might be flipping the IO
sequence with the local disk case. It is the only device of the three
that I tested which is seek/head movement sensitive for small files.

ric
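
For anyone re-checking the speedup column in the table quoted above, here
is a minimal sketch that recomputes Josef/Baseline from the posted
files/sec figures. The names (RESULTS, speedup, mismatches) are made up
for illustration and are not part of fs_mark or the kernel. It also
assumes the posted ratios were truncated to two decimals rather than
rounded; that rule is inferred from the figures and is not stated
anywhere in the thread.

```python
import math

# (device, threads, baseline files/sec, patched files/sec, speedup as posted)
# Values are copied from the results table quoted in this message.
RESULTS = [
    ("array", 1, 320.5, 325.4, 1.01),
    ("array", 2, 174.9, 351.9, 2.01),
    ("array", 4, 382.7, 593.5, 1.55),
    ("array", 8, 644.1, 963.0, 1.49),
    ("array", 10, 842.9, 1038.7, 1.23),
    ("array", 20, 1319.6, 1432.3, 1.08),
    ("RAM disk", 1, 5621.4, 5595.1, 0.99),
    ("RAM disk", 2, 281.5, 7613.3, 27.04),
    ("RAM disk", 4, 579.9, 9111.5, 15.71),
    ("RAM disk", 8, 891.1, 9357.3, 10.50),
    ("RAM disk", 10, 1116.3, 9873.6, 8.84),
    ("RAM disk", 20, 1952.0, 10703.6, 5.48),
    ("S-ATA disk", 1, 19.0, 15.1, 0.79),
    ("S-ATA disk", 2, 19.9, 14.4, 0.72),
    ("S-ATA disk", 4, 41.0, 27.9, 0.68),
    ("S-ATA disk", 8, 60.4, 43.2, 0.71),
    ("S-ATA disk", 10, 67.1, 48.7, 0.72),
    ("S-ATA disk", 20, 102.7, 74.0, 0.72),
]


def speedup(baseline, patched):
    """Return patched/baseline, truncated (not rounded) to two decimals.

    Truncation is an assumption inferred from the posted numbers.
    """
    return math.floor(patched / baseline * 100) / 100


def mismatches(rows=RESULTS):
    """Return the rows whose recomputed speedup differs from the posted one."""
    return [row for row in rows if abs(speedup(row[2], row[3]) - row[4]) > 1e-9]


if __name__ == "__main__":
    for device, threads, base, patched, posted in RESULTS:
        ratio = speedup(base, patched)
        print(f"{device:10s} {threads:3d} {ratio:6.2f} (posted {posted:.2f})")
    print("rows that disagree with the posted column:", len(mismatches()))
```

A ratio below 1.0 means the patched kernel was slower than the baseline
for that row, which is the case for every S-ATA disk row.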