* some hard numbers on ext3 & batching performance issue
From: Ric Wheeler @ 2008-03-05 19:19 UTC
To: David Chinner
Cc: Josef Bacik, Theodore Ts'o, adilger, jack, Feld, Andy,
linux-fsdevel, linux-scsi
After the IO/FS workshop last week, I posted some details on the slowdown we see with ext3 when we have a low-latency back end instead of a normal local disk (SCSI/S-ATA/etc.).
As a follow up to that thread, I wanted to post some real numbers that
Andy from our performance team pulled together. Andy tested various
patches using three classes of storage (S-ATA, RAM disk and Clariion array).
Note that this testing was done on a SLES10/SP1 kernel; the code in question has not changed in mainline, but we should probably retest on something newer just to clear up any doubts.
The workload is generated using fs_mark
(http://sourceforge.net/projects/fsmark/), which is basically a small-file
write workload in which each file gets fsync'ed before close. The
metric is "files/sec".
The clearest result used a ramdisk to store 4k files.
We modified ext3 and jbd to accept a new mount option, bdelay. Use it like:
mount -o bdelay=n dev mountpoint
n is passed to schedule_timeout_interruptible() in the jbd code. If n == 0,
the whole batching loop is skipped. If n is "yield", the
schedule_timeout_interruptible(n) call is replaced with yield().
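For illustration, a rough sketch of how such a knob could hook into the stock
jbd batching loop in journal_stop() -- this is not the actual test patch
(which is not included here), and j_bdelay / j_bdelay_yield are made-up names
for wherever the parsed mount option would be stored on the journal:

        /*
         * Hypothetical sketch only - not the actual test patch.
         * j_bdelay / j_bdelay_yield stand in for wherever the bdelay
         * mount option is stored on the journal.
         */
        pid = current->pid;
        if (handle->h_sync && journal->j_last_sync_writer != pid) {
                journal->j_last_sync_writer = pid;
                if (journal->j_bdelay_yield) {          /* bdelay=yield */
                        do {
                                old_handle_count = transaction->t_handle_count;
                                yield();
                        } while (old_handle_count != transaction->t_handle_count);
                } else if (journal->j_bdelay) {         /* bdelay=n, n > 0 */
                        do {
                                old_handle_count = transaction->t_handle_count;
                                schedule_timeout_interruptible(journal->j_bdelay);
                        } while (old_handle_count != transaction->t_handle_count);
                }
                /* bdelay=0: skip the batching loop entirely */
        }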
Note that in each table below, the first column is the delay value (in
jiffies, on a 250HZ build) and the header row gives the number of concurrent
threads writing 4KB files.
Ramdisk test:
bdelay     1     2     4     8    10    20
0       4640  4498  3226  1721  1436   664
yield   4640  4078  2977  1611  1136   551
1       4647   250   482   588   629   483
2       4522   149   233   422   450   389
3       4504    86   165   271   308   334
4       4425    84   128   222   253   293
Midrange clariion:
bdelay     1     2     4     8    10    20
0        778   923  1567  1424  1276   785
yield    791   931  1551  1473  1328   806
1        793   304   499   714   751   760
2        789   132   201   382   441   589
3        792   124   168   298   342   471
4        786    71   116   237   277   393
Local disk:
bdelay     1     2     4     8    10    20
0         47    51    81   135   160   234
yield     36    45    74   117   138   214
1         44    52    86   148   183   258
2         40    60   109   163   184   265
3         40    52    97   148   171   264
4         35    42    83   149   169   246
Apologies for mangling the nicely formatted tables.
Note that the justification for the batching as we have it today is
basically this last local drive test case.
It would be really interesting to rerun some of these tests on xfs, which,
as Dave explained in the thread last week, has a more self-tuning way to
batch up transactions....
Note that all of those poor users who have a synchronous write workload
today are in the "1" row for each of the above tables.
ric
* Re: some hard numbers on ext3 & batching performance issue
From: Josef Bacik @ 2008-03-05 20:20 UTC
To: ric
Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy,
linux-fsdevel, linux-scsi
On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
> After the IO/FS workshop last week, I posted some details on the slow
> down we see with ext3 when we have a low latency back end instead of a
> normal local disk (SCSI/S-ATA/etc).
>
> As a follow up to that thread, I wanted to post some real numbers that
> Andy from our performance team pulled together. Andy tested various
> patches using three classes of storage (S-ATA, RAM disk and Clariion
> array).
>
> Note that this testing was done on a SLES10/SP1 kernel, but the code in
> question has not changed in mainline but we should probably retest on
> something newer just to clear up any doubts.
>
> The work load is generated using fs_mark
> (http://sourceforge.net/projects/fsmark/) which is basically a write
> workload with small files, each file gets fsync'ed before close. The
> metric is "files/sec".
>
> The clearest result used a ramdisk to store 4k files.
>
> We modified ext3 and jbd to accept a new mount option: bdelay Use it like:
>
> mount -o bdelay=n dev mountpoint
>
> n is passed to schedule_timeout_interruptible() in the jbd code. if n ==
> 0, it skips the whole loop. if n is "yield", then substitute the
> schedule...(n) with yield().
>
> Note that the first row is the value of the delay with a 250HZ build
> followed by the number of concurrent threads writing 4KB files.
>
> Ramdisk test:
>
> bdelay 1 2 4 8 10 20
> 0 4640 4498 3226 1721 1436 664
> yield 4640 4078 2977 1611 1136 551
> 1 4647 250 482 588 629 483
> 2 4522 149 233 422 450 389
> 3 4504 86 165 271 308 334
> 4 4425 84 128 222 253 293
>
> Midrange clariion:
>
> bdelay 1 2 4 8 10 20
> 0 778 923 1567 1424 1276 785
> yield 791 931 1551 1473 1328 806
> 1 793 304 499 714 751 760
> 2 789 132 201 382 441 589
> 3 792 124 168 298 342 471
> 4 786 71 116 237 277 393
>
> Local disk:
>
> bdelay 1 2 4 8 10 20
> 0 47 51 81 135 160 234
> yield 36 45 74 117 138 214
> 1 44 52 86 148 183 258
> 2 40 60 109 163 184 265
> 3 40 52 97 148 171 264
> 4 35 42 83 149 169 246
>
> Apologies for mangling the nicely formatted tables.
>
> Note that the justification for the batching as we have it today is
> basically this last local drive test case.
>
> It would be really interesting to rerun some of these tests on xfs which
> Dave explained in the thread last week has a more self tuning way to
> batch up transactions....
>
> Note that all of those poor users who have a synchronous write workload
> today are in the "1" row for each of the above tables.
Mind giving this a whirl? The fastest thing I've got here is an Apple X RAID
and it's being used for something else atm, so I've only tested this on local
disk to make sure it didn't make local performance suck (which it doesn't, btw).
This should be equivalent to what David says XFS does. Thanks much,
Josef
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index c6cbb6c..4596e1c 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -1333,8 +1333,7 @@ int journal_stop(handle_t *handle)
 {
         transaction_t *transaction = handle->h_transaction;
         journal_t *journal = transaction->t_journal;
-        int old_handle_count, err;
-        pid_t pid;
+        int err;
 
         J_ASSERT(journal_current_handle() == handle);
 
@@ -1353,32 +1352,22 @@ int journal_stop(handle_t *handle)
 
         jbd_debug(4, "Handle %p going down\n", handle);
 
-        /*
-         * Implement synchronous transaction batching. If the handle
-         * was synchronous, don't force a commit immediately. Let's
-         * yield and let another thread piggyback onto this transaction.
-         * Keep doing that while new threads continue to arrive.
-         * It doesn't cost much - we're about to run a commit and sleep
-         * on IO anyway. Speeds up many-threaded, many-dir operations
-         * by 30x or more...
-         *
-         * But don't do this if this process was the most recent one to
-         * perform a synchronous write. We do this to detect the case where a
-         * single process is doing a stream of sync writes. No point in waiting
-         * for joiners in that case.
-         */
-        pid = current->pid;
-        if (handle->h_sync && journal->j_last_sync_writer != pid) {
-                journal->j_last_sync_writer = pid;
-                do {
-                        old_handle_count = transaction->t_handle_count;
-                        schedule_timeout_uninterruptible(1);
-                } while (old_handle_count != transaction->t_handle_count);
-        }
-
         current->journal_info = NULL;
         spin_lock(&journal->j_state_lock);
         spin_lock(&transaction->t_handle_lock);
+
+        if (journal->j_committing_transaction && handle->h_sync) {
+                tid_t tid = journal->j_committing_transaction->t_tid;
+
+                spin_unlock(&transaction->t_handle_lock);
+                spin_unlock(&journal->j_state_lock);
+
+                err = log_wait_commit(journal, tid);
+
+                spin_lock(&journal->j_state_lock);
+                spin_lock(&transaction->t_handle_lock);
+        }
+
         transaction->t_outstanding_credits -= handle->h_buffer_credits;
         transaction->t_updates--;
         if (!transaction->t_updates) {
* Re: some hard numbers on ext3 & batching performance issue
From: David Chinner @ 2008-03-06 0:28 UTC
To: Ric Wheeler
Cc: David Chinner, Josef Bacik, Theodore Ts'o, adilger, jack,
Feld, Andy, linux-fsdevel, linux-scsi
On Wed, Mar 05, 2008 at 02:19:48PM -0500, Ric Wheeler wrote:
> The work load is generated using fs_mark
> (http://sourceforge.net/projects/fsmark/) which is basically a write
> workload with small files, each file gets fsync'ed before close. The
> metric is "files/sec".
.......
> It would be really interesting to rerun some of these tests on xfs which
> Dave explained in the thread last week has a more self tuning way to
> batch up transactions....
OK, so XFS numbers. Note these are all on a CONFIG_XFS_DEBUG=y kernel, so
there are lots of extra checks in the code compared to a normal production
kernel.
Local disk (15krpm SCSI, WCD, CONFIG_XFS_DEBUG=y):
threads  files/s
      1       97
      2      117
      4      109
      8      110
     10      113
     20      116
Local disk (15krpm SCSI, WCE, nobarrier, CONFIG_XFS_DEBUG=y):
threads  files/s
      1      203
      2      216
      4      243
      8      332
     10      405
     20      424
Ramdisk (nobarrier, CONFIG_XFS_DEBUG=y):
          agcount=4   agcount=16
threads     files/s      files/s
      1        1298         1298
      2        2073         2394
      4        3296         3321
      8        3464         4199
     10        3394         3937
     20        3251         3691
Note the difference the amount of parallel allocation in the
filesystem makes - agcount=4 only allows up to 4 parallel allocations
at once, so even if they are all aggregated into the one log I/O,
no further allocation can take place until that log I/O is complete.
And at about 4000 files/s the system (4p ia64) is becoming CPU bound
due to all the debug checks in XFS.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: some hard numbers on ext3 & batching performance issue
From: Ric Wheeler @ 2008-03-07 20:08 UTC
To: Josef Bacik
Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy,
linux-fsdevel, linux-scsi
Josef Bacik wrote:
> On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
>> After the IO/FS workshop last week, I posted some details on the slow
>> down we see with ext3 when we have a low latency back end instead of a
>> normal local disk (SCSI/S-ATA/etc).
...
...
...
>> It would be really interesting to rerun some of these tests on xfs which
>> Dave explained in the thread last week has a more self tuning way to
>> batch up transactions....
>>
>> Note that all of those poor users who have a synchronous write workload
>> today are in the "1" row for each of the above tables.
>
> Mind giving this a whirl? The fastest thing I've got here is an Apple X RAID
> and its being used for something else atm, so I've only tested this on local
> disk to make sure it didn't make local performance suck (which it doesn't btw).
> This should be equivalent with what David says XFS does. Thanks much,
>
> Josef
>
> diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
> index c6cbb6c..4596e1c 100644
> --- a/fs/jbd/transaction.c
> +++ b/fs/jbd/transaction.c
> @@ -1333,8 +1333,7 @@ int journal_stop(handle_t *handle)
> {
> transaction_t *transaction = handle->h_transaction;
> journal_t *journal = transaction->t_journal;
> - int old_handle_count, err;
> - pid_t pid;
> + int err;
>
> J_ASSERT(journal_current_handle() == handle);
>
> @@ -1353,32 +1352,22 @@ int journal_stop(handle_t *handle)
>
> jbd_debug(4, "Handle %p going down\n", handle);
>
> - /*
> - * Implement synchronous transaction batching. If the handle
> - * was synchronous, don't force a commit immediately. Let's
> - * yield and let another thread piggyback onto this transaction.
> - * Keep doing that while new threads continue to arrive.
> - * It doesn't cost much - we're about to run a commit and sleep
> - * on IO anyway. Speeds up many-threaded, many-dir operations
> - * by 30x or more...
> - *
> - * But don't do this if this process was the most recent one to
> - * perform a synchronous write. We do this to detect the case where a
> - * single process is doing a stream of sync writes. No point in waiting
> - * for joiners in that case.
> - */
> - pid = current->pid;
> - if (handle->h_sync && journal->j_last_sync_writer != pid) {
> - journal->j_last_sync_writer = pid;
> - do {
> - old_handle_count = transaction->t_handle_count;
> - schedule_timeout_uninterruptible(1);
> - } while (old_handle_count != transaction->t_handle_count);
> - }
> -
> current->journal_info = NULL;
> spin_lock(&journal->j_state_lock);
> spin_lock(&transaction->t_handle_lock);
> +
> + if (journal->j_committing_transaction && handle->h_sync) {
> + tid_t tid = journal->j_committing_transaction->t_tid;
> +
> + spin_unlock(&transaction->t_handle_lock);
> + spin_unlock(&journal->j_state_lock);
> +
> + err = log_wait_commit(journal, tid);
> +
> + spin_lock(&journal->j_state_lock);
> + spin_lock(&transaction->t_handle_lock);
> + }
> +
> transaction->t_outstanding_credits -= handle->h_buffer_credits;
> transaction->t_updates--;
> if (!transaction->t_updates) {
>
>
>
Running with Josef's patch, I was able to see a clear improvement from
batching these synchronous operations on ext3 with the RAM disk and
array. It is not too often that you get to make a simple change and see a
27x improvement ;-)
On the bad side, the local disk case took as much as a 30% drop in
performance. The specific disk is not one that I have a lot of
experience with, so I would like to retry on a disk that has been qualified
by our group (i.e., one where we have reasonable confidence that there are no
firmware issues, etc.).
Now for the actual results.
The results are the average value of 5 runs for each number of threads.
Type        Threads  Baseline    Josef   Speedup (Josef/Baseline)
array             1     320.5    325.4    1.01
array             2     174.9    351.9    2.01
array             4     382.7    593.5    1.55
array             8     644.1    963.0    1.49
array            10     842.9   1038.7    1.23
array            20    1319.6   1432.3    1.08

RAM disk          1    5621.4   5595.1    0.99
RAM disk          2     281.5   7613.3   27.04
RAM disk          4     579.9   9111.5   15.71
RAM disk          8     891.1   9357.3   10.50
RAM disk         10    1116.3   9873.6    8.84
RAM disk         20    1952.0  10703.6    5.48

S-ATA disk        1      19.0     15.1    0.79
S-ATA disk        2      19.9     14.4    0.72
S-ATA disk        4      41.0     27.9    0.68
S-ATA disk        8      60.4     43.2    0.71
S-ATA disk       10      67.1     48.7    0.72
S-ATA disk       20     102.7     74.0    0.72
Background on the tests:
All of this is measured on three devices - a relatively old & slow
array, the local (slow!) 2.5" S-ATA disk in the box and a RAM disk.
These numbers were generated using fs_mark to write 4096-byte files with the
following commands:
fs_mark -d /home/test/t -s 4096 -n 40000 -N 50 -D 64 -t 1
...
fs_mark -d /home/test/t -s 4096 -n 20000 -N 50 -D 64 -t 2
...
fs_mark -d /home/test/t -s 4096 -n 10000 -N 50 -D 64 -t 4
...
fs_mark -d /home/test/t -s 4096 -n 5000 -N 50 -D 64 -t 8
...
fs_mark -d /home/test/t -s 4096 -n 4000 -N 50 -D 64 -t 10
...
fs_mark -d /home/test/t -s 4096 -n 2000 -N 50 -D 64 -t 20
...
Note that this spreads the files across 64 subdirectories; each thread
writes 50 files to one directory and then moves on to the next in a round robin.
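For anyone without fs_mark handy, a minimal single-threaded sketch of the same
per-file pattern (open, write 4KB, fsync, close, spread round-robin over
subdirectories) looks roughly like this; the paths and counts are illustrative
only, and fs_mark itself runs -t writer threads in parallel:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        const int dirs = 64, per_dir = 50, total = 4000;
        char path[256], buf[4096];
        struct timespec t0, t1;
        double secs;
        int i;

        memset(buf, 'a', sizeof(buf));
        for (i = 0; i < dirs; i++) {
                snprintf(path, sizeof(path), "dir%02d", i);
                mkdir(path, 0755);              /* ignore EEXIST */
        }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < total; i++) {
                /* 50 files in one directory, then move to the next */
                int dir = (i / per_dir) % dirs;
                int fd;

                snprintf(path, sizeof(path), "dir%02d/f%06d", dir, i);
                fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
                if (fd < 0 || write(fd, buf, sizeof(buf)) != sizeof(buf))
                        exit(1);
                if (fsync(fd) < 0)              /* the call that forces a journal commit */
                        exit(1);
                close(fd);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d files in %.2f s = %.1f files/sec\n", total, secs, total / secs);
        return 0;
}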
ric
* Re: some hard numbers on ext3 & batching performance issue
From: Josef Bacik @ 2008-03-07 20:40 UTC
To: ric
Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy,
linux-fsdevel, linux-scsi
On Friday 07 March 2008 3:08:32 pm Ric Wheeler wrote:
> Josef Bacik wrote:
> > On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
> >> After the IO/FS workshop last week, I posted some details on the slow
> >> down we see with ext3 when we have a low latency back end instead of a
> >> normal local disk (SCSI/S-ATA/etc).
>
> ...
> ...
> ...
>
> >> It would be really interesting to rerun some of these tests on xfs which
> >> Dave explained in the thread last week has a more self tuning way to
> >> batch up transactions....
> >>
> >> Note that all of those poor users who have a synchronous write workload
> >> today are in the "1" row for each of the above tables.
> >
> > Mind giving this a whirl? The fastest thing I've got here is an Apple X
> > RAID and its being used for something else atm, so I've only tested this
> > on local disk to make sure it didn't make local performance suck (which
> > it doesn't btw). This should be equivalent with what David says XFS does.
> > Thanks much,
> >
> > Josef
> >
> > diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
> > index c6cbb6c..4596e1c 100644
> > --- a/fs/jbd/transaction.c
> > +++ b/fs/jbd/transaction.c
> > @@ -1333,8 +1333,7 @@ int journal_stop(handle_t *handle)
> > {
> > transaction_t *transaction = handle->h_transaction;
> > journal_t *journal = transaction->t_journal;
> > - int old_handle_count, err;
> > - pid_t pid;
> > + int err;
> >
> > J_ASSERT(journal_current_handle() == handle);
> >
> > @@ -1353,32 +1352,22 @@ int journal_stop(handle_t *handle)
> >
> > jbd_debug(4, "Handle %p going down\n", handle);
> >
> > - /*
> > - * Implement synchronous transaction batching. If the handle
> > - * was synchronous, don't force a commit immediately. Let's
> > - * yield and let another thread piggyback onto this transaction.
> > - * Keep doing that while new threads continue to arrive.
> > - * It doesn't cost much - we're about to run a commit and sleep
> > - * on IO anyway. Speeds up many-threaded, many-dir operations
> > - * by 30x or more...
> > - *
> > - * But don't do this if this process was the most recent one to
> > - * perform a synchronous write. We do this to detect the case where a
> > - * single process is doing a stream of sync writes. No point in
> > waiting - * for joiners in that case.
> > - */
> > - pid = current->pid;
> > - if (handle->h_sync && journal->j_last_sync_writer != pid) {
> > - journal->j_last_sync_writer = pid;
> > - do {
> > - old_handle_count = transaction->t_handle_count;
> > - schedule_timeout_uninterruptible(1);
> > - } while (old_handle_count != transaction->t_handle_count);
> > - }
> > -
> > current->journal_info = NULL;
> > spin_lock(&journal->j_state_lock);
> > spin_lock(&transaction->t_handle_lock);
> > +
> > + if (journal->j_committing_transaction && handle->h_sync) {
> > + tid_t tid = journal->j_committing_transaction->t_tid;
> > +
> > + spin_unlock(&transaction->t_handle_lock);
> > + spin_unlock(&journal->j_state_lock);
> > +
> > + err = log_wait_commit(journal, tid);
> > +
> > + spin_lock(&journal->j_state_lock);
> > + spin_lock(&transaction->t_handle_lock);
> > + }
> > +
> > transaction->t_outstanding_credits -= handle->h_buffer_credits;
> > transaction->t_updates--;
> > if (!transaction->t_updates) {
>
> Running with Josef's patch, I was able to see a clear improvement for
> batching these synchronous operations on ext3 with the RAM disk and
> array. It is not too often that you get to do a simple change and see a
> 27 times improvement ;-)
>
> On the bad side, the local disk case took as much as a 30% drop in
> performance. The specific disk is not one that I have a lot of
> experience with, I would like to retry on a disk that has been qualified
> by our group (i.e., we have reasonable confidence that there are no
> firmware issues, etc).
>
> Now for the actual results.
>
> The results are the average value of 5 runs for each number of threads.
>
> Type Threads Baseline Josef Speedup (Josef/Baseline)
> array 1 320.5 325.4 1.01
> array 2 174.9 351.9 2.01
> array 4 382.7 593.5 1.55
> array 8 644.1 963.0 1.49
> array 10 842.9 1038.7 1.23
> array 20 1319.6 1432.3 1.08
>
> RAM disk 1 5621.4 5595.1 0.99
> RAM disk 2 281.5 7613.3 27.04
> RAM disk 4 579.9 9111.5 15.71
> RAM disk 8 891.1 9357.3 10.50
> RAM disk 10 1116.3 9873.6 8.84
> RAM disk 20 1952.0 10703.6 5.48
>
> S-ATA disk 1 19.0 15.1 0.79
> S-ATA disk 2 19.9 14.4 0.72
> S-ATA disk 4 41.0 27.9 0.68
> S-ATA disk 8 60.4 43.2 0.71
> S-ATA disk 10 67.1 48.7 0.72
> S-ATA disk 20 102.7 74.0 0.72
>
> Background on the tests:
>
> All of this is measured on three devices - a relatively old & slow
> array, the local (slow!) 2.5" S-ATA disk in the box and a RAM disk.
>
> These numbers are used fs_mark to write 4096 byte files with the
> following commands:
>
> fs_mark -d /home/test/t -s 4096 -n 40000 -N 50 -D 64 -t 1
> ...
> fs_mark -d /home/test/t -s 4096 -n 20000 -N 50 -D 64 -t 2
> ...
> fs_mark -d /home/test/t -s 4096 -n 10000 -N 50 -D 64 -t 4
> ...
> fs_mark -d /home/test/t -s 4096 -n 5000 -N 50 -D 64 -t 8
> ...
> fs_mark -d /home/test/t -s 4096 -n 4000 -N 50 -D 64 -t 10
> ...
> fs_mark -d /home/test/t -s 4096 -n 2000 -N 50 -D 64 -t 20
> ...
>
> Note that this spreads the files across 64 subdirectories, each thread
> writes 50 files and then moves on to the next in a round robin.
>
I'm starting to wonder about the disks I have, because my files/second is
spanking yours, and it's just a local Samsung 3Gb/s SATA drive. With those
commands I'm consistently getting over 700 files/sec. I'm seeing about a 1-5%
increase in speed locally with my patch. I guess I'll start looking around for
some other hardware and check there in case this box is more badass than I
think it is. Thanks much,
Josef
* Re: some hard numbers on ext3 & batching performance issue
From: Ric Wheeler @ 2008-03-07 20:45 UTC
To: Josef Bacik
Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy,
linux-fsdevel, linux-scsi
Josef Bacik wrote:
> On Friday 07 March 2008 3:08:32 pm Ric Wheeler wrote:
>> Josef Bacik wrote:
>>> On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
>>>> After the IO/FS workshop last week, I posted some details on the slow
>>>> down we see with ext3 when we have a low latency back end instead of a
>>>> normal local disk (SCSI/S-ATA/etc).
>> ...
>> ...
>> ...
>>
>>>> It would be really interesting to rerun some of these tests on xfs which
>>>> Dave explained in the thread last week has a more self tuning way to
>>>> batch up transactions....
>>>>
>>>> Note that all of those poor users who have a synchronous write workload
>>>> today are in the "1" row for each of the above tables.
>>> Mind giving this a whirl? The fastest thing I've got here is an Apple X
>>> RAID and its being used for something else atm, so I've only tested this
>>> on local disk to make sure it didn't make local performance suck (which
>>> it doesn't btw). This should be equivalent with what David says XFS does.
>>> Thanks much,
>>>
>>> Josef
>>>
>>> diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
>>> index c6cbb6c..4596e1c 100644
>>> --- a/fs/jbd/transaction.c
>>> +++ b/fs/jbd/transaction.c
>>> @@ -1333,8 +1333,7 @@ int journal_stop(handle_t *handle)
>>> {
>>> transaction_t *transaction = handle->h_transaction;
>>> journal_t *journal = transaction->t_journal;
>>> - int old_handle_count, err;
>>> - pid_t pid;
>>> + int err;
>>>
>>> J_ASSERT(journal_current_handle() == handle);
>>>
>>> @@ -1353,32 +1352,22 @@ int journal_stop(handle_t *handle)
>>>
>>> jbd_debug(4, "Handle %p going down\n", handle);
>>>
>>> - /*
>>> - * Implement synchronous transaction batching. If the handle
>>> - * was synchronous, don't force a commit immediately. Let's
>>> - * yield and let another thread piggyback onto this transaction.
>>> - * Keep doing that while new threads continue to arrive.
>>> - * It doesn't cost much - we're about to run a commit and sleep
>>> - * on IO anyway. Speeds up many-threaded, many-dir operations
>>> - * by 30x or more...
>>> - *
>>> - * But don't do this if this process was the most recent one to
>>> - * perform a synchronous write. We do this to detect the case where a
>>> - * single process is doing a stream of sync writes. No point in
>>> waiting - * for joiners in that case.
>>> - */
>>> - pid = current->pid;
>>> - if (handle->h_sync && journal->j_last_sync_writer != pid) {
>>> - journal->j_last_sync_writer = pid;
>>> - do {
>>> - old_handle_count = transaction->t_handle_count;
>>> - schedule_timeout_uninterruptible(1);
>>> - } while (old_handle_count != transaction->t_handle_count);
>>> - }
>>> -
>>> current->journal_info = NULL;
>>> spin_lock(&journal->j_state_lock);
>>> spin_lock(&transaction->t_handle_lock);
>>> +
>>> + if (journal->j_committing_transaction && handle->h_sync) {
>>> + tid_t tid = journal->j_committing_transaction->t_tid;
>>> +
>>> + spin_unlock(&transaction->t_handle_lock);
>>> + spin_unlock(&journal->j_state_lock);
>>> +
>>> + err = log_wait_commit(journal, tid);
>>> +
>>> + spin_lock(&journal->j_state_lock);
>>> + spin_lock(&transaction->t_handle_lock);
>>> + }
>>> +
>>> transaction->t_outstanding_credits -= handle->h_buffer_credits;
>>> transaction->t_updates--;
>>> if (!transaction->t_updates) {
>> Running with Josef's patch, I was able to see a clear improvement for
>> batching these synchronous operations on ext3 with the RAM disk and
>> array. It is not too often that you get to do a simple change and see a
>> 27 times improvement ;-)
>>
>> On the bad side, the local disk case took as much as a 30% drop in
>> performance. The specific disk is not one that I have a lot of
>> experience with, I would like to retry on a disk that has been qualified
>> by our group (i.e., we have reasonable confidence that there are no
>> firmware issues, etc).
>>
>> Now for the actual results.
>>
>> The results are the average value of 5 runs for each number of threads.
>>
>> Type Threads Baseline Josef Speedup (Josef/Baseline)
>> array 1 320.5 325.4 1.01
>> array 2 174.9 351.9 2.01
>> array 4 382.7 593.5 1.55
>> array 8 644.1 963.0 1.49
>> array 10 842.9 1038.7 1.23
>> array 20 1319.6 1432.3 1.08
>>
>> RAM disk 1 5621.4 5595.1 0.99
>> RAM disk 2 281.5 7613.3 27.04
>> RAM disk 4 579.9 9111.5 15.71
>> RAM disk 8 891.1 9357.3 10.50
>> RAM disk 10 1116.3 9873.6 8.84
>> RAM disk 20 1952.0 10703.6 5.48
>>
>> S-ATA disk 1 19.0 15.1 0.79
>> S-ATA disk 2 19.9 14.4 0.72
>> S-ATA disk 4 41.0 27.9 0.68
>> S-ATA disk 8 60.4 43.2 0.71
>> S-ATA disk 10 67.1 48.7 0.72
>> S-ATA disk 20 102.7 74.0 0.72
>>
>> Background on the tests:
>>
>> All of this is measured on three devices - a relatively old & slow
>> array, the local (slow!) 2.5" S-ATA disk in the box and a RAM disk.
>>
>> These numbers are used fs_mark to write 4096 byte files with the
>> following commands:
>>
>> fs_mark -d /home/test/t -s 4096 -n 40000 -N 50 -D 64 -t 1
>> ...
>> fs_mark -d /home/test/t -s 4096 -n 20000 -N 50 -D 64 -t 2
>> ...
>> fs_mark -d /home/test/t -s 4096 -n 10000 -N 50 -D 64 -t 4
>> ...
>> fs_mark -d /home/test/t -s 4096 -n 5000 -N 50 -D 64 -t 8
>> ...
>> fs_mark -d /home/test/t -s 4096 -n 4000 -N 50 -D 64 -t 10
>> ...
>> fs_mark -d /home/test/t -s 4096 -n 2000 -N 50 -D 64 -t 20
>> ...
>>
>> Note that this spreads the files across 64 subdirectories, each thread
>> writes 50 files and then moves on to the next in a round robin.
>>
>
> I'm starting to wonder about the disks I have, because my files/second is
> spanking yours, and its just a local samsung 3gb/s sata drive. With those
> commands I'm consistently getting over 700 files/sec. I'm seeing about a 1-5%
> increase in speed locally with my patch. I guess I'll start looking around for
> some other hardware and check on there in case this box is more badass than I
> think it is. Thanks much,
>
> Josef
>
Sounds like you might be running with the write cache on & barriers off ;-)
Make sure you have the write cache & barriers enabled on the drive. With a
good S-ATA drive, you should be seeing about 35-50 files/sec with a
single-threaded writer.
The local disk that I tested on is a relatively slow S-ATA disk that is
more laptop quality/performance than server class.
One thought I had about the results is that we might be flipping the IO
sequence in the local disk case. It is the only device of the three
that I tested which is sensitive to seek/head movement for small files.
ric
* Re: some hard numbers on ext3 & batching performance issue
From: Josef Bacik @ 2008-03-12 18:37 UTC
To: Ric Wheeler
Cc: Josef Bacik, David Chinner, Theodore Ts'o, adilger, jack,
Feld, Andy, linux-fsdevel, linux-scsi
On Fri, Mar 07, 2008 at 03:45:58PM -0500, Ric Wheeler wrote:
> Josef Bacik wrote:
>> On Friday 07 March 2008 3:08:32 pm Ric Wheeler wrote:
>>> Josef Bacik wrote:
>>>> On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
>>>>> After the IO/FS workshop last week, I posted some details on the slow
>>>>> down we see with ext3 when we have a low latency back end instead of a
>>>>> normal local disk (SCSI/S-ATA/etc).
>>> ...
>>> ...
>>>
>>> Note that this spreads the files across 64 subdirectories, each thread
>>> writes 50 files and then moves on to the next in a round robin.
>>>
>>
>> I'm starting to wonder about the disks I have, because my files/second is
>> spanking yours, and its just a local samsung 3gb/s sata drive. With those
>> commands I'm consistently getting over 700 files/sec. I'm seeing about a
>> 1-5% increase in speed locally with my patch. I guess I'll start looking
>> around for some other hardware and check on there in case this box is more
>> badass than I think it is. Thanks much,
>>
>> Josef
>>
>
> Sounds like you might be running with write cache on & barriers off ;-)
>
> Make sure you have write cache & barriers enabled on the drive. With a good
> S-ATA drive, you should be seeing about 35-50 files/sec with a single
> threaded writer.
>
> The local disk that I tested on is a relatively slow s-ata disk that is
> more laptop quality/performance than server.
>
> One thought I had about the results is that we might be flipping the IO
> sequence with the local disk case. It is the only device of the three that
> I tested which is seek/head movement sensitive for small files.
>
Ahh yes, with the write cache off and barriers on I get your numbers; however,
I'm not seeing the slowdown that you are - with and without my patch I see
the same performance. It's just a plain-jane Intel SATA controller with a
Samsung SATA disk set at 1.5Gb/s. Same thing with an NVIDIA SATA controller.
I'll think about this some more and see if there is something better that could
be done that may help you. Thanks much,
Josef
* Re: some hard numbers on ext3 & batching performance issue
From: Ric Wheeler @ 2008-03-13 11:26 UTC
To: Josef Bacik
Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy,
linux-fsdevel, linux-scsi
Josef Bacik wrote:
> On Fri, Mar 07, 2008 at 03:45:58PM -0500, Ric Wheeler wrote:
>> Josef Bacik wrote:
>>> On Friday 07 March 2008 3:08:32 pm Ric Wheeler wrote:
>>>> Josef Bacik wrote:
>>>>> On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
>>>>>> After the IO/FS workshop last week, I posted some details on the slow
>>>>>> down we see with ext3 when we have a low latency back end instead of a
>>>>>> normal local disk (SCSI/S-ATA/etc).
>>>> ...
>>>> ...
>>>>
>>>> Note that this spreads the files across 64 subdirectories, each thread
>>>> writes 50 files and then moves on to the next in a round robin.
>>>>
>>> I'm starting to wonder about the disks I have, because my files/second is
>>> spanking yours, and its just a local samsung 3gb/s sata drive. With those
>>> commands I'm consistently getting over 700 files/sec. I'm seeing about a
>>> 1-5% increase in speed locally with my patch. I guess I'll start looking
>>> around for some other hardware and check on there in case this box is more
>>> badass than I think it is. Thanks much,
>>>
>>> Josef
>>>
>> Sounds like you might be running with write cache on & barriers off ;-)
>>
>> Make sure you have write cache & barriers enabled on the drive. With a good
>> S-ATA drive, you should be seeing about 35-50 files/sec with a single
>> threaded writer.
>>
>> The local disk that I tested on is a relatively slow s-ata disk that is
>> more laptop quality/performance than server.
>>
>> One thought I had about the results is that we might be flipping the IO
>> sequence with the local disk case. It is the only device of the three that
>> I tested which is seek/head movement sensitive for small files.
>>
>
> Ahh yes turning off write cache off and barriers on I get your numbers, however
> I'm not seeing the slowdown that you are, with and without my patch I'm seeing
> the same performance. Its just a plane jane intel sata controller with a
> samsung sata disk set at 1.5gbps. Same thing with an nvidia sata controller.
> I'll think about this some more and see if there is something better that could
> be done that may help you. Thanks much,
>
> Josef
>
Thanks - you should see the numbers with the write cache enabled and
barriers on as well, but for small files, write cache disabled is quite
close ;-)
I am happy to rerun the tests at any point; I have a variety of disk
types and controllers (lots of Intel AHCI boxes) to use.
ric