* some hard numbers on ext3 & batching performance issue
From: Ric Wheeler @ 2008-03-05 19:19 UTC
To: David Chinner
Cc: Josef Bacik, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi

After the IO/FS workshop last week, I posted some details on the slowdown
we see with ext3 when we have a low-latency back end instead of a normal
local disk (SCSI/S-ATA/etc).

As a follow-up to that thread, I wanted to post some real numbers that
Andy from our performance team pulled together. Andy tested various
patches using three classes of storage (S-ATA, RAM disk and Clariion
array).

Note that this testing was done on a SLES10/SP1 kernel; the code in
question has not changed in mainline, but we should probably retest on
something newer just to clear up any doubts.

The workload is generated using fs_mark
(http://sourceforge.net/projects/fsmark/), which is basically a write
workload with small files where each file gets fsync'ed before close.
The metric is "files/sec".

The clearest result used a ramdisk to store 4k files.

We modified ext3 and jbd to accept a new mount option, bdelay. Use it like:

        mount -o bdelay=n dev mountpoint

n is passed to schedule_timeout_interruptible() in the jbd code. If n == 0,
the whole batching loop is skipped. If n is "yield", the
schedule_timeout_interruptible(n) call is replaced with yield().

Note that the first column is the value of the delay (in jiffies, on a
250HZ build) and the remaining column headings are the number of
concurrent threads writing 4KB files.

Ramdisk test:

bdelay      1     2     4     8    10    20
0        4640  4498  3226  1721  1436   664
yield    4640  4078  2977  1611  1136   551
1        4647   250   482   588   629   483
2        4522   149   233   422   450   389
3        4504    86   165   271   308   334
4        4425    84   128   222   253   293

Midrange Clariion:

bdelay      1     2     4     8    10    20
0         778   923  1567  1424  1276   785
yield     791   931  1551  1473  1328   806
1         793   304   499   714   751   760
2         789   132   201   382   441   589
3         792   124   168   298   342   471
4         786    71   116   237   277   393

Local disk:

bdelay      1     2     4     8    10    20
0          47    51    81   135   160   234
yield      36    45    74   117   138   214
1          44    52    86   148   183   258
2          40    60   109   163   184   265
3          40    52    97   148   171   264
4          35    42    83   149   169   246

Apologies for mangling the nicely formatted tables.

Note that the justification for the batching as we have it today is
basically this last local drive test case.

It would be really interesting to rerun some of these tests on xfs, which,
as Dave explained in the thread last week, has a more self-tuning way to
batch up transactions....

Note that all of those poor users who have a synchronous write workload
today are in the "1" row for each of the above tables.

ric
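(The bdelay patch itself is not included in this thread. Purely as a sketch
of what the knob presumably does, the experiment would wrap the stock
journal_stop() batching loop -- visible as the removed lines in Josef's
diff below -- roughly as follows. The j_bdelay and j_bdelay_yield fields
are hypothetical names for the mount-option plumbing, not taken from the
real patch.)

        /*
         * Hypothetical sketch only: batching wait in journal_stop(),
         * parameterised by the bdelay mount option.
         */
        pid = current->pid;
        if (handle->h_sync && journal->j_last_sync_writer != pid) {
                journal->j_last_sync_writer = pid;
                if (journal->j_bdelay_yield) {
                        /* bdelay=yield: give up the CPU between checks */
                        do {
                                old_handle_count = transaction->t_handle_count;
                                yield();
                        } while (old_handle_count != transaction->t_handle_count);
                } else if (journal->j_bdelay > 0) {
                        /* bdelay=n: sleep n jiffies (4ms each at 250HZ) per pass */
                        do {
                                old_handle_count = transaction->t_handle_count;
                                schedule_timeout_interruptible(journal->j_bdelay);
                        } while (old_handle_count != transaction->t_handle_count);
                }
                /* bdelay=0: skip the wait entirely and commit immediately */
        }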
* Re: some hard numbers on ext3 & batching performance issue
From: Josef Bacik @ 2008-03-05 20:20 UTC
To: ric
Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi

On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
> After the IO/FS workshop last week, I posted some details on the slowdown
> we see with ext3 when we have a low-latency back end instead of a normal
> local disk (SCSI/S-ATA/etc).
>
> As a follow-up to that thread, I wanted to post some real numbers that
> Andy from our performance team pulled together. Andy tested various
> patches using three classes of storage (S-ATA, RAM disk and Clariion
> array).
>
> Note that this testing was done on a SLES10/SP1 kernel; the code in
> question has not changed in mainline, but we should probably retest on
> something newer just to clear up any doubts.
>
> The workload is generated using fs_mark
> (http://sourceforge.net/projects/fsmark/), which is basically a write
> workload with small files where each file gets fsync'ed before close.
> The metric is "files/sec".
>
> The clearest result used a ramdisk to store 4k files.
>
> We modified ext3 and jbd to accept a new mount option, bdelay. Use it like:
>
>         mount -o bdelay=n dev mountpoint
>
> n is passed to schedule_timeout_interruptible() in the jbd code. If n == 0,
> the whole batching loop is skipped. If n is "yield", the
> schedule_timeout_interruptible(n) call is replaced with yield().
>
> Note that the first column is the value of the delay (in jiffies, on a
> 250HZ build) and the remaining column headings are the number of
> concurrent threads writing 4KB files.
>
> Ramdisk test:
>
> bdelay      1     2     4     8    10    20
> 0        4640  4498  3226  1721  1436   664
> yield    4640  4078  2977  1611  1136   551
> 1        4647   250   482   588   629   483
> 2        4522   149   233   422   450   389
> 3        4504    86   165   271   308   334
> 4        4425    84   128   222   253   293
>
> Midrange Clariion:
>
> bdelay      1     2     4     8    10    20
> 0         778   923  1567  1424  1276   785
> yield     791   931  1551  1473  1328   806
> 1         793   304   499   714   751   760
> 2         789   132   201   382   441   589
> 3         792   124   168   298   342   471
> 4         786    71   116   237   277   393
>
> Local disk:
>
> bdelay      1     2     4     8    10    20
> 0          47    51    81   135   160   234
> yield      36    45    74   117   138   214
> 1          44    52    86   148   183   258
> 2          40    60   109   163   184   265
> 3          40    52    97   148   171   264
> 4          35    42    83   149   169   246
>
> Apologies for mangling the nicely formatted tables.
>
> Note that the justification for the batching as we have it today is
> basically this last local drive test case.
>
> It would be really interesting to rerun some of these tests on xfs, which,
> as Dave explained in the thread last week, has a more self-tuning way to
> batch up transactions....
>
> Note that all of those poor users who have a synchronous write workload
> today are in the "1" row for each of the above tables.

Mind giving this a whirl? The fastest thing I've got here is an Apple X
RAID and it's being used for something else atm, so I've only tested this
on local disk to make sure it didn't make local performance suck (which it
doesn't, btw). This should be equivalent to what David says XFS does.

Thanks much,

Josef

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index c6cbb6c..4596e1c 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -1333,8 +1333,7 @@ int journal_stop(handle_t *handle)
 {
         transaction_t *transaction = handle->h_transaction;
         journal_t *journal = transaction->t_journal;
-        int old_handle_count, err;
-        pid_t pid;
+        int err;

         J_ASSERT(journal_current_handle() == handle);

@@ -1353,32 +1352,22 @@ int journal_stop(handle_t *handle)

         jbd_debug(4, "Handle %p going down\n", handle);

-        /*
-         * Implement synchronous transaction batching.  If the handle
-         * was synchronous, don't force a commit immediately.  Let's
-         * yield and let another thread piggyback onto this transaction.
-         * Keep doing that while new threads continue to arrive.
-         * It doesn't cost much - we're about to run a commit and sleep
-         * on IO anyway.  Speeds up many-threaded, many-dir operations
-         * by 30x or more...
-         *
-         * But don't do this if this process was the most recent one to
-         * perform a synchronous write.  We do this to detect the case where a
-         * single process is doing a stream of sync writes.  No point in waiting
-         * for joiners in that case.
-         */
-        pid = current->pid;
-        if (handle->h_sync && journal->j_last_sync_writer != pid) {
-                journal->j_last_sync_writer = pid;
-                do {
-                        old_handle_count = transaction->t_handle_count;
-                        schedule_timeout_uninterruptible(1);
-                } while (old_handle_count != transaction->t_handle_count);
-        }
-
         current->journal_info = NULL;
         spin_lock(&journal->j_state_lock);
         spin_lock(&transaction->t_handle_lock);
+
+        if (journal->j_committing_transaction && handle->h_sync) {
+                tid_t tid = journal->j_committing_transaction->t_tid;
+
+                spin_unlock(&transaction->t_handle_lock);
+                spin_unlock(&journal->j_state_lock);
+
+                err = log_wait_commit(journal, tid);
+
+                spin_lock(&journal->j_state_lock);
+                spin_lock(&transaction->t_handle_lock);
+        }
+
         transaction->t_outstanding_credits -= handle->h_buffer_credits;
         transaction->t_updates--;
         if (!transaction->t_updates) {
* Re: some hard numbers on ext3 & batching performance issue
From: Ric Wheeler @ 2008-03-07 20:08 UTC
To: Josef Bacik
Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi

Josef Bacik wrote:
> On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
>> After the IO/FS workshop last week, I posted some details on the slowdown
>> we see with ext3 when we have a low-latency back end instead of a normal
>> local disk (SCSI/S-ATA/etc).
...
...
...
>> It would be really interesting to rerun some of these tests on xfs, which,
>> as Dave explained in the thread last week, has a more self-tuning way to
>> batch up transactions....
>>
>> Note that all of those poor users who have a synchronous write workload
>> today are in the "1" row for each of the above tables.
>
> Mind giving this a whirl? The fastest thing I've got here is an Apple X
> RAID and it's being used for something else atm, so I've only tested this
> on local disk to make sure it didn't make local performance suck (which it
> doesn't, btw). This should be equivalent to what David says XFS does.
>
> Thanks much,
>
> Josef
>
> diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
> index c6cbb6c..4596e1c 100644
> --- a/fs/jbd/transaction.c
> +++ b/fs/jbd/transaction.c
> @@ -1333,8 +1333,7 @@ int journal_stop(handle_t *handle)
>  {
>          transaction_t *transaction = handle->h_transaction;
>          journal_t *journal = transaction->t_journal;
> -        int old_handle_count, err;
> -        pid_t pid;
> +        int err;
>
>          J_ASSERT(journal_current_handle() == handle);
>
> @@ -1353,32 +1352,22 @@ int journal_stop(handle_t *handle)
>
>          jbd_debug(4, "Handle %p going down\n", handle);
>
> -        /*
> -         * Implement synchronous transaction batching.  If the handle
> -         * was synchronous, don't force a commit immediately.  Let's
> -         * yield and let another thread piggyback onto this transaction.
> -         * Keep doing that while new threads continue to arrive.
> -         * It doesn't cost much - we're about to run a commit and sleep
> -         * on IO anyway.  Speeds up many-threaded, many-dir operations
> -         * by 30x or more...
> -         *
> -         * But don't do this if this process was the most recent one to
> -         * perform a synchronous write.  We do this to detect the case where a
> -         * single process is doing a stream of sync writes.  No point in waiting
> -         * for joiners in that case.
> -         */
> -        pid = current->pid;
> -        if (handle->h_sync && journal->j_last_sync_writer != pid) {
> -                journal->j_last_sync_writer = pid;
> -                do {
> -                        old_handle_count = transaction->t_handle_count;
> -                        schedule_timeout_uninterruptible(1);
> -                } while (old_handle_count != transaction->t_handle_count);
> -        }
> -
>          current->journal_info = NULL;
>          spin_lock(&journal->j_state_lock);
>          spin_lock(&transaction->t_handle_lock);
> +
> +        if (journal->j_committing_transaction && handle->h_sync) {
> +                tid_t tid = journal->j_committing_transaction->t_tid;
> +
> +                spin_unlock(&transaction->t_handle_lock);
> +                spin_unlock(&journal->j_state_lock);
> +
> +                err = log_wait_commit(journal, tid);
> +
> +                spin_lock(&journal->j_state_lock);
> +                spin_lock(&transaction->t_handle_lock);
> +        }
> +
>          transaction->t_outstanding_credits -= handle->h_buffer_credits;
>          transaction->t_updates--;
>          if (!transaction->t_updates) {

Running with Josef's patch, I was able to see a clear improvement for
batching these synchronous operations on ext3 with the RAM disk and array.
It is not too often that you get to do a simple change and see a 27 times
improvement ;-)

On the bad side, the local disk case took as much as a 30% drop in
performance. The specific disk is not one that I have a lot of experience
with; I would like to retry on a disk that has been qualified by our group
(i.e., we have reasonable confidence that there are no firmware issues, etc).

Now for the actual results.

The results are the average value of 5 runs for each number of threads.

Type        Threads  Baseline     Josef   Speedup (Josef/Baseline)
array           1       320.5     325.4    1.01
array           2       174.9     351.9    2.01
array           4       382.7     593.5    1.55
array           8       644.1     963.0    1.49
array          10       842.9    1038.7    1.23
array          20      1319.6    1432.3    1.08

RAM disk        1      5621.4    5595.1    0.99
RAM disk        2       281.5    7613.3   27.04
RAM disk        4       579.9    9111.5   15.71
RAM disk        8       891.1    9357.3   10.50
RAM disk       10      1116.3    9873.6    8.84
RAM disk       20      1952.0   10703.6    5.48

S-ATA disk      1        19.0      15.1    0.79
S-ATA disk      2        19.9      14.4    0.72
S-ATA disk      4        41.0      27.9    0.68
S-ATA disk      8        60.4      43.2    0.71
S-ATA disk     10        67.1      48.7    0.72
S-ATA disk     20       102.7      74.0    0.72

Background on the tests:

All of this is measured on three devices - a relatively old & slow array,
the local (slow!) 2.5" S-ATA disk in the box and a RAM disk.

These runs used fs_mark to write 4096 byte files with the following
commands:

fs_mark -d /home/test/t -s 4096 -n 40000 -N 50 -D 64 -t 1
...
fs_mark -d /home/test/t -s 4096 -n 20000 -N 50 -D 64 -t 2
...
fs_mark -d /home/test/t -s 4096 -n 10000 -N 50 -D 64 -t 4
...
fs_mark -d /home/test/t -s 4096 -n 5000 -N 50 -D 64 -t 8
...
fs_mark -d /home/test/t -s 4096 -n 4000 -N 50 -D 64 -t 10
...
fs_mark -d /home/test/t -s 4096 -n 2000 -N 50 -D 64 -t 20
...

Note that this spreads the files across 64 subdirectories; each thread
writes 50 files to one directory and then moves on to the next in a round
robin.

ric
* Re: some hard numbers on ext3 & batching performance issue
From: Josef Bacik @ 2008-03-07 20:40 UTC
To: ric
Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi

On Friday 07 March 2008 3:08:32 pm Ric Wheeler wrote:
> Josef Bacik wrote:
> > On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
> >> After the IO/FS workshop last week, I posted some details on the slowdown
> >> we see with ext3 when we have a low-latency back end instead of a normal
> >> local disk (SCSI/S-ATA/etc).
> ...
> ...
> ...
> >> It would be really interesting to rerun some of these tests on xfs, which,
> >> as Dave explained in the thread last week, has a more self-tuning way to
> >> batch up transactions....
> >>
> >> Note that all of those poor users who have a synchronous write workload
> >> today are in the "1" row for each of the above tables.
> >
> > Mind giving this a whirl? The fastest thing I've got here is an Apple X
> > RAID and it's being used for something else atm, so I've only tested this
> > on local disk to make sure it didn't make local performance suck (which
> > it doesn't, btw). This should be equivalent to what David says XFS does.
> >
> > Thanks much,
> >
> > Josef
> >
> > diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
> > index c6cbb6c..4596e1c 100644
> > --- a/fs/jbd/transaction.c
> > +++ b/fs/jbd/transaction.c
> > @@ -1333,8 +1333,7 @@ int journal_stop(handle_t *handle)
> >  {
> >          transaction_t *transaction = handle->h_transaction;
> >          journal_t *journal = transaction->t_journal;
> > -        int old_handle_count, err;
> > -        pid_t pid;
> > +        int err;
> >
> >          J_ASSERT(journal_current_handle() == handle);
> >
> > @@ -1353,32 +1352,22 @@ int journal_stop(handle_t *handle)
> >
> >          jbd_debug(4, "Handle %p going down\n", handle);
> >
> > -        /*
> > -         * Implement synchronous transaction batching.  If the handle
> > -         * was synchronous, don't force a commit immediately.  Let's
> > -         * yield and let another thread piggyback onto this transaction.
> > -         * Keep doing that while new threads continue to arrive.
> > -         * It doesn't cost much - we're about to run a commit and sleep
> > -         * on IO anyway.  Speeds up many-threaded, many-dir operations
> > -         * by 30x or more...
> > -         *
> > -         * But don't do this if this process was the most recent one to
> > -         * perform a synchronous write.  We do this to detect the case where a
> > -         * single process is doing a stream of sync writes.  No point in waiting
> > -         * for joiners in that case.
> > -         */
> > -        pid = current->pid;
> > -        if (handle->h_sync && journal->j_last_sync_writer != pid) {
> > -                journal->j_last_sync_writer = pid;
> > -                do {
> > -                        old_handle_count = transaction->t_handle_count;
> > -                        schedule_timeout_uninterruptible(1);
> > -                } while (old_handle_count != transaction->t_handle_count);
> > -        }
> > -
> >          current->journal_info = NULL;
> >          spin_lock(&journal->j_state_lock);
> >          spin_lock(&transaction->t_handle_lock);
> > +
> > +        if (journal->j_committing_transaction && handle->h_sync) {
> > +                tid_t tid = journal->j_committing_transaction->t_tid;
> > +
> > +                spin_unlock(&transaction->t_handle_lock);
> > +                spin_unlock(&journal->j_state_lock);
> > +
> > +                err = log_wait_commit(journal, tid);
> > +
> > +                spin_lock(&journal->j_state_lock);
> > +                spin_lock(&transaction->t_handle_lock);
> > +        }
> > +
> >          transaction->t_outstanding_credits -= handle->h_buffer_credits;
> >          transaction->t_updates--;
> >          if (!transaction->t_updates) {
>
> Running with Josef's patch, I was able to see a clear improvement for
> batching these synchronous operations on ext3 with the RAM disk and array.
> It is not too often that you get to do a simple change and see a 27 times
> improvement ;-)
>
> On the bad side, the local disk case took as much as a 30% drop in
> performance. The specific disk is not one that I have a lot of experience
> with; I would like to retry on a disk that has been qualified by our group
> (i.e., we have reasonable confidence that there are no firmware issues, etc).
>
> Now for the actual results.
>
> The results are the average value of 5 runs for each number of threads.
>
> Type        Threads  Baseline     Josef   Speedup (Josef/Baseline)
> array           1       320.5     325.4    1.01
> array           2       174.9     351.9    2.01
> array           4       382.7     593.5    1.55
> array           8       644.1     963.0    1.49
> array          10       842.9    1038.7    1.23
> array          20      1319.6    1432.3    1.08
>
> RAM disk        1      5621.4    5595.1    0.99
> RAM disk        2       281.5    7613.3   27.04
> RAM disk        4       579.9    9111.5   15.71
> RAM disk        8       891.1    9357.3   10.50
> RAM disk       10      1116.3    9873.6    8.84
> RAM disk       20      1952.0   10703.6    5.48
>
> S-ATA disk      1        19.0      15.1    0.79
> S-ATA disk      2        19.9      14.4    0.72
> S-ATA disk      4        41.0      27.9    0.68
> S-ATA disk      8        60.4      43.2    0.71
> S-ATA disk     10        67.1      48.7    0.72
> S-ATA disk     20       102.7      74.0    0.72
>
> Background on the tests:
>
> All of this is measured on three devices - a relatively old & slow array,
> the local (slow!) 2.5" S-ATA disk in the box and a RAM disk.
>
> These runs used fs_mark to write 4096 byte files with the following
> commands:
>
> fs_mark -d /home/test/t -s 4096 -n 40000 -N 50 -D 64 -t 1
> ...
> fs_mark -d /home/test/t -s 4096 -n 20000 -N 50 -D 64 -t 2
> ...
> fs_mark -d /home/test/t -s 4096 -n 10000 -N 50 -D 64 -t 4
> ...
> fs_mark -d /home/test/t -s 4096 -n 5000 -N 50 -D 64 -t 8
> ...
> fs_mark -d /home/test/t -s 4096 -n 4000 -N 50 -D 64 -t 10
> ...
> fs_mark -d /home/test/t -s 4096 -n 2000 -N 50 -D 64 -t 20
> ...
>
> Note that this spreads the files across 64 subdirectories; each thread
> writes 50 files to one directory and then moves on to the next in a round
> robin.

I'm starting to wonder about the disks I have, because my files/second is
spanking yours, and it's just a local Samsung 3Gb/s SATA drive. With those
commands I'm consistently getting over 700 files/sec. I'm seeing about a
1-5% increase in speed locally with my patch. I guess I'll start looking
around for some other hardware and check there in case this box is more
badass than I think it is.

Thanks much,

Josef
* Re: some hard numbers on ext3 & batching performance issue
From: Ric Wheeler @ 2008-03-07 20:45 UTC
To: Josef Bacik
Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi

Josef Bacik wrote:
> On Friday 07 March 2008 3:08:32 pm Ric Wheeler wrote:
>> Josef Bacik wrote:
>>> On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
>>>> After the IO/FS workshop last week, I posted some details on the slowdown
>>>> we see with ext3 when we have a low-latency back end instead of a normal
>>>> local disk (SCSI/S-ATA/etc).
>> ...
>> ...
>> ...
>>>> It would be really interesting to rerun some of these tests on xfs, which,
>>>> as Dave explained in the thread last week, has a more self-tuning way to
>>>> batch up transactions....
>>>>
>>>> Note that all of those poor users who have a synchronous write workload
>>>> today are in the "1" row for each of the above tables.
>>>
>>> Mind giving this a whirl? The fastest thing I've got here is an Apple X
>>> RAID and it's being used for something else atm, so I've only tested this
>>> on local disk to make sure it didn't make local performance suck (which
>>> it doesn't, btw). This should be equivalent to what David says XFS does.
>>>
>>> Thanks much,
>>>
>>> Josef
>>>
>>> diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
>>> index c6cbb6c..4596e1c 100644
>>> --- a/fs/jbd/transaction.c
>>> +++ b/fs/jbd/transaction.c
>>> @@ -1333,8 +1333,7 @@ int journal_stop(handle_t *handle)
>>>  {
>>>          transaction_t *transaction = handle->h_transaction;
>>>          journal_t *journal = transaction->t_journal;
>>> -        int old_handle_count, err;
>>> -        pid_t pid;
>>> +        int err;
>>>
>>>          J_ASSERT(journal_current_handle() == handle);
>>>
>>> @@ -1353,32 +1352,22 @@ int journal_stop(handle_t *handle)
>>>
>>>          jbd_debug(4, "Handle %p going down\n", handle);
>>>
>>> -        /*
>>> -         * Implement synchronous transaction batching.  If the handle
>>> -         * was synchronous, don't force a commit immediately.  Let's
>>> -         * yield and let another thread piggyback onto this transaction.
>>> -         * Keep doing that while new threads continue to arrive.
>>> -         * It doesn't cost much - we're about to run a commit and sleep
>>> -         * on IO anyway.  Speeds up many-threaded, many-dir operations
>>> -         * by 30x or more...
>>> -         *
>>> -         * But don't do this if this process was the most recent one to
>>> -         * perform a synchronous write.  We do this to detect the case where a
>>> -         * single process is doing a stream of sync writes.  No point in waiting
>>> -         * for joiners in that case.
>>> -         */
>>> -        pid = current->pid;
>>> -        if (handle->h_sync && journal->j_last_sync_writer != pid) {
>>> -                journal->j_last_sync_writer = pid;
>>> -                do {
>>> -                        old_handle_count = transaction->t_handle_count;
>>> -                        schedule_timeout_uninterruptible(1);
>>> -                } while (old_handle_count != transaction->t_handle_count);
>>> -        }
>>> -
>>>          current->journal_info = NULL;
>>>          spin_lock(&journal->j_state_lock);
>>>          spin_lock(&transaction->t_handle_lock);
>>> +
>>> +        if (journal->j_committing_transaction && handle->h_sync) {
>>> +                tid_t tid = journal->j_committing_transaction->t_tid;
>>> +
>>> +                spin_unlock(&transaction->t_handle_lock);
>>> +                spin_unlock(&journal->j_state_lock);
>>> +
>>> +                err = log_wait_commit(journal, tid);
>>> +
>>> +                spin_lock(&journal->j_state_lock);
>>> +                spin_lock(&transaction->t_handle_lock);
>>> +        }
>>> +
>>>          transaction->t_outstanding_credits -= handle->h_buffer_credits;
>>>          transaction->t_updates--;
>>>          if (!transaction->t_updates) {
>>
>> Running with Josef's patch, I was able to see a clear improvement for
>> batching these synchronous operations on ext3 with the RAM disk and array.
>> It is not too often that you get to do a simple change and see a 27 times
>> improvement ;-)
>>
>> On the bad side, the local disk case took as much as a 30% drop in
>> performance. The specific disk is not one that I have a lot of experience
>> with; I would like to retry on a disk that has been qualified by our group
>> (i.e., we have reasonable confidence that there are no firmware issues, etc).
>>
>> Now for the actual results.
>>
>> The results are the average value of 5 runs for each number of threads.
>>
>> Type        Threads  Baseline     Josef   Speedup (Josef/Baseline)
>> array           1       320.5     325.4    1.01
>> array           2       174.9     351.9    2.01
>> array           4       382.7     593.5    1.55
>> array           8       644.1     963.0    1.49
>> array          10       842.9    1038.7    1.23
>> array          20      1319.6    1432.3    1.08
>>
>> RAM disk        1      5621.4    5595.1    0.99
>> RAM disk        2       281.5    7613.3   27.04
>> RAM disk        4       579.9    9111.5   15.71
>> RAM disk        8       891.1    9357.3   10.50
>> RAM disk       10      1116.3    9873.6    8.84
>> RAM disk       20      1952.0   10703.6    5.48
>>
>> S-ATA disk      1        19.0      15.1    0.79
>> S-ATA disk      2        19.9      14.4    0.72
>> S-ATA disk      4        41.0      27.9    0.68
>> S-ATA disk      8        60.4      43.2    0.71
>> S-ATA disk     10        67.1      48.7    0.72
>> S-ATA disk     20       102.7      74.0    0.72
>>
>> Background on the tests:
>>
>> All of this is measured on three devices - a relatively old & slow array,
>> the local (slow!) 2.5" S-ATA disk in the box and a RAM disk.
>>
>> These runs used fs_mark to write 4096 byte files with the following
>> commands:
>>
>> fs_mark -d /home/test/t -s 4096 -n 40000 -N 50 -D 64 -t 1
>> ...
>> fs_mark -d /home/test/t -s 4096 -n 20000 -N 50 -D 64 -t 2
>> ...
>> fs_mark -d /home/test/t -s 4096 -n 10000 -N 50 -D 64 -t 4
>> ...
>> fs_mark -d /home/test/t -s 4096 -n 5000 -N 50 -D 64 -t 8
>> ...
>> fs_mark -d /home/test/t -s 4096 -n 4000 -N 50 -D 64 -t 10
>> ...
>> fs_mark -d /home/test/t -s 4096 -n 2000 -N 50 -D 64 -t 20
>> ...
>>
>> Note that this spreads the files across 64 subdirectories; each thread
>> writes 50 files to one directory and then moves on to the next in a round
>> robin.
>
> I'm starting to wonder about the disks I have, because my files/second is
> spanking yours, and it's just a local Samsung 3Gb/s SATA drive. With those
> commands I'm consistently getting over 700 files/sec. I'm seeing about a
> 1-5% increase in speed locally with my patch. I guess I'll start looking
> around for some other hardware and check there in case this box is more
> badass than I think it is.
>
> Thanks much,
>
> Josef

Sounds like you might be running with write cache on & barriers off ;-)

Make sure you have write cache & barriers enabled on the drive. With a good
S-ATA drive, you should be seeing about 35-50 files/sec with a
single-threaded writer.

The local disk that I tested on is a relatively slow S-ATA disk that is
more laptop quality/performance than server.

One thought I had about the results is that we might be flipping the IO
sequence with the local disk case. It is the only device of the three that
I tested which is seek/head movement sensitive for small files.

ric
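(For anyone following along, checking and setting those two knobs might
look roughly like the commands below; the device name is illustrative
only, and this is a sketch rather than anything posted in the thread.)

        hdparm -W /dev/sda          # report whether the drive write cache is on
        hdparm -W1 /dev/sda         # enable the write cache (-W0 disables it)
        mount -o barrier=1 /dev/sda1 /home/test   # ext3 with write barriers enabled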
* Re: some hard numbers on ext3 & batching performance issue
From: Josef Bacik @ 2008-03-12 18:37 UTC
To: Ric Wheeler
Cc: Josef Bacik, David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi

On Fri, Mar 07, 2008 at 03:45:58PM -0500, Ric Wheeler wrote:
> Josef Bacik wrote:
>> On Friday 07 March 2008 3:08:32 pm Ric Wheeler wrote:
>>> Josef Bacik wrote:
>>>> On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
>>>>> After the IO/FS workshop last week, I posted some details on the slowdown
>>>>> we see with ext3 when we have a low-latency back end instead of a normal
>>>>> local disk (SCSI/S-ATA/etc).
>>> ...
>>> ...
>>>
>>> Note that this spreads the files across 64 subdirectories; each thread
>>> writes 50 files to one directory and then moves on to the next in a round
>>> robin.
>>>
>>
>> I'm starting to wonder about the disks I have, because my files/second is
>> spanking yours, and it's just a local Samsung 3Gb/s SATA drive. With those
>> commands I'm consistently getting over 700 files/sec. I'm seeing about a
>> 1-5% increase in speed locally with my patch. I guess I'll start looking
>> around for some other hardware and check there in case this box is more
>> badass than I think it is.
>>
>> Thanks much,
>>
>> Josef
>>
>
> Sounds like you might be running with write cache on & barriers off ;-)
>
> Make sure you have write cache & barriers enabled on the drive. With a good
> S-ATA drive, you should be seeing about 35-50 files/sec with a
> single-threaded writer.
>
> The local disk that I tested on is a relatively slow S-ATA disk that is
> more laptop quality/performance than server.
>
> One thought I had about the results is that we might be flipping the IO
> sequence with the local disk case. It is the only device of the three that
> I tested which is seek/head movement sensitive for small files.
>

Ahh yes, with the write cache turned off and barriers on I get your
numbers; however, I'm not seeing the slowdown that you are - with and
without my patch I'm seeing the same performance. It's just a plain-jane
Intel SATA controller with a Samsung SATA disk set at 1.5Gbps. Same thing
with an NVIDIA SATA controller. I'll think about this some more and see if
there is something better that could be done that may help you.

Thanks much,

Josef
* Re: some hard numbers on ext3 & batching performance issue
From: Ric Wheeler @ 2008-03-13 11:26 UTC
To: Josef Bacik
Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi

Josef Bacik wrote:
> On Fri, Mar 07, 2008 at 03:45:58PM -0500, Ric Wheeler wrote:
>> Josef Bacik wrote:
>>> On Friday 07 March 2008 3:08:32 pm Ric Wheeler wrote:
>>>> Josef Bacik wrote:
>>>>> On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
>>>>>> After the IO/FS workshop last week, I posted some details on the slowdown
>>>>>> we see with ext3 when we have a low-latency back end instead of a normal
>>>>>> local disk (SCSI/S-ATA/etc).
>>>> ...
>>>> ...
>>>>
>>>> Note that this spreads the files across 64 subdirectories; each thread
>>>> writes 50 files to one directory and then moves on to the next in a round
>>>> robin.
>>>>
>>>
>>> I'm starting to wonder about the disks I have, because my files/second is
>>> spanking yours, and it's just a local Samsung 3Gb/s SATA drive. With those
>>> commands I'm consistently getting over 700 files/sec. I'm seeing about a
>>> 1-5% increase in speed locally with my patch. I guess I'll start looking
>>> around for some other hardware and check there in case this box is more
>>> badass than I think it is.
>>>
>>> Thanks much,
>>>
>>> Josef
>>>
>>
>> Sounds like you might be running with write cache on & barriers off ;-)
>>
>> Make sure you have write cache & barriers enabled on the drive. With a good
>> S-ATA drive, you should be seeing about 35-50 files/sec with a
>> single-threaded writer.
>>
>> The local disk that I tested on is a relatively slow S-ATA disk that is
>> more laptop quality/performance than server.
>>
>> One thought I had about the results is that we might be flipping the IO
>> sequence with the local disk case. It is the only device of the three that
>> I tested which is seek/head movement sensitive for small files.
>>
>
> Ahh yes, with the write cache turned off and barriers on I get your
> numbers; however, I'm not seeing the slowdown that you are - with and
> without my patch I'm seeing the same performance. It's just a plain-jane
> Intel SATA controller with a Samsung SATA disk set at 1.5Gbps. Same thing
> with an NVIDIA SATA controller. I'll think about this some more and see if
> there is something better that could be done that may help you.
>
> Thanks much,
>
> Josef
>

Thanks - you should see the numbers with write cache enabled and barriers
on as well, but for small files, write cache disabled is quite close ;-)

I am happy to rerun the tests at any point; I have a variety of disk types
and controllers (lots of Intel AHCI boxes) to use.

ric
* Re: some hard numbers on ext3 & batching performance issue
From: David Chinner @ 2008-03-06 0:28 UTC
To: Ric Wheeler
Cc: David Chinner, Josef Bacik, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi

On Wed, Mar 05, 2008 at 02:19:48PM -0500, Ric Wheeler wrote:
> The workload is generated using fs_mark
> (http://sourceforge.net/projects/fsmark/), which is basically a write
> workload with small files where each file gets fsync'ed before close.
> The metric is "files/sec".
.......
> It would be really interesting to rerun some of these tests on xfs, which,
> as Dave explained in the thread last week, has a more self-tuning way to
> batch up transactions....

Ok, so XFS numbers. Note these are all on a CONFIG_XFS_DEBUG=y kernel, so
there's lots of extra checks in the code as compared to a normal
production kernel.

Local disk (15krpm SCSI, WCD, CONFIG_XFS_DEBUG=y):

threads   files/s
1           97
2          117
4          109
8          110
10         113
20         116

Local disk (15krpm SCSI, WCE, nobarrier, CONFIG_XFS_DEBUG=y):

threads   files/s
1          203
2          216
4          243
8          332
10         405
20         424

Ramdisk (nobarrier, CONFIG_XFS_DEBUG=y):

          agcount=4   agcount=16
threads    files/s      files/s
1            1298         1298
2            2073         2394
4            3296         3321
8            3464         4199
10           3394         3937
20           3251         3691

Note the difference the amount of parallel allocation in the filesystem
makes - agcount=4 only allows up to 4 parallel allocations at once, so
even if they are all aggregated into the one log I/O, no further
allocation can take place until that log I/O is complete.

And at about 4000 files/s the system (4p ia64) is becoming CPU bound due
to all the debug checks in XFS.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
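(The exact mkfs options Dave used are not shown in the thread; as an
assumption only, the two ramdisk runs above could be reproduced with
something along these lines, where only the agcount values and the
nobarrier mount option come from his mail.)

        mkfs.xfs -f -d agcount=4  /dev/ram0   # at most 4 allocations in flight
        mkfs.xfs -f -d agcount=16 /dev/ram0   # allows more parallel allocation
        mount -o nobarrier /dev/ram0 /home/test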