* background on the ext3 batching performance issue
@ 2008-02-28 12:09 Ric Wheeler
2008-02-28 15:05 ` Josef Bacik
0 siblings, 1 reply; 21+ messages in thread
From: Ric Wheeler @ 2008-02-28 12:09 UTC (permalink / raw)
To: Theodore Ts'o, adilger, David Chinner, jack; +Cc: Feld, Andy, linux-fsdevel
At the LSF workshop, I mentioned that we have tripped across an
embarrassing performance issue in the jbd transaction code which is
clearly not tuned for low latency devices.
The short summary is that we can do, say, 800 10KB files/sec in a
write/fsync/close loop with a single thread, but drop down to under 250
files/sec with 2 or more threads.
This is pretty easy to reproduce with any small file write synchronous
workload (i.e., fsync() each file before close). We used my fs_mark
tool to reproduce.
The core of the issue is the call-out in the jbd transaction code to
schedule_timeout_uninterruptible(1), which causes us to sleep for 4ms
(one jiffy on a 250HZ build):
        pid = current->pid;
        if (handle->h_sync && journal->j_last_sync_writer != pid) {
                journal->j_last_sync_writer = pid;
                do {
                        old_handle_count = transaction->t_handle_count;
                        schedule_timeout_uninterruptible(1);
                } while (old_handle_count != transaction->t_handle_count);
        }
This is quite topical to the concern we had with low latency devices in
general, but specifically things like SSDs.
regards,
ric
^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: background on the ext3 batching performance issue
From: Josef Bacik @ 2008-02-28 15:05 UTC (permalink / raw)
To: Ric Wheeler
Cc: Theodore Ts'o, adilger, David Chinner, jack, Feld, Andy, linux-fsdevel

On Thursday 28 February 2008 7:09:17 am Ric Wheeler wrote:
> At the LSF workshop, I mentioned that we have tripped across an
> embarrassing performance issue in the jbd transaction code which is
> clearly not tuned for low latency devices.
>
> The short summary is that we can do say 800 10k files/sec in a
> write/fsync/close loop with a single thread, but drop down to under 250
> files/sec with 2 or more threads.
>
> [...]
>
> This is quite topical to the concern we had with low latency devices in
> general, but specifically things like SSDs.

Your testcase does in fact show a weakness in this optimization, but look
at the more likely case, where you have multiple writers on the same
filesystem rather than one guy doing write/fsync. If we wait we could
potentially add quite a few more buffers to this transaction before
flushing it, rather than flushing a buffer or two at a time. What would
you propose as a solution?

Josef
* Re: background on the ext3 batching performance issue
From: Josef Bacik @ 2008-02-28 15:41 UTC (permalink / raw)
To: Ric Wheeler
Cc: Theodore Ts'o, adilger, David Chinner, jack, Feld, Andy, linux-fsdevel

On Thursday 28 February 2008 10:05:11 am Josef Bacik wrote:
> [...]
> Your testcase does in fact show a weakness in this optimization, but look
> at the more likely case, where you have multiple writers on the same
> filesystem rather than one guy doing write/fsync. If we wait we could
> potentially add quite a few more buffers to this transaction before
> flushing it, rather than flushing a buffer or two at a time. What would
> you propose as a solution?

Forgive me, I said that badly; now that I've had my morning coffee let me
try again. You are ping-ponging the j_last_sync_writer back and forth
between the two threads, so you don't get the speedup you would get with
one thread, where we would just bypass the next sleep since we know we've
got one thread doing write/sync. So this brings up the question: should we
try to figure out if we have the situation where we have multiple threads
doing write/sync and therefore exploiting the weakness in this
optimization, and if we should, how would we do this properly? The only
thing I can think to do is to track sync writers on a transaction, and if
it's more than one, bypass this little snippet. In fact I think I'll go
ahead and do that and see what fs_mark comes up with.

Thank you,

Josef
* Re: background on the ext3 batching performance issue
From: Ric Wheeler @ 2008-02-28 13:03 UTC (permalink / raw)
To: Josef Bacik
Cc: Theodore Ts'o, adilger, David Chinner, jack, Feld, Andy, linux-fsdevel

Josef Bacik wrote:
> [...]
> The only thing I can think to do is to track sync writers on a
> transaction, and if its more than one bypass this little snippet. In fact
> I think I'll go ahead and do that and see what fs_mark comes up with.

Even worse, we go 4 times slower with 2 threads than we do with a single
thread! This code has tried several things in the past - reiserfs used to
do a yield() at one point.

I am traveling until the weekend, but will be able to help with this when I
get back in to my lab on Monday...

ric
* Re: background on the ext3 batching performance issue
From: Ric Wheeler @ 2008-02-28 13:09 UTC (permalink / raw)
To: Josef Bacik
Cc: Theodore Ts'o, adilger, David Chinner, jack, Feld, Andy, linux-fsdevel

Josef Bacik wrote:
> [...]
> The only thing I can think to do is to track sync writers on a
> transaction, and if its more than one bypass this little snippet.

One more thought - what we really want here is to have a sense of the
latency of the device. In the S-ATA disk case, this optimization works
well for batching since we "spend" an extra 4ms worst case for the chance
of combining multiple, slow 18ms operations.

With the Clariion box we tested, the optimization fails badly since the
cost is only 1.3ms, so we "optimize" by waiting 3-4 times longer than it
would take to do the operation immediately.

This has also always seemed to me to be the same problem that IO
schedulers handle with plugging - we want to dynamically figure out when
to plug and unplug here without hard coding in device specific tunings.

If we bypass the snippet for multi-threaded writers, we would probably
slow down this workload on normal S-ATA/ATA drives (or even higher
performance non-RAID disks).

ric
* Re: background on the ext3 batching performance issue
From: Jan Kara @ 2008-02-28 16:41 UTC (permalink / raw)
To: Ric Wheeler
Cc: Josef Bacik, Theodore Ts'o, adilger, David Chinner, Feld, Andy, linux-fsdevel

> One more thought - what we really want here is to have a sense of the
> latency of the device.
>
> [...]
>
> If we bypass the snippet for multi-threaded writers, we would probably
> slow down this workload on normal S-ATA/ATA drives (or even higher
> performance non-RAID disks).

Exactly. I can run some tests next week, but I guess for the standard disk
in your desktop this optimization is really worthwhile, since a transaction
commit has a significant cost on such a drive (and we already suck in
fsync() performance in ext3 for other reasons, so I wouldn't like to make
it even worse ;). The question is how we could tell in JBD whether the
optimization is worth it or not. A journal flag (settable via tune2fs) is
always an option, unless somebody has a better idea... If mkfs did some
magic and automatically set the flag when it found out the device has low
latency, that might actually be quite a satisfactory solution. This option
might also be useful for people preferring lower fsync latency over
general throughput.

Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
* Re: background on the ext3 batching performance issue
From: Chris Mason @ 2008-02-28 17:02 UTC (permalink / raw)
To: Ric Wheeler
Cc: Josef Bacik, Theodore Ts'o, adilger, David Chinner, jack, Feld, Andy, linux-fsdevel

On Thursday 28 February 2008, Ric Wheeler wrote:

[ fsync batching can be slow ]

> One more thought - what we really want here is to have a sense of the
> latency of the device.
>
> [...]

It probably makes sense to keep track of the average number of writers we
are able to gather into a transaction. There are lots of similar workloads
where we have a pool of procs doing fsyncs, and the size of the transaction
or the number of times we joined a running transaction will be fairly
constant.

-chris
* Re: background on the ext3 batching performance issue
From: Jan Kara @ 2008-02-28 17:13 UTC (permalink / raw)
To: Chris Mason
Cc: Ric Wheeler, Josef Bacik, Theodore Ts'o, adilger, David Chinner, Feld, Andy, linux-fsdevel

> It probably makes sense to keep track of the average number of writers we
> are able to gather into a transaction. There are lots of similar
> workloads where we have a pool of procs doing fsyncs and the size of the
> transaction or the number of times we joined a running transaction will
> be fairly constant.

I'm probably missing something, but what are you trying to say? Either we
wait for writers and the number of writes is higher, or we don't wait and
the number of writes in a transaction is lower...

Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
* Re: background on the ext3 batching performance issue
From: Chris Mason @ 2008-02-28 17:35 UTC (permalink / raw)
To: Jan Kara
Cc: Ric Wheeler, Josef Bacik, Theodore Ts'o, adilger, David Chinner, Feld, Andy, linux-fsdevel

On Thursday 28 February 2008, Jan Kara wrote:
> [...]
> I'm probably missing something, but what are you trying to say? Either we
> wait for writers and the number of writes is higher, or we don't wait and
> the number of writes in a transaction is lower...

The common workload would be N mail server threads servicing incoming
requests at a fairly constant rate. Right now we sleep for a bit and wait
for the number of writers to increase.

My guess is that if we record the average number of times a writer joins an
existing transaction, or if we record the average size of the transactions,
we'll end up with a fairly constant number.

So, we can skip the sleep if the transaction has already grown close to
that number. This would avoid the latencies Ric is seeing.

-chris
* Re: background on the ext3 batching performance issue
From: Jan Kara @ 2008-02-28 18:15 UTC (permalink / raw)
To: Chris Mason
Cc: Ric Wheeler, Josef Bacik, Theodore Ts'o, adilger, David Chinner, Feld, Andy, linux-fsdevel

On Thu 28-02-08 12:35:17, Chris Mason wrote:
> [...]
> So, we can skip the sleep if the transaction has already grown close to
> that number. This would avoid the latencies Ric is seeing.

OK, I see. Interesting idea, but in Ric's case you'd find out that two
writers always joined the transaction, and so you'd always wait for them
and nothing changes, does it?

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: background on the ext3 batching performance issue
From: David Chinner @ 2008-02-28 17:54 UTC (permalink / raw)
To: Ric Wheeler
Cc: Josef Bacik, Theodore Ts'o, adilger, David Chinner, jack, Feld, Andy, linux-fsdevel

On Thu, Feb 28, 2008 at 08:09:57AM -0500, Ric Wheeler wrote:
> One more thought - what we really want here is to have a sense of the
> latency of the device. In the S-ATA disk case, this optimization works
> well for batching since we "spend" an extra 4ms worst case in the chance
> of combining multiple, slow 18ms operations.
>
> [...]

It's the self-tuning aspect of this problem that makes it hard. In
the case of XFS, the way this tuning is done is that we look at the
state of the previous log I/O buffer to check if it is still syncing
to disk. If it is still syncing, we go to sleep waiting for that log
buffer I/O to complete. This holds the current buffer open to
aggregate more transactions before syncing it to disk and hence
allows parallel fsyncs to be issued in the one log write. The fact
that it waits for the previous log I/O to complete means it
self-tunes to the latency of the underlying storage medium.....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: background on the ext3 batching performance issue
From: Ric Wheeler @ 2008-02-28 19:48 UTC (permalink / raw)
To: David Chinner
Cc: Josef Bacik, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel

David Chinner wrote:
> [...]
> The fact that it waits for the previous log I/O to complete means it
> self-tunes to the latency of the underlying storage medium.....

With the experiments we ran before, the heuristic did eventually start
helping when we hit really high numbers of concurrent writing threads on
the Clariion box. I forget how many, but it was at least 12 or so.

ric
* Re: background on the ext3 batching performance issue 2008-02-28 17:54 ` David Chinner 2008-02-28 19:48 ` Ric Wheeler @ 2008-02-29 14:52 ` Ric Wheeler 2008-03-05 19:19 ` some hard numbers on ext3 & " Ric Wheeler 2 siblings, 0 replies; 21+ messages in thread From: Ric Wheeler @ 2008-02-29 14:52 UTC (permalink / raw) To: David Chinner Cc: Josef Bacik, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel David Chinner wrote: > On Thu, Feb 28, 2008 at 08:09:57AM -0500, Ric Wheeler wrote: >> One more thought - what we really want here is to have a sense of the >> latency of the device. In the S-ATA disk case, this optimization works >> well for batching since we "spend" an extra 4ms worst case in the chance >> of combining multiple, slow 18ms operations. >> >> With the clariion box we tested, the optimization fails badly since the >> cost is only 1.3 ms so we optimize by waiting 3-4 times longer than it >> would take to do the operation immediately. >> >> This problem has also seemed to me to be the same problem that IO >> schedulers do with plugging - we want to dynamically figure out when to >> plug and unplug here without hard coding in device specific tunings. >> >> If we bypass the snippet for multi-threaded writers, we would probably >> slow down this workload on normal S-ATA/ATA drives (or even higher >> performance non-RAID disks). > > It's the self-tuning aspect of this problem that makes it hard. In > the case of XFS, the way this tuning is done is that we look at the > state of the previous log I/O buffer to check if it is still syncing > to disk. If it is sync to disk, we go to sleep waiting for that log > buffer I/O to complete. This holds the current buffer open to > aggregate more transactions before syncing it to disk and hence > allows parallel fsyncs to be issued in the one log write. The fact > that it waits for the previous log I/O to complete means it > self-tunes to the latency of the underlying storage medium..... > > Cheers, > > Dave. 
This sounds like a really clean way to self-tune without having any
hard-coded assumptions (like the current one-jiffy wait)...

ric

^ permalink raw reply [flat|nested] 21+ messages in thread
* some hard numbers on ext3 & batching performance issue 2008-02-28 17:54 ` David Chinner 2008-02-28 19:48 ` Ric Wheeler 2008-02-29 14:52 ` Ric Wheeler @ 2008-03-05 19:19 ` Ric Wheeler 2008-03-05 20:20 ` Josef Bacik 2008-03-06 0:28 ` David Chinner 2 siblings, 2 replies; 21+ messages in thread From: Ric Wheeler @ 2008-03-05 19:19 UTC (permalink / raw) To: David Chinner Cc: Josef Bacik, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi

After the IO/FS workshop last week, I posted some details on the
slowdown we see with ext3 when we have a low latency back end instead of
a normal local disk (SCSI/S-ATA/etc).

As a follow up to that thread, I wanted to post some real numbers that
Andy from our performance team pulled together. Andy tested various
patches using three classes of storage (S-ATA, RAM disk and Clariion
array).

Note that this testing was done on a SLES10/SP1 kernel; the code in
question has not changed in mainline, but we should probably retest on
something newer just to clear up any doubts.

The work load is generated using fs_mark
(http://sourceforge.net/projects/fsmark/) which is basically a write
workload with small files, each file gets fsync'ed before close. The
metric is "files/sec".

The clearest result used a ramdisk to store 4k files.

We modified ext3 and jbd to accept a new mount option: bdelay. Use it like:

mount -o bdelay=n dev mountpoint

n is passed to schedule_timeout_interruptible() in the jbd code. If n ==
0, it skips the whole loop. If n is "yield", then substitute the
schedule...(n) with yield().

Note that in each table the first column is the delay value (in jiffies,
on a 250HZ build) and the remaining column headings are the number of
concurrent threads writing 4KB files.
Ramdisk test:

bdelay      1      2      4      8     10     20
0        4640   4498   3226   1721   1436    664
yield    4640   4078   2977   1611   1136    551
1        4647    250    482    588    629    483
2        4522    149    233    422    450    389
3        4504     86    165    271    308    334
4        4425     84    128    222    253    293

Midrange clariion:

bdelay      1      2      4      8     10     20
0         778    923   1567   1424   1276    785
yield     791    931   1551   1473   1328    806
1         793    304    499    714    751    760
2         789    132    201    382    441    589
3         792    124    168    298    342    471
4         786     71    116    237    277    393

Local disk:

bdelay      1      2      4      8     10     20
0          47     51     81    135    160    234
yield      36     45     74    117    138    214
1          44     52     86    148    183    258
2          40     60    109    163    184    265
3          40     52     97    148    171    264
4          35     42     83    149    169    246

Apologies for mangling the nicely formatted tables.

Note that the justification for the batching as we have it today is
basically this last local drive test case.

It would be really interesting to rerun some of these tests on xfs,
which, as Dave explained in the thread last week, has a more self-tuning
way to batch up transactions....

Note that all of those poor users who have a synchronous write workload
today are in the "1" row for each of the above tables.

ric

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: some hard numbers on ext3 & batching performance issue 2008-03-05 19:19 ` some hard numbers on ext3 & " Ric Wheeler @ 2008-03-05 20:20 ` Josef Bacik 2008-03-07 20:08 ` Ric Wheeler 2008-03-06 0:28 ` David Chinner 1 sibling, 1 reply; 21+ messages in thread From: Josef Bacik @ 2008-03-05 20:20 UTC (permalink / raw) To: ric Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote: > After the IO/FS workshop last week, I posted some details on the slow > down we see with ext3 when we have a low latency back end instead of a > normal local disk (SCSI/S-ATA/etc). > > As a follow up to that thread, I wanted to post some real numbers that > Andy from our performance team pulled together. Andy tested various > patches using three classes of storage (S-ATA, RAM disk and Clariion > array). > > Note that this testing was done on a SLES10/SP1 kernel, but the code in > question has not changed in mainline but we should probably retest on > something newer just to clear up any doubts. > > The work load is generated using fs_mark > (http://sourceforge.net/projects/fsmark/) which is basically a write > workload with small files, each file gets fsync'ed before close. The > metric is "files/sec". > > The clearest result used a ramdisk to store 4k files. > > We modified ext3 and jbd to accept a new mount option: bdelay Use it like: > > mount -o bdelay=n dev mountpoint > > n is passed to schedule_timeout_interruptible() in the jbd code. if n == > 0, it skips the whole loop. if n is "yield", then substitute the > schedule...(n) with yield(). > > Note that the first row is the value of the delay with a 250HZ build > followed by the number of concurrent threads writing 4KB files. 
>
> Ramdisk test:
>
> bdelay      1      2      4      8     10     20
> 0        4640   4498   3226   1721   1436    664
> yield    4640   4078   2977   1611   1136    551
> 1        4647    250    482    588    629    483
> 2        4522    149    233    422    450    389
> 3        4504     86    165    271    308    334
> 4        4425     84    128    222    253    293
>
> Midrange clariion:
>
> bdelay      1      2      4      8     10     20
> 0         778    923   1567   1424   1276    785
> yield     791    931   1551   1473   1328    806
> 1         793    304    499    714    751    760
> 2         789    132    201    382    441    589
> 3         792    124    168    298    342    471
> 4         786     71    116    237    277    393
>
> Local disk:
>
> bdelay      1      2      4      8     10     20
> 0          47     51     81    135    160    234
> yield      36     45     74    117    138    214
> 1          44     52     86    148    183    258
> 2          40     60    109    163    184    265
> 3          40     52     97    148    171    264
> 4          35     42     83    149    169    246
>
> Apologies for mangling the nicely formatted tables.
>
> Note that the justification for the batching as we have it today is
> basically this last local drive test case.
>
> It would be really interesting to rerun some of these tests on xfs which
> Dave explained in the thread last week has a more self tuning way to
> batch up transactions....
>
> Note that all of those poor users who have a synchronous write workload
> today are in the "1" row for each of the above tables.

Mind giving this a whirl? The fastest thing I've got here is an Apple X RAID
and it's being used for something else atm, so I've only tested this on local
disk to make sure it didn't make local performance suck (which it doesn't btw).
This should be equivalent to what David says XFS does. Thanks much,

Josef

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index c6cbb6c..4596e1c 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -1333,8 +1333,7 @@ int journal_stop(handle_t *handle)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal = transaction->t_journal;
-	int old_handle_count, err;
-	pid_t pid;
+	int err;
 
 	J_ASSERT(journal_current_handle() == handle);
 
@@ -1353,32 +1352,22 @@ int journal_stop(handle_t *handle)
 
 	jbd_debug(4, "Handle %p going down\n", handle);
 
-	/*
-	 * Implement synchronous transaction batching.  If the handle
-	 * was synchronous, don't force a commit immediately.  Let's
-	 * yield and let another thread piggyback onto this transaction.
-	 * Keep doing that while new threads continue to arrive.
-	 * It doesn't cost much - we're about to run a commit and sleep
-	 * on IO anyway.  Speeds up many-threaded, many-dir operations
-	 * by 30x or more...
-	 *
-	 * But don't do this if this process was the most recent one to
-	 * perform a synchronous write.  We do this to detect the case where a
-	 * single process is doing a stream of sync writes.  No point in waiting
-	 * for joiners in that case.
-	 */
-	pid = current->pid;
-	if (handle->h_sync && journal->j_last_sync_writer != pid) {
-		journal->j_last_sync_writer = pid;
-		do {
-			old_handle_count = transaction->t_handle_count;
-			schedule_timeout_uninterruptible(1);
-		} while (old_handle_count != transaction->t_handle_count);
-	}
-
 	current->journal_info = NULL;
 	spin_lock(&journal->j_state_lock);
 	spin_lock(&transaction->t_handle_lock);
+
+	if (journal->j_committing_transaction && handle->h_sync) {
+		tid_t tid = journal->j_committing_transaction->t_tid;
+
+		spin_unlock(&transaction->t_handle_lock);
+		spin_unlock(&journal->j_state_lock);
+
+		err = log_wait_commit(journal, tid);
+
+		spin_lock(&journal->j_state_lock);
+		spin_lock(&transaction->t_handle_lock);
+	}
+
 	transaction->t_outstanding_credits -= handle->h_buffer_credits;
 	transaction->t_updates--;
 	if (!transaction->t_updates) {

^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: some hard numbers on ext3 & batching performance issue 2008-03-05 20:20 ` Josef Bacik @ 2008-03-07 20:08 ` Ric Wheeler 2008-03-07 20:40 ` Josef Bacik 0 siblings, 1 reply; 21+ messages in thread From: Ric Wheeler @ 2008-03-07 20:08 UTC (permalink / raw) To: Josef Bacik Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi

Josef Bacik wrote:
> On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
>> After the IO/FS workshop last week, I posted some details on the slow
>> down we see with ext3 when we have a low latency back end instead of a
>> normal local disk (SCSI/S-ATA/etc).
>> ...
>> It would be really interesting to rerun some of these tests on xfs which
>> Dave explained in the thread last week has a more self tuning way to
>> batch up transactions....
>>
>> Note that all of those poor users who have a synchronous write workload
>> today are in the "1" row for each of the above tables.
>
> Mind giving this a whirl? The fastest thing I've got here is an Apple X RAID
> and its being used for something else atm, so I've only tested this on local
> disk to make sure it didn't make local performance suck (which it doesn't btw).
> This should be equivalent with what David says XFS does. Thanks much,
>
> Josef
>
> <patch snipped>

Running with Josef's patch, I was able to see a clear improvement for
batching these synchronous operations on ext3 with the RAM disk and
array. It is not too often that you get to do a simple change and see a
27 times improvement ;-)

On the bad side, the local disk case took as much as a 30% drop in
performance. The specific disk is not one that I have a lot of
experience with; I would like to retry on a disk that has been qualified
by our group (i.e., we have reasonable confidence that there are no
firmware issues, etc).
Now for the actual results. The results are the average value of 5 runs
for each number of threads.

Type        Threads  Baseline    Josef   Speedup (Josef/Baseline)
array          1        320.5    325.4    1.01
array          2        174.9    351.9    2.01
array          4        382.7    593.5    1.55
array          8        644.1    963.0    1.49
array         10        842.9   1038.7    1.23
array         20       1319.6   1432.3    1.08

RAM disk       1       5621.4   5595.1    0.99
RAM disk       2        281.5   7613.3   27.04
RAM disk       4        579.9   9111.5   15.71
RAM disk       8        891.1   9357.3   10.50
RAM disk      10       1116.3   9873.6    8.84
RAM disk      20       1952.0  10703.6    5.48

S-ATA disk     1         19.0     15.1    0.79
S-ATA disk     2         19.9     14.4    0.72
S-ATA disk     4         41.0     27.9    0.68
S-ATA disk     8         60.4     43.2    0.71
S-ATA disk    10         67.1     48.7    0.72
S-ATA disk    20        102.7     74.0    0.72

Background on the tests:

All of this is measured on three devices - a relatively old & slow
array, the local (slow!) 2.5" S-ATA disk in the box and a RAM disk.

These numbers were generated using fs_mark to write 4096 byte files with
the following commands:

fs_mark -d /home/test/t -s 4096 -n 40000 -N 50 -D 64 -t 1
...
fs_mark -d /home/test/t -s 4096 -n 20000 -N 50 -D 64 -t 2
...
fs_mark -d /home/test/t -s 4096 -n 10000 -N 50 -D 64 -t 4
...
fs_mark -d /home/test/t -s 4096 -n 5000 -N 50 -D 64 -t 8
...
fs_mark -d /home/test/t -s 4096 -n 4000 -N 50 -D 64 -t 10
...
fs_mark -d /home/test/t -s 4096 -n 2000 -N 50 -D 64 -t 20
...

Note that this spreads the files across 64 subdirectories; each thread
writes 50 files and then moves on to the next in a round robin.

ric

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: some hard numbers on ext3 & batching performance issue 2008-03-07 20:08 ` Ric Wheeler @ 2008-03-07 20:40 ` Josef Bacik 2008-03-07 20:45 ` Ric Wheeler 0 siblings, 1 reply; 21+ messages in thread From: Josef Bacik @ 2008-03-07 20:40 UTC (permalink / raw) To: ric Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi

On Friday 07 March 2008 3:08:32 pm Ric Wheeler wrote:
> Josef Bacik wrote:
> > On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
> > > After the IO/FS workshop last week, I posted some details on the slow
> > > down we see with ext3 when we have a low latency back end instead of a
> > > normal local disk (SCSI/S-ATA/etc).
...
> Running with Josef's patch, I was able to see a clear improvement for
> batching these synchronous operations on ext3 with the RAM disk and
> array. It is not too often that you get to do a simple change and see a
> 27 times improvement ;-)
>
> On the bad side, the local disk case took as much as a 30% drop in
> performance.
...
> Note that this spreads the files across 64 subdirectories, each thread
> writes 50 files and then moves on to the next in a round robin.
>

I'm starting to wonder about the disks I have, because my files/second is
spanking yours, and it's just a local Samsung 3gb/s sata drive. With those
commands I'm consistently getting over 700 files/sec. I'm seeing about a 1-5%
increase in speed locally with my patch. I guess I'll start looking around for
some other hardware and test there in case this box is more badass than I
think it is. Thanks much,

Josef

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: some hard numbers on ext3 & batching performance issue 2008-03-07 20:40 ` Josef Bacik @ 2008-03-07 20:45 ` Ric Wheeler 2008-03-12 18:37 ` Josef Bacik 0 siblings, 1 reply; 21+ messages in thread From: Ric Wheeler @ 2008-03-07 20:45 UTC (permalink / raw) To: Josef Bacik Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi

Josef Bacik wrote:
> On Friday 07 March 2008 3:08:32 pm Ric Wheeler wrote:
>> Josef Bacik wrote:
>>> On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
>>>> After the IO/FS workshop last week, I posted some details on the slow
>>>> down we see with ext3 when we have a low latency back end instead of a
>>>> normal local disk (SCSI/S-ATA/etc).
...
> I'm starting to wonder about the disks I have, because my files/second is
> spanking yours, and its just a local samsung 3gb/s sata drive. With those
> commands I'm consistently getting over 700 files/sec. I'm seeing about a 1-5%
> increase in speed locally with my patch. I guess I'll start looking around for
> some other hardware and check on there in case this box is more badass than I
> think it is. Thanks much,
>
> Josef
>

Sounds like you might be running with write cache on & barriers off ;-)

Make sure you have write cache & barriers enabled on the drive. With a
good S-ATA drive, you should be seeing about 35-50 files/sec with a
single threaded writer.

The local disk that I tested on is a relatively slow s-ata disk that is
more laptop quality/performance than server.

One thought I had about the results is that we might be flipping the IO
sequence with the local disk case. It is the only device of the three
that I tested which is seek/head movement sensitive for small files.

ric

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: some hard numbers on ext3 & batching performance issue 2008-03-07 20:45 ` Ric Wheeler @ 2008-03-12 18:37 ` Josef Bacik 2008-03-13 11:26 ` Ric Wheeler 0 siblings, 1 reply; 21+ messages in thread From: Josef Bacik @ 2008-03-12 18:37 UTC (permalink / raw) To: Ric Wheeler Cc: Josef Bacik, David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi On Fri, Mar 07, 2008 at 03:45:58PM -0500, Ric Wheeler wrote: > Josef Bacik wrote: >> On Friday 07 March 2008 3:08:32 pm Ric Wheeler wrote: >>> Josef Bacik wrote: >>>> On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote: >>>>> After the IO/FS workshop last week, I posted some details on the slow >>>>> down we see with ext3 when we have a low latency back end instead of a >>>>> normal local disk (SCSI/S-ATA/etc). >>> ... >>> ... >>> >>> Note that this spreads the files across 64 subdirectories, each thread >>> writes 50 files and then moves on to the next in a round robin. >>> >> >> I'm starting to wonder about the disks I have, because my files/second is >> spanking yours, and its just a local samsung 3gb/s sata drive. With those >> commands I'm consistently getting over 700 files/sec. I'm seeing about a >> 1-5% increase in speed locally with my patch. I guess I'll start looking >> around for some other hardware and check on there in case this box is more >> badass than I think it is. Thanks much, >> >> Josef >> > > Sounds like you might be running with write cache on & barriers off ;-) > > Make sure you have write cache & barriers enabled on the drive. With a good > S-ATA drive, you should be seeing about 35-50 files/sec with a single > threaded writer. > > The local disk that I tested on is a relatively slow s-ata disk that is > more laptop quality/performance than server. > > One thought I had about the results is that we might be flipping the IO > sequence with the local disk case. It is the only device of the three that > I tested which is seek/head movement sensitive for small files. 
>

Ahh yes, turning write cache off and barriers on I get your numbers;
however, I'm not seeing the slowdown that you are: with and without my
patch I'm seeing the same performance. It's just a plain-Jane Intel SATA
controller with a Samsung SATA disk set at 1.5gbps. Same thing with an
nvidia sata controller. I'll think about this some more and see if there
is something better that could be done that may help you. Thanks much,

Josef

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: some hard numbers on ext3 & batching performance issue 2008-03-12 18:37 ` Josef Bacik @ 2008-03-13 11:26 ` Ric Wheeler 0 siblings, 0 replies; 21+ messages in thread From: Ric Wheeler @ 2008-03-13 11:26 UTC (permalink / raw) To: Josef Bacik Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi Josef Bacik wrote: > On Fri, Mar 07, 2008 at 03:45:58PM -0500, Ric Wheeler wrote: >> Josef Bacik wrote: >>> On Friday 07 March 2008 3:08:32 pm Ric Wheeler wrote: >>>> Josef Bacik wrote: >>>>> On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote: >>>>>> After the IO/FS workshop last week, I posted some details on the slow >>>>>> down we see with ext3 when we have a low latency back end instead of a >>>>>> normal local disk (SCSI/S-ATA/etc). >>>> ... >>>> ... >>>> >>>> Note that this spreads the files across 64 subdirectories, each thread >>>> writes 50 files and then moves on to the next in a round robin. >>>> >>> I'm starting to wonder about the disks I have, because my files/second is >>> spanking yours, and its just a local samsung 3gb/s sata drive. With those >>> commands I'm consistently getting over 700 files/sec. I'm seeing about a >>> 1-5% increase in speed locally with my patch. I guess I'll start looking >>> around for some other hardware and check on there in case this box is more >>> badass than I think it is. Thanks much, >>> >>> Josef >>> >> Sounds like you might be running with write cache on & barriers off ;-) >> >> Make sure you have write cache & barriers enabled on the drive. With a good >> S-ATA drive, you should be seeing about 35-50 files/sec with a single >> threaded writer. >> >> The local disk that I tested on is a relatively slow s-ata disk that is >> more laptop quality/performance than server. >> >> One thought I had about the results is that we might be flipping the IO >> sequence with the local disk case. 
It is the only device of the three that >> I tested which is seek/head movement sensitive for small files. >> > > Ahh yes turning off write cache off and barriers on I get your numbers, however > I'm not seeing the slowdown that you are, with and without my patch I'm seeing > the same performance. Its just a plane jane intel sata controller with a > samsung sata disk set at 1.5gbps. Same thing with an nvidia sata controller. > I'll think about this some more and see if there is something better that could > be done that may help you. Thanks much, > > Josef > Thanks - you should see the numbers with write cache enabled and barriers on as well, but for small files, write cache disabled is quite close ;-) I am happy to rerun the tests at any point, I have a variety of disk types and controllers (lots of Intel AHCI boxes) to use. ric ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: some hard numbers on ext3 & batching performance issue 2008-03-05 19:19 ` some hard numbers on ext3 & " Ric Wheeler 2008-03-05 20:20 ` Josef Bacik @ 2008-03-06 0:28 ` David Chinner 1 sibling, 0 replies; 21+ messages in thread From: David Chinner @ 2008-03-06 0:28 UTC (permalink / raw) To: Ric Wheeler Cc: David Chinner, Josef Bacik, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi

On Wed, Mar 05, 2008 at 02:19:48PM -0500, Ric Wheeler wrote:
> The work load is generated using fs_mark
> (http://sourceforge.net/projects/fsmark/) which is basically a write
> workload with small files, each file gets fsync'ed before close. The
> metric is "files/sec".
.......
> It would be really interesting to rerun some of these tests on xfs which
> Dave explained in the thread last week has a more self tuning way to
> batch up transactions....

Ok, so XFS numbers. Note these are all on a CONFIG_XFS_DEBUG=y kernel,
so there's lots of extra checks in the code as compared to a normal
production kernel.

Local disk (15krpm SCSI, WCD, CONFIG_XFS_DEBUG=y):

threads  files/s
1             97
2            117
4            109
8            110
10           113
20           116

Local disk (15krpm SCSI, WCE, nobarrier, CONFIG_XFS_DEBUG=y):

threads  files/s
1            203
2            216
4            243
8            332
10           405
20           424

Ramdisk (nobarrier, CONFIG_XFS_DEBUG=y):

         agcount=4  agcount=16
threads  files/s    files/s
1           1298       1298
2           2073       2394
4           3296       3321
8           3464       4199
10          3394       3937
20          3251       3691

Note the difference the amount of parallel allocation in the filesystem
makes - agcount=4 only allows up to 4 parallel allocations at once, so
even if they are all aggregated into the one log I/O, no further
allocation can take place until that log I/O is complete. And at about
4000 files/s the system (4p ia64) is becoming CPU bound due to all the
debug checks in XFS.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2008-03-13 11:26 UTC | newest]

Thread overview: 21+ messages -- links below jump to the message on this page:
2008-02-28 12:09 background on the ext3 batching performance issue Ric Wheeler
2008-02-28 15:05 ` Josef Bacik
2008-02-28 15:41   ` Josef Bacik
2008-02-28 13:03 ` Ric Wheeler
2008-02-28 13:09 ` Ric Wheeler
2008-02-28 16:41 ` Jan Kara
2008-02-28 17:02   ` Chris Mason
2008-02-28 17:13   ` Jan Kara
2008-02-28 17:35   ` Chris Mason
2008-02-28 18:15   ` Jan Kara
2008-02-28 17:54 ` David Chinner
2008-02-28 19:48   ` Ric Wheeler
2008-02-29 14:52   ` Ric Wheeler
2008-03-05 19:19   ` some hard numbers on ext3 & batching performance issue Ric Wheeler
2008-03-05 20:20     ` Josef Bacik
2008-03-07 20:08       ` Ric Wheeler
2008-03-07 20:40         ` Josef Bacik
2008-03-07 20:45           ` Ric Wheeler
2008-03-12 18:37             ` Josef Bacik
2008-03-13 11:26               ` Ric Wheeler
2008-03-06 0:28     ` David Chinner