* background on the ext3 batching performance issue
@ 2008-02-28 12:09 Ric Wheeler
2008-02-28 15:05 ` Josef Bacik
0 siblings, 1 reply; 21+ messages in thread
From: Ric Wheeler @ 2008-02-28 12:09 UTC (permalink / raw)
To: Theodore Ts'o, adilger, David Chinner, jack; +Cc: Feld, Andy, linux-fsdevel
At the LSF workshop, I mentioned that we have tripped across an
embarrassing performance issue in the jbd transaction code which is
clearly not tuned for low latency devices.
The short summary is that we can do, say, 800 10KB files/sec in a
write/fsync/close loop with a single thread, but drop down to under 250
files/sec with 2 or more threads.
This is pretty easy to reproduce with any small file write synchronous
workload (i.e., fsync() each file before close). We used my fs_mark
tool to reproduce.
The core of the issue is the call-out in the jbd transaction code to
schedule_timeout_uninterruptible(1), which causes us to sleep for 4ms
(one jiffy on a 250HZ build):
        pid = current->pid;
        if (handle->h_sync && journal->j_last_sync_writer != pid) {
                journal->j_last_sync_writer = pid;
                do {
                        old_handle_count = transaction->t_handle_count;
                        schedule_timeout_uninterruptible(1);
                } while (old_handle_count != transaction->t_handle_count);
        }
This is quite topical to the concern we had with low latency devices in
general, but specifically things like SSDs.
regards,
ric
^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: background on the ext3 batching performance issue
From: Josef Bacik @ 2008-02-28 15:05 UTC (permalink / raw)
To: Ric Wheeler
Cc: Theodore Ts'o, adilger, David Chinner, jack, Feld, Andy, linux-fsdevel

On Thursday 28 February 2008 7:09:17 am Ric Wheeler wrote:
> At the LSF workshop, I mentioned that we have tripped across an
> embarrassing performance issue in the jbd transaction code which is
> clearly not tuned for low latency devices.
>
> The short summary is that we can do say 800 10k files/sec in a
> write/fsync/close loop with a single thread, but drop down to under 250
> files/sec with 2 or more threads.
>
> [...]
>
> This is quite topical to the concern we had with low latency devices in
> general, but specifically things like SSDs.

Your testcase does in fact show a weakness in this optimization, but look
at the more likely case, where you have multiple writers on the same
filesystem rather than one guy doing write/fsync. If we wait we could
potentially add quite a few more buffers to this transaction before
flushing it, rather than flushing a buffer or two at a time. What would
you propose as a solution?

Josef
* Re: background on the ext3 batching performance issue
From: Josef Bacik @ 2008-02-28 15:41 UTC (permalink / raw)
To: Ric Wheeler
Cc: Theodore Ts'o, adilger, David Chinner, jack, Feld, Andy, linux-fsdevel

On Thursday 28 February 2008 10:05:11 am Josef Bacik wrote:
> [...]
> Your testcase does in fact show a weakness in this optimization, but look
> at the more likely case, where you have multiple writers on the same
> filesystem rather than one guy doing write/fsync. If we wait we could
> potentially add quite a few more buffers to this transaction before
> flushing it, rather than flushing a buffer or two at a time. What would
> you propose as a solution?

Forgive me, I said that badly; now that I've had my morning coffee let me
try again. You are ping-ponging the j_last_sync_writer back and forth
between the two threads, so you don't get the speedup you would get with
one thread, where we would just bypass the next sleep since we know we've
got one thread doing write/sync. So this brings up the question: should we
try to figure out if we have the situation where we have multiple threads
doing write/sync and therefore exploiting the weakness in this
optimization, and if we should, how would we do this properly? The only
thing I can think to do is to track sync writers on a transaction, and if
it's more than one, bypass this little snippet. In fact I think I'll go
ahead and do that and see what fs_mark comes up with.

Thank you,

Josef
* Re: background on the ext3 batching performance issue
From: Ric Wheeler @ 2008-02-28 13:03 UTC (permalink / raw)
To: Josef Bacik
Cc: Theodore Ts'o, adilger, David Chinner, jack, Feld, Andy, linux-fsdevel

Josef Bacik wrote:
> [...]
> The only thing I can think to do is to track sync writers on a
> transaction, and if its more than one bypass this little snippet. In fact
> I think I'll go ahead and do that and see what fs_mark comes up with.

Even worse, we go 4 times slower with 2 threads than we do with a single
thread! This code has tried several things in the past - reiserfs used to
do a yield() at one point.

I am traveling until the weekend, but will be able to help with this when I
get back in to my lab on Monday...

ric
* Re: background on the ext3 batching performance issue
From: Ric Wheeler @ 2008-02-28 13:09 UTC (permalink / raw)
To: Josef Bacik
Cc: Theodore Ts'o, adilger, David Chinner, jack, Feld, Andy, linux-fsdevel

Josef Bacik wrote:
> [...]
> The only thing I can think to do is to track sync writers on a
> transaction, and if its more than one bypass this little snippet.

One more thought - what we really want here is to have a sense of the
latency of the device. In the S-ATA disk case, this optimization works
well for batching since we "spend" an extra 4ms worst case for the chance
of combining multiple, slow 18ms operations.

With the Clariion box we tested, the optimization fails badly since the
cost is only 1.3ms, so we "optimize" by waiting 3-4 times longer than it
would take to do the operation immediately.

This has also always seemed to me to be the same problem that IO
schedulers handle with plugging - we want to dynamically figure out when
to plug and unplug here without hard coding in device specific tunings.

If we bypass the snippet for multi-threaded writers, we would probably
slow down this workload on normal S-ATA/ATA drives (or even higher
performance non-RAID disks).

ric
* Re: background on the ext3 batching performance issue
From: Jan Kara @ 2008-02-28 16:41 UTC (permalink / raw)
To: Ric Wheeler
Cc: Josef Bacik, Theodore Ts'o, adilger, David Chinner, Feld, Andy, linux-fsdevel

> One more thought - what we really want here is to have a sense of the
> latency of the device.
>
> [...]
>
> If we bypass the snippet for multi-threaded writers, we would probably
> slow down this workload on normal S-ATA/ATA drives (or even higher
> performance non-RAID disks).

Exactly. I can run some tests next week, but I guess for the standard disk
in your desktop this optimization is really worthwhile, since a transaction
commit has a significant cost on such a drive (and we already suck in
fsync() performance in ext3 for other reasons, so I wouldn't like to make
it even worse ;). The question is how we could tell in JBD whether the
optimization is worth it or not. A journal flag (settable via tune2fs) is
always an option, unless somebody has a better idea... If mkfs did some
magic and automatically set the flag when it found out the device has low
latency, that might actually be quite a satisfactory solution. This option
might also be useful for people preferring lower fsync latency over
general throughput.

Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
* Re: background on the ext3 batching performance issue
From: Chris Mason @ 2008-02-28 17:02 UTC (permalink / raw)
To: Ric Wheeler
Cc: Josef Bacik, Theodore Ts'o, adilger, David Chinner, jack, Feld, Andy, linux-fsdevel

On Thursday 28 February 2008, Ric Wheeler wrote:

[ fsync batching can be slow ]

> One more thought - what we really want here is to have a sense of the
> latency of the device.
>
> [...]

It probably makes sense to keep track of the average number of writers we
are able to gather into a transaction. There are lots of similar workloads
where we have a pool of procs doing fsyncs, and the size of the transaction
or the number of times we joined a running transaction will be fairly
constant.

-chris
* Re: background on the ext3 batching performance issue
From: Jan Kara @ 2008-02-28 17:13 UTC (permalink / raw)
To: Chris Mason
Cc: Ric Wheeler, Josef Bacik, Theodore Ts'o, adilger, David Chinner, Feld, Andy, linux-fsdevel

> It probably makes sense to keep track of the average number of writers we
> are able to gather into a transaction. There are lots of similar
> workloads where we have a pool of procs doing fsyncs and the size of the
> transaction or the number of times we joined a running transaction will
> be fairly constant.

I'm probably missing something, but what are you trying to say? Either we
wait for writers and the number of writes is higher, or we don't wait and
the number of writes in a transaction is lower...

Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
* Re: background on the ext3 batching performance issue
From: Chris Mason @ 2008-02-28 17:35 UTC (permalink / raw)
To: Jan Kara
Cc: Ric Wheeler, Josef Bacik, Theodore Ts'o, adilger, David Chinner, Feld, Andy, linux-fsdevel

On Thursday 28 February 2008, Jan Kara wrote:
> [...]
> I'm probably missing something, but what are you trying to say? Either we
> wait for writers and the number of writes is higher, or we don't wait and
> the number of writes in a transaction is lower...

The common workload would be N mail server threads servicing incoming
requests at a fairly constant rate. Right now we sleep for a bit and wait
for the number of writers to increase.

My guess is that if we record the average number of times a writer joins an
existing transaction, or if we record the average size of the transactions,
we'll end up with a fairly constant number.

So, we can skip the sleep if the transaction has already grown close to
that number. This would avoid the latencies Ric is seeing.

-chris
* Re: background on the ext3 batching performance issue
From: Jan Kara @ 2008-02-28 18:15 UTC (permalink / raw)
To: Chris Mason
Cc: Ric Wheeler, Josef Bacik, Theodore Ts'o, adilger, David Chinner, Feld, Andy, linux-fsdevel

On Thu 28-02-08 12:35:17, Chris Mason wrote:
> [...]
> So, we can skip the sleep if the transaction has already grown close to
> that number. This would avoid the latencies Ric is seeing.

OK, I see. Interesting idea, but in Ric's case you'd find out that two
writers always joined the transaction, and so you'd always wait for them
and nothing changes, does it?

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: background on the ext3 batching performance issue
From: David Chinner @ 2008-02-28 17:54 UTC (permalink / raw)
To: Ric Wheeler
Cc: Josef Bacik, Theodore Ts'o, adilger, David Chinner, jack, Feld, Andy, linux-fsdevel

On Thu, Feb 28, 2008 at 08:09:57AM -0500, Ric Wheeler wrote:
> One more thought - what we really want here is to have a sense of the
> latency of the device. In the S-ATA disk case, this optimization works
> well for batching since we "spend" an extra 4ms worst case in the chance
> of combining multiple, slow 18ms operations.
>
> [...]

It's the self-tuning aspect of this problem that makes it hard. In
the case of XFS, the way this tuning is done is that we look at the
state of the previous log I/O buffer to check if it is still syncing
to disk. If it is still syncing, we go to sleep waiting for that log
buffer I/O to complete. This holds the current buffer open to
aggregate more transactions before syncing it to disk and hence
allows parallel fsyncs to be issued in the one log write. The fact
that it waits for the previous log I/O to complete means it
self-tunes to the latency of the underlying storage medium.....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: background on the ext3 batching performance issue
From: Ric Wheeler @ 2008-02-28 19:48 UTC (permalink / raw)
To: David Chinner
Cc: Josef Bacik, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel

David Chinner wrote:
> [...]
> The fact that it waits for the previous log I/O to complete means it
> self-tunes to the latency of the underlying storage medium.....

With the experiments we ran before, the heuristic did eventually start
helping when we hit really high numbers of concurrent writing threads on
the Clariion box. I forget how many, but it was at least 12 or so.

ric
* Re: background on the ext3 batching performance issue 2008-02-28 17:54 ` David Chinner 2008-02-28 19:48 ` Ric Wheeler @ 2008-02-29 14:52 ` Ric Wheeler 2008-03-05 19:19 ` some hard numbers on ext3 & " Ric Wheeler 2 siblings, 0 replies; 21+ messages in thread From: Ric Wheeler @ 2008-02-29 14:52 UTC (permalink / raw) To: David Chinner Cc: Josef Bacik, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel David Chinner wrote: > On Thu, Feb 28, 2008 at 08:09:57AM -0500, Ric Wheeler wrote: >> One more thought - what we really want here is to have a sense of the >> latency of the device. In the S-ATA disk case, this optimization works >> well for batching since we "spend" an extra 4ms worst case in the chance >> of combining multiple, slow 18ms operations. >> >> With the clariion box we tested, the optimization fails badly since the >> cost is only 1.3 ms so we optimize by waiting 3-4 times longer than it >> would take to do the operation immediately. >> >> This problem has also seemed to me to be the same problem that IO >> schedulers do with plugging - we want to dynamically figure out when to >> plug and unplug here without hard coding in device specific tunings. >> >> If we bypass the snippet for multi-threaded writers, we would probably >> slow down this workload on normal S-ATA/ATA drives (or even higher >> performance non-RAID disks). > > It's the self-tuning aspect of this problem that makes it hard. In > the case of XFS, the way this tuning is done is that we look at the > state of the previous log I/O buffer to check if it is still syncing > to disk. If it is sync to disk, we go to sleep waiting for that log > buffer I/O to complete. This holds the current buffer open to > aggregate more transactions before syncing it to disk and hence > allows parallel fsyncs to be issued in the one log write. The fact > that it waits for the previous log I/O to complete means it > self-tunes to the latency of the underlying storage medium..... > > Cheers, > > Dave. 
This sounds like a really clean way to self-tune without having any
hard-coded assumptions (like the current one-jiffy wait)...

ric

^ permalink raw reply [flat|nested] 21+ messages in thread
* some hard numbers on ext3 & batching performance issue 2008-02-28 17:54 ` David Chinner 2008-02-28 19:48 ` Ric Wheeler 2008-02-29 14:52 ` Ric Wheeler @ 2008-03-05 19:19 ` Ric Wheeler 2008-03-05 20:20 ` Josef Bacik 2008-03-06 0:28 ` David Chinner 2 siblings, 2 replies; 21+ messages in thread From: Ric Wheeler @ 2008-03-05 19:19 UTC (permalink / raw) To: David Chinner Cc: Josef Bacik, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi

After the IO/FS workshop last week, I posted some details on the
slowdown we see with ext3 when we have a low latency back end instead of
a normal local disk (SCSI/S-ATA/etc).

As a follow up to that thread, I wanted to post some real numbers that
Andy from our performance team pulled together. Andy tested various
patches using three classes of storage (S-ATA, RAM disk and Clariion
array).

Note that this testing was done on a SLES10/SP1 kernel; the code in
question has not changed in mainline, but we should probably retest on
something newer just to clear up any doubts.

The work load is generated using fs_mark
(http://sourceforge.net/projects/fsmark/) which is basically a write
workload with small files, each file gets fsync'ed before close. The
metric is "files/sec".

The clearest result used a ramdisk to store 4k files.

We modified ext3 and jbd to accept a new mount option: bdelay. Use it like:

mount -o bdelay=n dev mountpoint

n is passed to schedule_timeout_interruptible() in the jbd code. If n ==
0, it skips the whole loop. If n is "yield", then substitute the
schedule...(n) with yield().

Note that in each table the first column is the delay value (in jiffies,
on a 250HZ build) and the remaining column headings are the number of
concurrent threads writing 4KB files.
Ramdisk test:

bdelay      1      2      4      8     10     20
0        4640   4498   3226   1721   1436    664
yield    4640   4078   2977   1611   1136    551
1        4647    250    482    588    629    483
2        4522    149    233    422    450    389
3        4504     86    165    271    308    334
4        4425     84    128    222    253    293

Midrange clariion:

bdelay      1      2      4      8     10     20
0         778    923   1567   1424   1276    785
yield     791    931   1551   1473   1328    806
1         793    304    499    714    751    760
2         789    132    201    382    441    589
3         792    124    168    298    342    471
4         786     71    116    237    277    393

Local disk:

bdelay      1      2      4      8     10     20
0          47     51     81    135    160    234
yield      36     45     74    117    138    214
1          44     52     86    148    183    258
2          40     60    109    163    184    265
3          40     52     97    148    171    264
4          35     42     83    149    169    246

Apologies for mangling the nicely formatted tables.

Note that the justification for the batching as we have it today is
basically this last local drive test case.

It would be really interesting to rerun some of these tests on xfs,
which, as Dave explained in the thread last week, has a more self-tuning
way to batch up transactions....

Note that all of those poor users who have a synchronous write workload
today are in the "1" row for each of the above tables.

ric

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: some hard numbers on ext3 & batching performance issue 2008-03-05 19:19 ` some hard numbers on ext3 & " Ric Wheeler @ 2008-03-05 20:20 ` Josef Bacik 2008-03-07 20:08 ` Ric Wheeler 2008-03-06 0:28 ` David Chinner 1 sibling, 1 reply; 21+ messages in thread From: Josef Bacik @ 2008-03-05 20:20 UTC (permalink / raw) To: ric Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote: > After the IO/FS workshop last week, I posted some details on the slow > down we see with ext3 when we have a low latency back end instead of a > normal local disk (SCSI/S-ATA/etc). > > As a follow up to that thread, I wanted to post some real numbers that > Andy from our performance team pulled together. Andy tested various > patches using three classes of storage (S-ATA, RAM disk and Clariion > array). > > Note that this testing was done on a SLES10/SP1 kernel, but the code in > question has not changed in mainline but we should probably retest on > something newer just to clear up any doubts. > > The work load is generated using fs_mark > (http://sourceforge.net/projects/fsmark/) which is basically a write > workload with small files, each file gets fsync'ed before close. The > metric is "files/sec". > > The clearest result used a ramdisk to store 4k files. > > We modified ext3 and jbd to accept a new mount option: bdelay Use it like: > > mount -o bdelay=n dev mountpoint > > n is passed to schedule_timeout_interruptible() in the jbd code. if n == > 0, it skips the whole loop. if n is "yield", then substitute the > schedule...(n) with yield(). > > Note that the first row is the value of the delay with a 250HZ build > followed by the number of concurrent threads writing 4KB files. 
>
> Ramdisk test:
>
> bdelay      1      2      4      8     10     20
> 0        4640   4498   3226   1721   1436    664
> yield    4640   4078   2977   1611   1136    551
> 1        4647    250    482    588    629    483
> 2        4522    149    233    422    450    389
> 3        4504     86    165    271    308    334
> 4        4425     84    128    222    253    293
>
> Midrange clariion:
>
> bdelay      1      2      4      8     10     20
> 0         778    923   1567   1424   1276    785
> yield     791    931   1551   1473   1328    806
> 1         793    304    499    714    751    760
> 2         789    132    201    382    441    589
> 3         792    124    168    298    342    471
> 4         786     71    116    237    277    393
>
> Local disk:
>
> bdelay      1      2      4      8     10     20
> 0          47     51     81    135    160    234
> yield      36     45     74    117    138    214
> 1          44     52     86    148    183    258
> 2          40     60    109    163    184    265
> 3          40     52     97    148    171    264
> 4          35     42     83    149    169    246
>
> Apologies for mangling the nicely formatted tables.
>
> Note that the justification for the batching as we have it today is
> basically this last local drive test case.
>
> It would be really interesting to rerun some of these tests on xfs which
> Dave explained in the thread last week has a more self tuning way to
> batch up transactions....
>
> Note that all of those poor users who have a synchronous write workload
> today are in the "1" row for each of the above tables.

Mind giving this a whirl? The fastest thing I've got here is an Apple X RAID
and it's being used for something else atm, so I've only tested this on local
disk to make sure it didn't make local performance suck (which it doesn't btw).
This should be equivalent to what David says XFS does. Thanks much,

Josef

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index c6cbb6c..4596e1c 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -1333,8 +1333,7 @@ int journal_stop(handle_t *handle)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal = transaction->t_journal;
-	int old_handle_count, err;
-	pid_t pid;
+	int err;
 
 	J_ASSERT(journal_current_handle() == handle);
 
@@ -1353,32 +1352,22 @@ int journal_stop(handle_t *handle)
 
 	jbd_debug(4, "Handle %p going down\n", handle);
 
-	/*
-	 * Implement synchronous transaction batching.  If the handle
-	 * was synchronous, don't force a commit immediately.  Let's
-	 * yield and let another thread piggyback onto this transaction.
-	 * Keep doing that while new threads continue to arrive.
-	 * It doesn't cost much - we're about to run a commit and sleep
-	 * on IO anyway.  Speeds up many-threaded, many-dir operations
-	 * by 30x or more...
-	 *
-	 * But don't do this if this process was the most recent one to
-	 * perform a synchronous write.  We do this to detect the case where a
-	 * single process is doing a stream of sync writes.  No point in waiting
-	 * for joiners in that case.
-	 */
-	pid = current->pid;
-	if (handle->h_sync && journal->j_last_sync_writer != pid) {
-		journal->j_last_sync_writer = pid;
-		do {
-			old_handle_count = transaction->t_handle_count;
-			schedule_timeout_uninterruptible(1);
-		} while (old_handle_count != transaction->t_handle_count);
-	}
-
 	current->journal_info = NULL;
 	spin_lock(&journal->j_state_lock);
 	spin_lock(&transaction->t_handle_lock);
+
+	if (journal->j_committing_transaction && handle->h_sync) {
+		tid_t tid = journal->j_committing_transaction->t_tid;
+
+		spin_unlock(&transaction->t_handle_lock);
+		spin_unlock(&journal->j_state_lock);
+
+		err = log_wait_commit(journal, tid);
+
+		spin_lock(&journal->j_state_lock);
+		spin_lock(&transaction->t_handle_lock);
+	}
+
 	transaction->t_outstanding_credits -= handle->h_buffer_credits;
 	transaction->t_updates--;
 	if (!transaction->t_updates) {

^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: some hard numbers on ext3 & batching performance issue 2008-03-05 20:20 ` Josef Bacik @ 2008-03-07 20:08 ` Ric Wheeler 2008-03-07 20:40 ` Josef Bacik 0 siblings, 1 reply; 21+ messages in thread From: Ric Wheeler @ 2008-03-07 20:08 UTC (permalink / raw) To: Josef Bacik Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi

Josef Bacik wrote:
> On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
>> After the IO/FS workshop last week, I posted some details on the slow
>> down we see with ext3 when we have a low latency back end instead of a
>> normal local disk (SCSI/S-ATA/etc).
>> ...
>> It would be really interesting to rerun some of these tests on xfs which
>> Dave explained in the thread last week has a more self tuning way to
>> batch up transactions....
>>
>> Note that all of those poor users who have a synchronous write workload
>> today are in the "1" row for each of the above tables.
>
> Mind giving this a whirl? The fastest thing I've got here is an Apple X RAID
> and its being used for something else atm, so I've only tested this on local
> disk to make sure it didn't make local performance suck (which it doesn't btw).
> This should be equivalent with what David says XFS does. Thanks much,
>
> Josef
>
> <patch snipped>

Running with Josef's patch, I was able to see a clear improvement for
batching these synchronous operations on ext3 with the RAM disk and
array. It is not too often that you get to do a simple change and see a
27 times improvement ;-)

On the bad side, the local disk case took as much as a 30% drop in
performance. The specific disk is not one that I have a lot of
experience with; I would like to retry on a disk that has been qualified
by our group (i.e., we have reasonable confidence that there are no
firmware issues, etc).
Now for the actual results. The results are the average value of 5 runs
for each number of threads.

Type        Threads  Baseline    Josef   Speedup (Josef/Baseline)
array          1        320.5    325.4    1.01
array          2        174.9    351.9    2.01
array          4        382.7    593.5    1.55
array          8        644.1    963.0    1.49
array         10        842.9   1038.7    1.23
array         20       1319.6   1432.3    1.08

RAM disk       1       5621.4   5595.1    0.99
RAM disk       2        281.5   7613.3   27.04
RAM disk       4        579.9   9111.5   15.71
RAM disk       8        891.1   9357.3   10.50
RAM disk      10       1116.3   9873.6    8.84
RAM disk      20       1952.0  10703.6    5.48

S-ATA disk     1         19.0     15.1    0.79
S-ATA disk     2         19.9     14.4    0.72
S-ATA disk     4         41.0     27.9    0.68
S-ATA disk     8         60.4     43.2    0.71
S-ATA disk    10         67.1     48.7    0.72
S-ATA disk    20        102.7     74.0    0.72

Background on the tests:

All of this is measured on three devices - a relatively old & slow
array, the local (slow!) 2.5" S-ATA disk in the box and a RAM disk.

These numbers were generated using fs_mark to write 4096 byte files with
the following commands:

fs_mark -d /home/test/t -s 4096 -n 40000 -N 50 -D 64 -t 1
...
fs_mark -d /home/test/t -s 4096 -n 20000 -N 50 -D 64 -t 2
...
fs_mark -d /home/test/t -s 4096 -n 10000 -N 50 -D 64 -t 4
...
fs_mark -d /home/test/t -s 4096 -n 5000 -N 50 -D 64 -t 8
...
fs_mark -d /home/test/t -s 4096 -n 4000 -N 50 -D 64 -t 10
...
fs_mark -d /home/test/t -s 4096 -n 2000 -N 50 -D 64 -t 20
...

Note that this spreads the files across 64 subdirectories; each thread
writes 50 files and then moves on to the next in a round robin.

ric

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: some hard numbers on ext3 & batching performance issue 2008-03-07 20:08 ` Ric Wheeler @ 2008-03-07 20:40 ` Josef Bacik 2008-03-07 20:45 ` Ric Wheeler 0 siblings, 1 reply; 21+ messages in thread From: Josef Bacik @ 2008-03-07 20:40 UTC (permalink / raw) To: ric Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi

On Friday 07 March 2008 3:08:32 pm Ric Wheeler wrote:
> Josef Bacik wrote:
> > On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
> > > After the IO/FS workshop last week, I posted some details on the slow
> > > down we see with ext3 when we have a low latency back end instead of a
> > > normal local disk (SCSI/S-ATA/etc).
...
> Running with Josef's patch, I was able to see a clear improvement for
> batching these synchronous operations on ext3 with the RAM disk and
> array. It is not too often that you get to do a simple change and see a
> 27 times improvement ;-)
>
> On the bad side, the local disk case took as much as a 30% drop in
> performance.
...
> Note that this spreads the files across 64 subdirectories, each thread
> writes 50 files and then moves on to the next in a round robin.
>

I'm starting to wonder about the disks I have, because my files/second is
spanking yours, and it's just a local Samsung 3gb/s sata drive. With those
commands I'm consistently getting over 700 files/sec. I'm seeing about a 1-5%
increase in speed locally with my patch. I guess I'll start looking around for
some other hardware and test there in case this box is more badass than I
think it is. Thanks much,

Josef

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: some hard numbers on ext3 & batching performance issue 2008-03-07 20:40 ` Josef Bacik @ 2008-03-07 20:45 ` Ric Wheeler 2008-03-12 18:37 ` Josef Bacik 0 siblings, 1 reply; 21+ messages in thread From: Ric Wheeler @ 2008-03-07 20:45 UTC (permalink / raw) To: Josef Bacik Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi

Josef Bacik wrote:
> On Friday 07 March 2008 3:08:32 pm Ric Wheeler wrote:
>> Josef Bacik wrote:
>>> On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
>>>> After the IO/FS workshop last week, I posted some details on the slow
>>>> down we see with ext3 when we have a low latency back end instead of a
>>>> normal local disk (SCSI/S-ATA/etc).
...
> I'm starting to wonder about the disks I have, because my files/second is
> spanking yours, and its just a local samsung 3gb/s sata drive. With those
> commands I'm consistently getting over 700 files/sec. I'm seeing about a 1-5%
> increase in speed locally with my patch. I guess I'll start looking around for
> some other hardware and check on there in case this box is more badass than I
> think it is. Thanks much,
>
> Josef
>

Sounds like you might be running with write cache on & barriers off ;-)

Make sure you have write cache & barriers enabled on the drive. With a
good S-ATA drive, you should be seeing about 35-50 files/sec with a
single threaded writer.

The local disk that I tested on is a relatively slow s-ata disk that is
more laptop quality/performance than server.

One thought I had about the results is that we might be flipping the IO
sequence with the local disk case. It is the only device of the three
that I tested which is seek/head movement sensitive for small files.

ric

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: some hard numbers on ext3 & batching performance issue 2008-03-07 20:45 ` Ric Wheeler @ 2008-03-12 18:37 ` Josef Bacik 2008-03-13 11:26 ` Ric Wheeler 0 siblings, 1 reply; 21+ messages in thread From: Josef Bacik @ 2008-03-12 18:37 UTC (permalink / raw) To: Ric Wheeler Cc: Josef Bacik, David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi On Fri, Mar 07, 2008 at 03:45:58PM -0500, Ric Wheeler wrote: > Josef Bacik wrote: >> On Friday 07 March 2008 3:08:32 pm Ric Wheeler wrote: >>> Josef Bacik wrote: >>>> On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote: >>>>> After the IO/FS workshop last week, I posted some details on the slow >>>>> down we see with ext3 when we have a low latency back end instead of a >>>>> normal local disk (SCSI/S-ATA/etc). >>> ... >>> ... >>> >>> Note that this spreads the files across 64 subdirectories, each thread >>> writes 50 files and then moves on to the next in a round robin. >>> >> >> I'm starting to wonder about the disks I have, because my files/second is >> spanking yours, and its just a local samsung 3gb/s sata drive. With those >> commands I'm consistently getting over 700 files/sec. I'm seeing about a >> 1-5% increase in speed locally with my patch. I guess I'll start looking >> around for some other hardware and check on there in case this box is more >> badass than I think it is. Thanks much, >> >> Josef >> > > Sounds like you might be running with write cache on & barriers off ;-) > > Make sure you have write cache & barriers enabled on the drive. With a good > S-ATA drive, you should be seeing about 35-50 files/sec with a single > threaded writer. > > The local disk that I tested on is a relatively slow s-ata disk that is > more laptop quality/performance than server. > > One thought I had about the results is that we might be flipping the IO > sequence with the local disk case. It is the only device of the three that > I tested which is seek/head movement sensitive for small files. 
>

Ahh yes, turning write cache off and barriers on I get your numbers;
however, I'm not seeing the slowdown that you are: with and without my
patch I'm seeing the same performance. It's just a plain-Jane Intel SATA
controller with a Samsung SATA disk set at 1.5gbps. Same thing with an
nvidia sata controller. I'll think about this some more and see if there
is something better that could be done that may help you. Thanks much,

Josef

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: some hard numbers on ext3 & batching performance issue 2008-03-12 18:37 ` Josef Bacik @ 2008-03-13 11:26 ` Ric Wheeler 0 siblings, 0 replies; 21+ messages in thread From: Ric Wheeler @ 2008-03-13 11:26 UTC (permalink / raw) To: Josef Bacik Cc: David Chinner, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi Josef Bacik wrote: > On Fri, Mar 07, 2008 at 03:45:58PM -0500, Ric Wheeler wrote: >> Josef Bacik wrote: >>> On Friday 07 March 2008 3:08:32 pm Ric Wheeler wrote: >>>> Josef Bacik wrote: >>>>> On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote: >>>>>> After the IO/FS workshop last week, I posted some details on the slow >>>>>> down we see with ext3 when we have a low latency back end instead of a >>>>>> normal local disk (SCSI/S-ATA/etc). >>>> ... >>>> ... >>>> >>>> Note that this spreads the files across 64 subdirectories, each thread >>>> writes 50 files and then moves on to the next in a round robin. >>>> >>> I'm starting to wonder about the disks I have, because my files/second is >>> spanking yours, and its just a local samsung 3gb/s sata drive. With those >>> commands I'm consistently getting over 700 files/sec. I'm seeing about a >>> 1-5% increase in speed locally with my patch. I guess I'll start looking >>> around for some other hardware and check on there in case this box is more >>> badass than I think it is. Thanks much, >>> >>> Josef >>> >> Sounds like you might be running with write cache on & barriers off ;-) >> >> Make sure you have write cache & barriers enabled on the drive. With a good >> S-ATA drive, you should be seeing about 35-50 files/sec with a single >> threaded writer. >> >> The local disk that I tested on is a relatively slow s-ata disk that is >> more laptop quality/performance than server. >> >> One thought I had about the results is that we might be flipping the IO >> sequence with the local disk case. 
It is the only device of the three that >> I tested which is seek/head movement sensitive for small files. >> > > Ahh yes turning off write cache off and barriers on I get your numbers, however > I'm not seeing the slowdown that you are, with and without my patch I'm seeing > the same performance. Its just a plane jane intel sata controller with a > samsung sata disk set at 1.5gbps. Same thing with an nvidia sata controller. > I'll think about this some more and see if there is something better that could > be done that may help you. Thanks much, > > Josef > Thanks - you should see the numbers with write cache enabled and barriers on as well, but for small files, write cache disabled is quite close ;-) I am happy to rerun the tests at any point, I have a variety of disk types and controllers (lots of Intel AHCI boxes) to use. ric ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: some hard numbers on ext3 & batching performance issue 2008-03-05 19:19 ` some hard numbers on ext3 & " Ric Wheeler 2008-03-05 20:20 ` Josef Bacik @ 2008-03-06 0:28 ` David Chinner 1 sibling, 0 replies; 21+ messages in thread From: David Chinner @ 2008-03-06 0:28 UTC (permalink / raw) To: Ric Wheeler Cc: David Chinner, Josef Bacik, Theodore Ts'o, adilger, jack, Feld, Andy, linux-fsdevel, linux-scsi

On Wed, Mar 05, 2008 at 02:19:48PM -0500, Ric Wheeler wrote:
> The work load is generated using fs_mark
> (http://sourceforge.net/projects/fsmark/) which is basically a write
> workload with small files, each file gets fsync'ed before close. The
> metric is "files/sec".
.......
> It would be really interesting to rerun some of these tests on xfs which
> Dave explained in the thread last week has a more self tuning way to
> batch up transactions....

Ok, so XFS numbers. Note these are all on a CONFIG_XFS_DEBUG=y kernel,
so there's lots of extra checks in the code as compared to a normal
production kernel.

Local disk (15krpm SCSI, WCD, CONFIG_XFS_DEBUG=y):

threads  files/s
1             97
2            117
4            109
8            110
10           113
20           116

Local disk (15krpm SCSI, WCE, nobarrier, CONFIG_XFS_DEBUG=y):

threads  files/s
1            203
2            216
4            243
8            332
10           405
20           424

Ramdisk (nobarrier, CONFIG_XFS_DEBUG=y):

         agcount=4  agcount=16
threads  files/s    files/s
1           1298       1298
2           2073       2394
4           3296       3321
8           3464       4199
10          3394       3937
20          3251       3691

Note the difference the amount of parallel allocation in the filesystem
makes - agcount=4 only allows up to 4 parallel allocations at once, so
even if they are all aggregated into the one log I/O, no further
allocation can take place until that log I/O is complete. And at about
4000 files/s the system (4p ia64) is becoming CPU bound due to all the
debug checks in XFS.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2008-03-13 11:26 UTC | newest]

Thread overview: 21+ messages -- links below jump to the message on this page:
2008-02-28 12:09 background on the ext3 batching performance issue Ric Wheeler
2008-02-28 15:05 ` Josef Bacik
2008-02-28 15:41   ` Josef Bacik
2008-02-28 13:03 ` Ric Wheeler
2008-02-28 13:09 ` Ric Wheeler
2008-02-28 16:41 ` Jan Kara
2008-02-28 17:02   ` Chris Mason
2008-02-28 17:13   ` Jan Kara
2008-02-28 17:35   ` Chris Mason
2008-02-28 18:15   ` Jan Kara
2008-02-28 17:54 ` David Chinner
2008-02-28 19:48   ` Ric Wheeler
2008-02-29 14:52   ` Ric Wheeler
2008-03-05 19:19   ` some hard numbers on ext3 & batching performance issue Ric Wheeler
2008-03-05 20:20     ` Josef Bacik
2008-03-07 20:08       ` Ric Wheeler
2008-03-07 20:40         ` Josef Bacik
2008-03-07 20:45           ` Ric Wheeler
2008-03-12 18:37             ` Josef Bacik
2008-03-13 11:26               ` Ric Wheeler
2008-03-06 0:28     ` David Chinner