* [JBD] change batching logic to improve O_SYNC performance @ 2005-12-15 14:59 Benjamin LaHaise 2005-12-15 14:22 ` Ric Wheeler 2005-12-15 23:55 ` Andrew Morton 0 siblings, 2 replies; 5+ messages in thread From: Benjamin LaHaise @ 2005-12-15 14:59 UTC (permalink / raw) To: akpm, sct; +Cc: linux-fsdevel Hello folks, When writing files out using O_SYNC, jbd's 1 jiffy delay results in a significant drop in throughput as the disk sits idle. The patch below results in a 4-5x performance improvement (from 6.5MB/s to ~24-30MB/s on my IDE test box) when writing out files using O_SYNC. Instead of always delaying for 1 jiffy when trying to batch, merely do a yield() to allow other processes to execute and potentially batch requests. Additionally, limit the delay to an absolute max of 1/50th of a second. -ben Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com> diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c index 429f4b2..9c84ad0 100644 --- a/fs/jbd/transaction.c +++ b/fs/jbd/transaction.c @@ -1335,10 +1335,12 @@ int journal_stop(handle_t *handle) * by 30x or more... */ if (handle->h_sync) { + long start = jiffies; do { old_handle_count = transaction->t_handle_count; - schedule_timeout_uninterruptible(1); - } while (old_handle_count != transaction->t_handle_count); + yield(); + } while (old_handle_count != transaction->t_handle_count && + (jiffies - start < HZ/50)); } current->journal_info = NULL; -- "You know, I've seen some crystals do some pretty trippy shit, man." Don't Email: <dont@kvack.org>. ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [JBD] change batching logic to improve O_SYNC performance 2005-12-15 14:59 [JBD] change batching logic to improve O_SYNC performance Benjamin LaHaise @ 2005-12-15 14:22 ` Ric Wheeler 2005-12-15 23:55 ` Andrew Morton 1 sibling, 0 replies; 5+ messages in thread From: Ric Wheeler @ 2005-12-15 14:22 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: akpm, sct, linux-fsdevel, Jens Axboe, Chris Mason Benjamin LaHaise wrote: > Hello folks, > > When writing files out using O_SYNC, jbd's 1 jiffy delay results in a > significant drop in throughput as the disk sits idle. The patch below > results in a 4-5x performance improvement (from 6.5MB/s to ~24-30MB/s on > my IDE test box) when writing out files using O_SYNC. Instead of always > delaying for 1 jiffy when trying to batch, merely do a yield() to allow > other processes to execute and potentially batch requests. Additionally, > limit the delay to an absolute max of 1/50th of a second. > > -ben > I was playing with similar code in reiserfs last year, trying to tune for fsync() performance. The challenge is that when the rate of fsync() arrivals is low, you don't want to introduce extra latency for these "lonely" requests while you wait for a request to combine which is not going to come any time soon. On the other hand, if the rate of fsync() requests is high enough, you gain a huge performance win by combining multiple fsync()'s into one transaction. This feels a bit like the IO combining logic in the io scheduler layer. It would seem to be begging for some dynamic logic based on the arrival rate for both the fsync & synchronous write combining... ric ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [JBD] change batching logic to improve O_SYNC performance 2005-12-15 14:59 [JBD] change batching logic to improve O_SYNC performance Benjamin LaHaise 2005-12-15 14:22 ` Ric Wheeler @ 2005-12-15 23:55 ` Andrew Morton 2005-12-15 21:39 ` Ric Wheeler 2005-12-16 0:48 ` Andreas Dilger 1 sibling, 2 replies; 5+ messages in thread From: Andrew Morton @ 2005-12-15 23:55 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: sct, linux-fsdevel Benjamin LaHaise <bcrl@kvack.org> wrote: > > Hello folks, > > When writing files out using O_SYNC, jbd's 1 jiffy delay results in a > significant drop in throughput as the disk sits idle. The patch below > results in a 4-5x performance improvement (from 6.5MB/s to ~24-30MB/s on > my IDE test box) when writing out files using O_SYNC. That's really sad. Thanks for working that out. > Instead of always > delaying for 1 jiffy when trying to batch, merely do a yield() to allow > other processes to execute and potentially batch requests. Yeah, 2.4 has yield(). The O(1) yield semantics resulted in a performance catastrophe in ext3 when the system was busy, so the batching code got changed to a one-jiffy-sleep. I don't think we can go back to yield(). Worst-case we should just dump the batching code: single-threaded O_SYNC/fsync is probably a commoner case than multi-threaded, dunno. But surely we can do better than that. How's about something simple like just saying "if the last process which did a synchronous write is not this process, do the batching thing". Something like this? fs/jbd/transaction.c | 10 +++++++++- include/linux/jbd.h | 4 ++++ 2 files changed, 13 insertions(+), 1 deletion(-) diff -puN fs/jbd/transaction.c~jbd-fix-transaction-batching fs/jbd/transaction.c --- 25/fs/jbd/transaction.c~jbd-fix-transaction-batching Thu Dec 15 15:47:52 2005 +++ 25-akpm/fs/jbd/transaction.c Thu Dec 15 15:53:28 2005 @@ -1308,6 +1308,7 @@ int journal_stop(handle_t *handle) transaction_t *transaction = handle->h_transaction; journal_t *journal = transaction->t_journal; int old_handle_count, err; + pid_t pid; J_ASSERT(transaction->t_updates > 0); J_ASSERT(journal_current_handle() == handle); @@ -1333,8 +1334,15 @@ int journal_stop(handle_t *handle) * It doesn't cost much - we're about to run a commit and sleep * on IO anyway. Speeds up many-threaded, many-dir operations * by 30x or more... + * + * But don't do this if this process was the most recent one to + * perform a synchronous write. We do this to detect the case where a + * single process is doing a stream of sync writes. No point in waiting + * for joiners in that case. */ - if (handle->h_sync) { + pid = current->pid; + if (handle->h_sync && journal->j_last_sync_writer != pid) { + journal->j_last_sync_writer = pid; do { old_handle_count = transaction->t_handle_count; schedule_timeout_uninterruptible(1); diff -puN include/linux/jbd.h~jbd-fix-transaction-batching include/linux/jbd.h --- 25/include/linux/jbd.h~jbd-fix-transaction-batching Thu Dec 15 15:48:25 2005 +++ 25-akpm/include/linux/jbd.h Thu Dec 15 15:53:46 2005 @@ -23,6 +23,7 @@ #define jfs_debug jbd_debug #else +#include <linux/types.h> #include <linux/buffer_head.h> #include <linux/journal-head.h> #include <linux/stddef.h> @@ -618,6 +619,7 @@ struct transaction_s * @j_wbuf: array of buffer_heads for journal_commit_transaction * @j_wbufsize: maximum number of buffer_heads allowed in j_wbuf, the * number that will fit in j_blocksize + * @j_last_sync_writer: most recent pid which did a synchronous write * @j_private: An opaque pointer to fs-private information. */ @@ -807,6 +809,8 @@ struct journal_s struct buffer_head **j_wbuf; int j_wbufsize; + pid_t j_last_sync_writer; + /* * An opaque pointer to fs-private information. ext3 puts its * superblock pointer here _ ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [JBD] change batching logic to improve O_SYNC performance 2005-12-15 23:55 ` Andrew Morton @ 2005-12-15 21:39 ` Ric Wheeler 2005-12-16 0:48 ` Andreas Dilger 1 sibling, 0 replies; 5+ messages in thread From: Ric Wheeler @ 2005-12-15 21:39 UTC (permalink / raw) To: Andrew Morton; +Cc: Benjamin LaHaise, sct, linux-fsdevel Andrew Morton wrote: > Benjamin LaHaise <bcrl@kvack.org> wrote: > >>Hello folks, >> >>When writing files out using O_SYNC, jbd's 1 jiffy delay results in a >>significant drop in throughput as the disk sits idle. The patch below >>results in a 4-5x performance improvement (from 6.5MB/s to ~24-30MB/s on >>my IDE test box) when writing out files using O_SYNC. > > > That's really sad. Thanks for working that out. > > >> Instead of always >>delaying for 1 jiffy when trying to batch, merely do a yield() to allow >>other processes to execute and potentially batch requests. > > > Yeah, 2.4 has yield(). The O(1) yield semantics resulted in a performance > catastrophe in ext3 when the system was busy, so the batching code got > changed to a one-jiffy-sleep. I don't think we can go back to yield(). > > Worst-case we should just dump the batching code: single-threaded > O_SYNC/fsync is probably a commoner case than multi-threaded, dunno. I think that the above assumption might be true for a single threaded O_SYNC process, but is not normally true for fsync() heavy workloads. We have a multi-threaded write workload since we can boost files/sec by about 4-5x the single threaded write rate. Using the a properly configured write barrier (highly recommended if you care about your data ;-)) makes the cost of a fsync() call quite high so batching is a huge win. I think that NFS servers and other multi-threaded apps (mail servers?) might have a similar profile . In these cases, you definitely benefit by combining multiple fsync() requests in one disk operation. > > But surely we can do better than that. > > How's about something simple like just saying "if the last process which > did a synchronous write is not this process, do the batching thing". > > Despite some obvious complexity, I still think that adjusting the delay based on rate of the synchronous requests would be the best case. For example, even in the O_SYNC write case, if you have a single thread writing to disk in rapid succession, any delay is probably a waste. Another way to attack this is to actually expose some of the transacation mechanisms to the applications so they can do some explicit control over the commit phase which could be used to build batched fsync(), etc. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [JBD] change batching logic to improve O_SYNC performance 2005-12-15 23:55 ` Andrew Morton 2005-12-15 21:39 ` Ric Wheeler @ 2005-12-16 0:48 ` Andreas Dilger 1 sibling, 0 replies; 5+ messages in thread From: Andreas Dilger @ 2005-12-16 0:48 UTC (permalink / raw) To: Andrew Morton; +Cc: Benjamin LaHaise, sct, linux-fsdevel, Alex Tomas On Dec 15, 2005 15:55 -0800, Andrew Morton wrote: > Yeah, 2.4 has yield(). The O(1) yield semantics resulted in a performance > catastrophe in ext3 when the system was busy, so the batching code got > changed to a one-jiffy-sleep. I don't think we can go back to yield(). > > Worst-case we should just dump the batching code: single-threaded > O_SYNC/fsync is probably a commoner case than multi-threaded, dunno. I hope we don't discard the sync batching entirely. This is commonly used by knfsd and also Lustre to aggregate sync IO operations from many clients. Having to wait for a whole transaction commit for every RPC would suck badly. > How's about something simple like just saying "if the last process which > did a synchronous write is not this process, do the batching thing". > > > Something like this? Looks interesting at least. > diff -puN fs/jbd/transaction.c~jbd-fix-transaction-batching fs/jbd/transaction.c > --- 25/fs/jbd/transaction.c~jbd-fix-transaction-batching Thu Dec 15 15:47:52 2005 > +++ 25-akpm/fs/jbd/transaction.c Thu Dec 15 15:53:28 2005 > @@ -1308,6 +1308,7 @@ int journal_stop(handle_t *handle) > transaction_t *transaction = handle->h_transaction; > journal_t *journal = transaction->t_journal; > int old_handle_count, err; > + pid_t pid; > > J_ASSERT(transaction->t_updates > 0); > J_ASSERT(journal_current_handle() == handle); > @@ -1333,8 +1334,15 @@ int journal_stop(handle_t *handle) > * It doesn't cost much - we're about to run a commit and sleep > * on IO anyway. Speeds up many-threaded, many-dir operations > * by 30x or more... > + * > + * But don't do this if this process was the most recent one to > + * perform a synchronous write. We do this to detect the case where a > + * single process is doing a stream of sync writes. No point in waiting > + * for joiners in that case. > */ > - if (handle->h_sync) { > + pid = current->pid; > + if (handle->h_sync && journal->j_last_sync_writer != pid) { > + journal->j_last_sync_writer = pid; > do { > old_handle_count = transaction->t_handle_count; > schedule_timeout_uninterruptible(1); > diff -puN include/linux/jbd.h~jbd-fix-transaction-batching include/linux/jbd.h > --- 25/include/linux/jbd.h~jbd-fix-transaction-batching Thu Dec 15 15:48:25 2005 > +++ 25-akpm/include/linux/jbd.h Thu Dec 15 15:53:46 2005 > @@ -23,6 +23,7 @@ > #define jfs_debug jbd_debug > #else > > +#include <linux/types.h> > #include <linux/buffer_head.h> > #include <linux/journal-head.h> > #include <linux/stddef.h> > @@ -618,6 +619,7 @@ struct transaction_s > * @j_wbuf: array of buffer_heads for journal_commit_transaction > * @j_wbufsize: maximum number of buffer_heads allowed in j_wbuf, the > * number that will fit in j_blocksize > + * @j_last_sync_writer: most recent pid which did a synchronous write > * @j_private: An opaque pointer to fs-private information. > */ > > @@ -807,6 +809,8 @@ struct journal_s > struct buffer_head **j_wbuf; > int j_wbufsize; > > + pid_t j_last_sync_writer; > + > /* > * An opaque pointer to fs-private information. ext3 puts its > * superblock pointer here > _ > > - > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2005-12-16 0:48 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2005-12-15 14:59 [JBD] change batching logic to improve O_SYNC performance Benjamin LaHaise 2005-12-15 14:22 ` Ric Wheeler 2005-12-15 23:55 ` Andrew Morton 2005-12-15 21:39 ` Ric Wheeler 2005-12-16 0:48 ` Andreas Dilger
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).