Re: [JBD] change batching logic to improve O

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [JBD] change batching logic to improve O_SYNC performance
  2005-12-15 14:59 [JBD] change batching logic to improve O_SYNC performance Benjamin LaHaise
@ 2005-12-15 14:22 ` Ric Wheeler
  2005-12-15 23:55 ` Andrew Morton
  1 sibling, 0 replies; 5+ messages in thread
From: Ric Wheeler @ 2005-12-15 14:22 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: akpm, sct, linux-fsdevel, Jens Axboe, Chris Mason

Benjamin LaHaise wrote:
> Hello folks,
> 
> When writing files out using O_SYNC, jbd's 1 jiffy delay results in a 
> significant drop in throughput as the disk sits idle.  The patch below 
> results in a 4-5x performance improvement (from 6.5MB/s to ~24-30MB/s on 
> my IDE test box) when writing out files using O_SYNC.  Instead of always 
> delaying for 1 jiffy when trying to batch, merely do a yield() to allow 
> other processes to execute and potentially batch requests.  Additionally, 
> limit the delay to an absolute max of 1/50th of a second.  
> 
> 		-ben
> 

I was playing with similar code in reiserfs last year, trying to tune 
for fsync() performance.

The challenge is that when the rate of fsync() arrivals is low, you 
don't want to introduce extra latency for these "lonely" requests while 
you wait for a request to combine which is not going to come any time soon.

On the other hand, if the rate of fsync() requests is high enough, you 
gain a huge performance win by combining multiple fsync()'s into one 
transaction.

This feels a bit like the IO combining logic in the io scheduler layer.

It would seem to be begging for some dynamic logic based on the arrival 
rate for both the fsync & synchronous write combining...

ric

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [JBD] change batching logic to improve O_SYNC performance
@ 2005-12-15 14:59 Benjamin LaHaise
  2005-12-15 14:22 ` Ric Wheeler
  2005-12-15 23:55 ` Andrew Morton
  0 siblings, 2 replies; 5+ messages in thread
From: Benjamin LaHaise @ 2005-12-15 14:59 UTC (permalink / raw)
  To: akpm, sct; +Cc: linux-fsdevel

Hello folks,

When writing files out using O_SYNC, jbd's 1 jiffy delay results in a 
significant drop in throughput as the disk sits idle.  The patch below 
results in a 4-5x performance improvement (from 6.5MB/s to ~24-30MB/s on 
my IDE test box) when writing out files using O_SYNC.  Instead of always 
delaying for 1 jiffy when trying to batch, merely do a yield() to allow 
other processes to execute and potentially batch requests.  Additionally, 
limit the delay to an absolute max of 1/50th of a second.  

		-ben

Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 429f4b2..9c84ad0 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -1335,10 +1335,12 @@ int journal_stop(handle_t *handle)
 	 * by 30x or more...
 	 */
 	if (handle->h_sync) {
+		long start = jiffies;
 		do {
 			old_handle_count = transaction->t_handle_count;
-			schedule_timeout_uninterruptible(1);
-		} while (old_handle_count != transaction->t_handle_count);
+			yield();
+		} while (old_handle_count != transaction->t_handle_count &&
+			 (jiffies - start < HZ/50));
 	}

 	current->journal_info = NULL;
-- 
"You know, I've seen some crystals do some pretty trippy shit, man."
Don't Email: <dont@kvack.org>.

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [JBD] change batching logic to improve O_SYNC performance
  2005-12-15 23:55 ` Andrew Morton
@ 2005-12-15 21:39   ` Ric Wheeler
  2005-12-16  0:48   ` Andreas Dilger
  1 sibling, 0 replies; 5+ messages in thread
From: Ric Wheeler @ 2005-12-15 21:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Benjamin LaHaise, sct, linux-fsdevel

Andrew Morton wrote:
> Benjamin LaHaise <bcrl@kvack.org> wrote:
> 
>>Hello folks,
>>
>>When writing files out using O_SYNC, jbd's 1 jiffy delay results in a 
>>significant drop in throughput as the disk sits idle.  The patch below 
>>results in a 4-5x performance improvement (from 6.5MB/s to ~24-30MB/s on 
>>my IDE test box) when writing out files using O_SYNC.
> 
> 
> That's really sad.   Thanks for working that out.
> 
> 
>> Instead of always 
>>delaying for 1 jiffy when trying to batch, merely do a yield() to allow 
>>other processes to execute and potentially batch requests.
> 
> 
> Yeah, 2.4 has yield().  The O(1) yield semantics resulted in a performance
> catastrophe in ext3 when the system was busy, so the batching code got
> changed to a one-jiffy-sleep.  I don't think we can go back to yield().
> 
> Worst-case we should just dump the batching code: single-threaded
> O_SYNC/fsync is probably a commoner case than multi-threaded, dunno.

I think that the above assumption might be true for a single threaded 
O_SYNC process, but is not normally true for fsync() heavy workloads.

We have a multi-threaded write workload since we can boost files/sec by 
about 4-5x the single threaded write rate.  Using the a properly 
configured write barrier (highly recommended if you care about your data 
;-)) makes the cost of a fsync() call quite high so batching is a huge win.

I think that NFS servers and other multi-threaded apps (mail servers?) 
might have a similar profile . In these cases, you definitely benefit by 
combining multiple fsync() requests in one disk operation.

> 
> But surely we can do better than that.
> 
> How's about something simple like just saying "if the last process which
> did a synchronous write is not this process, do the batching thing".
> 
> 

Despite some obvious complexity, I still think that adjusting the delay 
based on rate of the synchronous requests would be the best case.  For 
example, even in the O_SYNC write case, if you have a single thread 
writing to disk in rapid succession, any delay is probably a waste.

Another way to attack this is to actually expose some of the 
transacation mechanisms to the applications so they can do some explicit 
control over the commit phase which could be used to build batched 
fsync(), etc.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [JBD] change batching logic to improve O_SYNC performance
  2005-12-15 14:59 [JBD] change batching logic to improve O_SYNC performance Benjamin LaHaise
  2005-12-15 14:22 ` Ric Wheeler
@ 2005-12-15 23:55 ` Andrew Morton
  2005-12-15 21:39   ` Ric Wheeler
  2005-12-16  0:48   ` Andreas Dilger
  1 sibling, 2 replies; 5+ messages in thread
From: Andrew Morton @ 2005-12-15 23:55 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: sct, linux-fsdevel

Benjamin LaHaise <bcrl@kvack.org> wrote:
>
> Hello folks,
> 
> When writing files out using O_SYNC, jbd's 1 jiffy delay results in a 
> significant drop in throughput as the disk sits idle.  The patch below 
> results in a 4-5x performance improvement (from 6.5MB/s to ~24-30MB/s on 
> my IDE test box) when writing out files using O_SYNC.

That's really sad.   Thanks for working that out.

>  Instead of always 
> delaying for 1 jiffy when trying to batch, merely do a yield() to allow 
> other processes to execute and potentially batch requests.

Yeah, 2.4 has yield().  The O(1) yield semantics resulted in a performance
catastrophe in ext3 when the system was busy, so the batching code got
changed to a one-jiffy-sleep.  I don't think we can go back to yield().

Worst-case we should just dump the batching code: single-threaded
O_SYNC/fsync is probably a commoner case than multi-threaded, dunno.

But surely we can do better than that.

How's about something simple like just saying "if the last process which
did a synchronous write is not this process, do the batching thing".


Something like this?


 fs/jbd/transaction.c |   10 +++++++++-
 include/linux/jbd.h  |    4 ++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff -puN fs/jbd/transaction.c~jbd-fix-transaction-batching fs/jbd/transaction.c
--- 25/fs/jbd/transaction.c~jbd-fix-transaction-batching	Thu Dec 15 15:47:52 2005
+++ 25-akpm/fs/jbd/transaction.c	Thu Dec 15 15:53:28 2005
@@ -1308,6 +1308,7 @@ int journal_stop(handle_t *handle)
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal = transaction->t_journal;
 	int old_handle_count, err;
+	pid_t pid;
 
 	J_ASSERT(transaction->t_updates > 0);
 	J_ASSERT(journal_current_handle() == handle);
@@ -1333,8 +1334,15 @@ int journal_stop(handle_t *handle)
 	 * It doesn't cost much - we're about to run a commit and sleep
 	 * on IO anyway.  Speeds up many-threaded, many-dir operations
 	 * by 30x or more...
+	 *
+	 * But don't do this if this process was the most recent one to
+	 * perform a synchronous write.  We do this to detect the case where a
+	 * single process is doing a stream of sync writes.  No point in waiting
+	 * for joiners in that case.
 	 */
-	if (handle->h_sync) {
+	pid = current->pid;
+	if (handle->h_sync && journal->j_last_sync_writer != pid) {
+		journal->j_last_sync_writer = pid;
 		do {
 			old_handle_count = transaction->t_handle_count;
 			schedule_timeout_uninterruptible(1);
diff -puN include/linux/jbd.h~jbd-fix-transaction-batching include/linux/jbd.h
--- 25/include/linux/jbd.h~jbd-fix-transaction-batching	Thu Dec 15 15:48:25 2005
+++ 25-akpm/include/linux/jbd.h	Thu Dec 15 15:53:46 2005
@@ -23,6 +23,7 @@
 #define jfs_debug jbd_debug
 #else
 
+#include <linux/types.h>
 #include <linux/buffer_head.h>
 #include <linux/journal-head.h>
 #include <linux/stddef.h>
@@ -618,6 +619,7 @@ struct transaction_s 
  * @j_wbuf: array of buffer_heads for journal_commit_transaction
  * @j_wbufsize: maximum number of buffer_heads allowed in j_wbuf, the
  *	number that will fit in j_blocksize
+ * @j_last_sync_writer: most recent pid which did a synchronous write
  * @j_private: An opaque pointer to fs-private information.
  */
 
@@ -807,6 +809,8 @@ struct journal_s
 	struct buffer_head	**j_wbuf;
 	int			j_wbufsize;
 
+	pid_t			j_last_sync_writer;
+
 	/*
 	 * An opaque pointer to fs-private information.  ext3 puts its
 	 * superblock pointer here
_


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [JBD] change batching logic to improve O_SYNC performance
  2005-12-15 23:55 ` Andrew Morton
  2005-12-15 21:39   ` Ric Wheeler
@ 2005-12-16  0:48   ` Andreas Dilger
  1 sibling, 0 replies; 5+ messages in thread
From: Andreas Dilger @ 2005-12-16  0:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Benjamin LaHaise, sct, linux-fsdevel, Alex Tomas

On Dec 15, 2005  15:55 -0800, Andrew Morton wrote:
> Yeah, 2.4 has yield().  The O(1) yield semantics resulted in a performance
> catastrophe in ext3 when the system was busy, so the batching code got
> changed to a one-jiffy-sleep.  I don't think we can go back to yield().
> 
> Worst-case we should just dump the batching code: single-threaded
> O_SYNC/fsync is probably a commoner case than multi-threaded, dunno.

I hope we don't discard the sync batching entirely.  This is commonly used
by knfsd and also Lustre to aggregate sync IO operations from many clients.
Having to wait for a whole transaction commit for every RPC would suck badly.

> How's about something simple like just saying "if the last process which
> did a synchronous write is not this process, do the batching thing".
> 
> 
> Something like this?

Looks interesting at least.

> diff -puN fs/jbd/transaction.c~jbd-fix-transaction-batching fs/jbd/transaction.c
> --- 25/fs/jbd/transaction.c~jbd-fix-transaction-batching	Thu Dec 15 15:47:52 2005
> +++ 25-akpm/fs/jbd/transaction.c	Thu Dec 15 15:53:28 2005
> @@ -1308,6 +1308,7 @@ int journal_stop(handle_t *handle)
>  	transaction_t *transaction = handle->h_transaction;
>  	journal_t *journal = transaction->t_journal;
>  	int old_handle_count, err;
> +	pid_t pid;
>  
>  	J_ASSERT(transaction->t_updates > 0);
>  	J_ASSERT(journal_current_handle() == handle);
> @@ -1333,8 +1334,15 @@ int journal_stop(handle_t *handle)
>  	 * It doesn't cost much - we're about to run a commit and sleep
>  	 * on IO anyway.  Speeds up many-threaded, many-dir operations
>  	 * by 30x or more...
> +	 *
> +	 * But don't do this if this process was the most recent one to
> +	 * perform a synchronous write.  We do this to detect the case where a
> +	 * single process is doing a stream of sync writes.  No point in waiting
> +	 * for joiners in that case.
>  	 */
> -	if (handle->h_sync) {
> +	pid = current->pid;
> +	if (handle->h_sync && journal->j_last_sync_writer != pid) {
> +		journal->j_last_sync_writer = pid;
>  		do {
>  			old_handle_count = transaction->t_handle_count;
>  			schedule_timeout_uninterruptible(1);
> diff -puN include/linux/jbd.h~jbd-fix-transaction-batching include/linux/jbd.h
> --- 25/include/linux/jbd.h~jbd-fix-transaction-batching	Thu Dec 15 15:48:25 2005
> +++ 25-akpm/include/linux/jbd.h	Thu Dec 15 15:53:46 2005
> @@ -23,6 +23,7 @@
>  #define jfs_debug jbd_debug
>  #else
>  
> +#include <linux/types.h>
>  #include <linux/buffer_head.h>
>  #include <linux/journal-head.h>
>  #include <linux/stddef.h>
> @@ -618,6 +619,7 @@ struct transaction_s 
>   * @j_wbuf: array of buffer_heads for journal_commit_transaction
>   * @j_wbufsize: maximum number of buffer_heads allowed in j_wbuf, the
>   *	number that will fit in j_blocksize
> + * @j_last_sync_writer: most recent pid which did a synchronous write
>   * @j_private: An opaque pointer to fs-private information.
>   */
>  
> @@ -807,6 +809,8 @@ struct journal_s
>  	struct buffer_head	**j_wbuf;
>  	int			j_wbufsize;
>  
> +	pid_t			j_last_sync_writer;
> +
>  	/*
>  	 * An opaque pointer to fs-private information.  ext3 puts its
>  	 * superblock pointer here
> _
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2005-12-16  0:48 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-12-15 14:59 [JBD] change batching logic to improve O_SYNC performance Benjamin LaHaise
2005-12-15 14:22 ` Ric Wheeler
2005-12-15 23:55 ` Andrew Morton
2005-12-15 21:39   ` Ric Wheeler
2005-12-16  0:48   ` Andreas Dilger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).