* Re: [JBD] change batching logic to improve O_SYNC performance
2005-12-15 14:59 [JBD] change batching logic to improve O_SYNC performance Benjamin LaHaise
@ 2005-12-15 14:22 ` Ric Wheeler
2005-12-15 23:55 ` Andrew Morton
1 sibling, 0 replies; 5+ messages in thread
From: Ric Wheeler @ 2005-12-15 14:22 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: akpm, sct, linux-fsdevel, Jens Axboe, Chris Mason
Benjamin LaHaise wrote:
> Hello folks,
>
> When writing files out using O_SYNC, jbd's 1 jiffy delay results in a
> significant drop in throughput as the disk sits idle. The patch below
> results in a 4-5x performance improvement (from 6.5MB/s to ~24-30MB/s on
> my IDE test box) when writing out files using O_SYNC. Instead of always
> delaying for 1 jiffy when trying to batch, merely do a yield() to allow
> other processes to execute and potentially batch requests. Additionally,
> limit the delay to an absolute max of 1/50th of a second.
>
> -ben
>
I was playing with similar code in reiserfs last year, trying to tune
for fsync() performance.
The challenge is that when the rate of fsync() arrivals is low, you
don't want to introduce extra latency for these "lonely" requests while
you wait for a request to combine which is not going to come any time soon.
On the other hand, if the rate of fsync() requests is high enough, you
gain a huge performance win by combining multiple fsync()'s into one
transaction.
This feels a bit like the IO combining logic in the io scheduler layer.
It would seem to be begging for some dynamic logic based on the arrival
rate for both the fsync & synchronous write combining...
ric
^ permalink raw reply [flat|nested] 5+ messages in thread
* [JBD] change batching logic to improve O_SYNC performance
@ 2005-12-15 14:59 Benjamin LaHaise
2005-12-15 14:22 ` Ric Wheeler
2005-12-15 23:55 ` Andrew Morton
0 siblings, 2 replies; 5+ messages in thread
From: Benjamin LaHaise @ 2005-12-15 14:59 UTC (permalink / raw)
To: akpm, sct; +Cc: linux-fsdevel
Hello folks,
When writing files out using O_SYNC, jbd's 1 jiffy delay results in a
significant drop in throughput as the disk sits idle. The patch below
results in a 4-5x performance improvement (from 6.5MB/s to ~24-30MB/s on
my IDE test box) when writing out files using O_SYNC. Instead of always
delaying for 1 jiffy when trying to batch, merely do a yield() to allow
other processes to execute and potentially batch requests. Additionally,
limit the delay to an absolute max of 1/50th of a second.
-ben
Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 429f4b2..9c84ad0 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -1335,10 +1335,12 @@ int journal_stop(handle_t *handle)
* by 30x or more...
*/
if (handle->h_sync) {
+ long start = jiffies;
do {
old_handle_count = transaction->t_handle_count;
- schedule_timeout_uninterruptible(1);
- } while (old_handle_count != transaction->t_handle_count);
+ yield();
+ } while (old_handle_count != transaction->t_handle_count &&
+ (jiffies - start < HZ/50));
}
current->journal_info = NULL;
--
"You know, I've seen some crystals do some pretty trippy shit, man."
Don't Email: <dont@kvack.org>.
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [JBD] change batching logic to improve O_SYNC performance
2005-12-15 23:55 ` Andrew Morton
@ 2005-12-15 21:39 ` Ric Wheeler
2005-12-16 0:48 ` Andreas Dilger
1 sibling, 0 replies; 5+ messages in thread
From: Ric Wheeler @ 2005-12-15 21:39 UTC (permalink / raw)
To: Andrew Morton; +Cc: Benjamin LaHaise, sct, linux-fsdevel
Andrew Morton wrote:
> Benjamin LaHaise <bcrl@kvack.org> wrote:
>
>>Hello folks,
>>
>>When writing files out using O_SYNC, jbd's 1 jiffy delay results in a
>>significant drop in throughput as the disk sits idle. The patch below
>>results in a 4-5x performance improvement (from 6.5MB/s to ~24-30MB/s on
>>my IDE test box) when writing out files using O_SYNC.
>
>
> That's really sad. Thanks for working that out.
>
>
>> Instead of always
>>delaying for 1 jiffy when trying to batch, merely do a yield() to allow
>>other processes to execute and potentially batch requests.
>
>
> Yeah, 2.4 has yield(). The O(1) yield semantics resulted in a performance
> catastrophe in ext3 when the system was busy, so the batching code got
> changed to a one-jiffy-sleep. I don't think we can go back to yield().
>
> Worst-case we should just dump the batching code: single-threaded
> O_SYNC/fsync is probably a commoner case than multi-threaded, dunno.
I think that the above assumption might be true for a single threaded
O_SYNC process, but is not normally true for fsync() heavy workloads.
We have a multi-threaded write workload since we can boost files/sec by
about 4-5x the single threaded write rate. Using the a properly
configured write barrier (highly recommended if you care about your data
;-)) makes the cost of a fsync() call quite high so batching is a huge win.
I think that NFS servers and other multi-threaded apps (mail servers?)
might have a similar profile . In these cases, you definitely benefit by
combining multiple fsync() requests in one disk operation.
>
> But surely we can do better than that.
>
> How's about something simple like just saying "if the last process which
> did a synchronous write is not this process, do the batching thing".
>
>
Despite some obvious complexity, I still think that adjusting the delay
based on rate of the synchronous requests would be the best case. For
example, even in the O_SYNC write case, if you have a single thread
writing to disk in rapid succession, any delay is probably a waste.
Another way to attack this is to actually expose some of the
transacation mechanisms to the applications so they can do some explicit
control over the commit phase which could be used to build batched
fsync(), etc.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [JBD] change batching logic to improve O_SYNC performance
2005-12-15 14:59 [JBD] change batching logic to improve O_SYNC performance Benjamin LaHaise
2005-12-15 14:22 ` Ric Wheeler
@ 2005-12-15 23:55 ` Andrew Morton
2005-12-15 21:39 ` Ric Wheeler
2005-12-16 0:48 ` Andreas Dilger
1 sibling, 2 replies; 5+ messages in thread
From: Andrew Morton @ 2005-12-15 23:55 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: sct, linux-fsdevel
Benjamin LaHaise <bcrl@kvack.org> wrote:
>
> Hello folks,
>
> When writing files out using O_SYNC, jbd's 1 jiffy delay results in a
> significant drop in throughput as the disk sits idle. The patch below
> results in a 4-5x performance improvement (from 6.5MB/s to ~24-30MB/s on
> my IDE test box) when writing out files using O_SYNC.
That's really sad. Thanks for working that out.
> Instead of always
> delaying for 1 jiffy when trying to batch, merely do a yield() to allow
> other processes to execute and potentially batch requests.
Yeah, 2.4 has yield(). The O(1) yield semantics resulted in a performance
catastrophe in ext3 when the system was busy, so the batching code got
changed to a one-jiffy-sleep. I don't think we can go back to yield().
Worst-case we should just dump the batching code: single-threaded
O_SYNC/fsync is probably a commoner case than multi-threaded, dunno.
But surely we can do better than that.
How's about something simple like just saying "if the last process which
did a synchronous write is not this process, do the batching thing".
Something like this?
fs/jbd/transaction.c | 10 +++++++++-
include/linux/jbd.h | 4 ++++
2 files changed, 13 insertions(+), 1 deletion(-)
diff -puN fs/jbd/transaction.c~jbd-fix-transaction-batching fs/jbd/transaction.c
--- 25/fs/jbd/transaction.c~jbd-fix-transaction-batching Thu Dec 15 15:47:52 2005
+++ 25-akpm/fs/jbd/transaction.c Thu Dec 15 15:53:28 2005
@@ -1308,6 +1308,7 @@ int journal_stop(handle_t *handle)
transaction_t *transaction = handle->h_transaction;
journal_t *journal = transaction->t_journal;
int old_handle_count, err;
+ pid_t pid;
J_ASSERT(transaction->t_updates > 0);
J_ASSERT(journal_current_handle() == handle);
@@ -1333,8 +1334,15 @@ int journal_stop(handle_t *handle)
* It doesn't cost much - we're about to run a commit and sleep
* on IO anyway. Speeds up many-threaded, many-dir operations
* by 30x or more...
+ *
+ * But don't do this if this process was the most recent one to
+ * perform a synchronous write. We do this to detect the case where a
+ * single process is doing a stream of sync writes. No point in waiting
+ * for joiners in that case.
*/
- if (handle->h_sync) {
+ pid = current->pid;
+ if (handle->h_sync && journal->j_last_sync_writer != pid) {
+ journal->j_last_sync_writer = pid;
do {
old_handle_count = transaction->t_handle_count;
schedule_timeout_uninterruptible(1);
diff -puN include/linux/jbd.h~jbd-fix-transaction-batching include/linux/jbd.h
--- 25/include/linux/jbd.h~jbd-fix-transaction-batching Thu Dec 15 15:48:25 2005
+++ 25-akpm/include/linux/jbd.h Thu Dec 15 15:53:46 2005
@@ -23,6 +23,7 @@
#define jfs_debug jbd_debug
#else
+#include <linux/types.h>
#include <linux/buffer_head.h>
#include <linux/journal-head.h>
#include <linux/stddef.h>
@@ -618,6 +619,7 @@ struct transaction_s
* @j_wbuf: array of buffer_heads for journal_commit_transaction
* @j_wbufsize: maximum number of buffer_heads allowed in j_wbuf, the
* number that will fit in j_blocksize
+ * @j_last_sync_writer: most recent pid which did a synchronous write
* @j_private: An opaque pointer to fs-private information.
*/
@@ -807,6 +809,8 @@ struct journal_s
struct buffer_head **j_wbuf;
int j_wbufsize;
+ pid_t j_last_sync_writer;
+
/*
* An opaque pointer to fs-private information. ext3 puts its
* superblock pointer here
_
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [JBD] change batching logic to improve O_SYNC performance
2005-12-15 23:55 ` Andrew Morton
2005-12-15 21:39 ` Ric Wheeler
@ 2005-12-16 0:48 ` Andreas Dilger
1 sibling, 0 replies; 5+ messages in thread
From: Andreas Dilger @ 2005-12-16 0:48 UTC (permalink / raw)
To: Andrew Morton; +Cc: Benjamin LaHaise, sct, linux-fsdevel, Alex Tomas
On Dec 15, 2005 15:55 -0800, Andrew Morton wrote:
> Yeah, 2.4 has yield(). The O(1) yield semantics resulted in a performance
> catastrophe in ext3 when the system was busy, so the batching code got
> changed to a one-jiffy-sleep. I don't think we can go back to yield().
>
> Worst-case we should just dump the batching code: single-threaded
> O_SYNC/fsync is probably a commoner case than multi-threaded, dunno.
I hope we don't discard the sync batching entirely. This is commonly used
by knfsd and also Lustre to aggregate sync IO operations from many clients.
Having to wait for a whole transaction commit for every RPC would suck badly.
> How's about something simple like just saying "if the last process which
> did a synchronous write is not this process, do the batching thing".
>
>
> Something like this?
Looks interesting at least.
> diff -puN fs/jbd/transaction.c~jbd-fix-transaction-batching fs/jbd/transaction.c
> --- 25/fs/jbd/transaction.c~jbd-fix-transaction-batching Thu Dec 15 15:47:52 2005
> +++ 25-akpm/fs/jbd/transaction.c Thu Dec 15 15:53:28 2005
> @@ -1308,6 +1308,7 @@ int journal_stop(handle_t *handle)
> transaction_t *transaction = handle->h_transaction;
> journal_t *journal = transaction->t_journal;
> int old_handle_count, err;
> + pid_t pid;
>
> J_ASSERT(transaction->t_updates > 0);
> J_ASSERT(journal_current_handle() == handle);
> @@ -1333,8 +1334,15 @@ int journal_stop(handle_t *handle)
> * It doesn't cost much - we're about to run a commit and sleep
> * on IO anyway. Speeds up many-threaded, many-dir operations
> * by 30x or more...
> + *
> + * But don't do this if this process was the most recent one to
> + * perform a synchronous write. We do this to detect the case where a
> + * single process is doing a stream of sync writes. No point in waiting
> + * for joiners in that case.
> */
> - if (handle->h_sync) {
> + pid = current->pid;
> + if (handle->h_sync && journal->j_last_sync_writer != pid) {
> + journal->j_last_sync_writer = pid;
> do {
> old_handle_count = transaction->t_handle_count;
> schedule_timeout_uninterruptible(1);
> diff -puN include/linux/jbd.h~jbd-fix-transaction-batching include/linux/jbd.h
> --- 25/include/linux/jbd.h~jbd-fix-transaction-batching Thu Dec 15 15:48:25 2005
> +++ 25-akpm/include/linux/jbd.h Thu Dec 15 15:53:46 2005
> @@ -23,6 +23,7 @@
> #define jfs_debug jbd_debug
> #else
>
> +#include <linux/types.h>
> #include <linux/buffer_head.h>
> #include <linux/journal-head.h>
> #include <linux/stddef.h>
> @@ -618,6 +619,7 @@ struct transaction_s
> * @j_wbuf: array of buffer_heads for journal_commit_transaction
> * @j_wbufsize: maximum number of buffer_heads allowed in j_wbuf, the
> * number that will fit in j_blocksize
> + * @j_last_sync_writer: most recent pid which did a synchronous write
> * @j_private: An opaque pointer to fs-private information.
> */
>
> @@ -807,6 +809,8 @@ struct journal_s
> struct buffer_head **j_wbuf;
> int j_wbufsize;
>
> + pid_t j_last_sync_writer;
> +
> /*
> * An opaque pointer to fs-private information. ext3 puts its
> * superblock pointer here
> _
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2005-12-16 0:48 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-12-15 14:59 [JBD] change batching logic to improve O_SYNC performance Benjamin LaHaise
2005-12-15 14:22 ` Ric Wheeler
2005-12-15 23:55 ` Andrew Morton
2005-12-15 21:39 ` Ric Wheeler
2005-12-16 0:48 ` Andreas Dilger
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).