* [performance bug] kernel building regression on 64 LCPUs machine
@ 2011-01-19  1:55 Alex,Shi
  2011-01-19  2:03 ` Shaohua Li
                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Alex,Shi @ 2011-01-19  1:55 UTC (permalink / raw)
  To: czoccolo
  Cc: vgoyal, jaxboe, linux-kernel@vger.kernel.org, Li, Shaohua,
	Chen, Tim C
Shaohua and I tested kernel building performance on latest kernel. and
found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file
system. We find this performance dropping is due to commit
749ef9f8423054e326f. If we revert this patch or just change the
WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be
recovered. 
iostat report show with the commit, read request merge number increased
and write request merge dropped. The total request size increased and
queue length dropped. So we tested another patch: only change WRITE_SYNC
to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected.
we didn't test deadline IO mode, just test cfq. seems insert write
request into sync queue effect much read performance, but we don't know
details. What's your comments of this? 
iostat of .37 kernel:
rrqm/s   wrqm/s r/s     w/s     rMB/s  wMB/s avgrq-sz avgqu-sz   await  svctm  %util
22.5772 96.46 92.3742 14.747  1.0048  0.439474 34.8557 0.18078 3.8076 0.30447 2.94302
iostat of commit reverted .37:
26.6223 80.21 107.875 6.03538 1.51415 0 41.3275 0.153385 3.80569 0.377231 3.22323
vmstat report show, read bandwidth dropping: 
vmstat of .37:
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
3.4 52.6 0.0 64303937.0 16466.7 121544.5 0.0 0.0 2102.7 1914.6 7414.1 3185.7 2.0 1.0 80.3 16.7 0.0
vmstat of revert all from .37
2.2 35.8 0.0 64306767.4 17265.6 126101.2 0.0 0.0 2415.8 1619.1 8532.2 3556.2 2.5 1.1 83.0 13.3 0.0
Regards
Alex
===
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 34a4861..27ac2f3 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal)
 	int first_tag = 0;
 	int tag_flag;
 	int i;
-	int write_op = WRITE_SYNC;
+	int write_op = WRITE;
 
 	/*
 	 * First job: lock down the current transaction and wait for
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index f3ad159..69ff08e 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	int tag_bytes = journal_tag_bytes(journal);
 	struct buffer_head *cbh = NULL; /* For transactional checksums */
 	__u32 crc32_sum = ~0;
-	int write_op = WRITE_SYNC;
+	int write_op = WRITE;
 
 	/*
 	 * First job: lock down the current transaction and wait for
^ permalink raw reply related	[flat|nested] 38+ messages in thread* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-01-19 1:55 [performance bug] kernel building regression on 64 LCPUs machine Alex,Shi @ 2011-01-19 2:03 ` Shaohua Li 2011-01-19 12:56 ` Jan Kara 2011-01-20 15:16 ` Vivek Goyal 2011-01-19 14:32 ` Ted Ts'o 2011-01-21 7:23 ` Corrado Zoccolo 2 siblings, 2 replies; 38+ messages in thread From: Shaohua Li @ 2011-01-19 2:03 UTC (permalink / raw) To: Shi, Alex, jack, tytso Cc: czoccolo@gmail.com, vgoyal@redhat.com, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C add Jan and Theodore to the loop. On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote: > Shaohua and I tested kernel building performance on latest kernel. and > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file > system. We find this performance dropping is due to commit > 749ef9f8423054e326f. If we revert this patch or just change the > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be > recovered. > > iostat report show with the commit, read request merge number increased > and write request merge dropped. The total request size increased and > queue length dropped. So we tested another patch: only change WRITE_SYNC > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected. since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue. > we didn't test deadline IO mode, just test cfq. seems insert write > request into sync queue effect much read performance, but we don't know > details. What's your comments of this? > > iostat of .37 kernel: > rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util > 22.5772 96.46 92.3742 14.747 1.0048 0.439474 34.8557 0.18078 3.8076 0.30447 2.94302 > iostat of commit reverted .37: > 26.6223 80.21 107.875 6.03538 1.51415 0 41.3275 0.153385 3.80569 0.377231 3.22323 > > vmstat report show, read bandwidth dropping: > vmstat of .37: > r b swpd free buff cache si so bi bo in cs us sy id wa st > 3.4 52.6 0.0 64303937.0 16466.7 121544.5 0.0 0.0 2102.7 1914.6 7414.1 3185.7 2.0 1.0 80.3 16.7 0.0 > vmstat of revert all from .37 > 2.2 35.8 0.0 64306767.4 17265.6 126101.2 0.0 0.0 2415.8 1619.1 8532.2 3556.2 2.5 1.1 83.0 13.3 0.0 > > Regards > Alex > > === > diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c > index 34a4861..27ac2f3 100644 > --- a/fs/jbd/commit.c > +++ b/fs/jbd/commit.c > @@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal) > int first_tag = 0; > int tag_flag; > int i; > - int write_op = WRITE_SYNC; > + int write_op = WRITE; > > /* > * First job: lock down the current transaction and wait for > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c > index f3ad159..69ff08e 100644 > --- a/fs/jbd2/commit.c > +++ b/fs/jbd2/commit.c > @@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) > int tag_bytes = journal_tag_bytes(journal); > struct buffer_head *cbh = NULL; /* For transactional checksums */ > __u32 crc32_sum = ~0; > - int write_op = WRITE_SYNC; > + int write_op = WRITE; > > /* > * First job: lock down the current transaction and wait for > > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-01-19 2:03 ` Shaohua Li @ 2011-01-19 12:56 ` Jan Kara 2011-01-20 7:52 ` Alex,Shi 2011-01-20 15:16 ` Vivek Goyal 1 sibling, 1 reply; 38+ messages in thread From: Jan Kara @ 2011-01-19 12:56 UTC (permalink / raw) To: Shaohua Li Cc: Shi, Alex, jack, tytso, czoccolo@gmail.com, vgoyal@redhat.com, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C On Wed 19-01-11 10:03:26, Shaohua Li wrote: > add Jan and Theodore to the loop. Thanks. > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote: > > Shaohua and I tested kernel building performance on latest kernel. and > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file > > system. We find this performance dropping is due to commit > > 749ef9f8423054e326f. If we revert this patch or just change the > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be > > recovered. > > > > iostat report show with the commit, read request merge number increased > > and write request merge dropped. The total request size increased and > > queue length dropped. So we tested another patch: only change WRITE_SYNC > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected. > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue. > > > we didn't test deadline IO mode, just test cfq. seems insert write > > request into sync queue effect much read performance, but we don't know > > details. What's your comments of this? Indeed it seems that the optimization of the case where we wait for the transaction is negatively impacting the performance when we are not. Does patch below help for your load? It refines the logic when WRITE_SYNC is needed (of course, we should also test whether the patch works for fsync heavy loads as well). The patch is mostly a proof of concept and only lightly tested so be careful... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --- >From 6060f56ac919ac6b146057787af219bbf2503ea0 Mon Sep 17 00:00:00 2001 From: Jan Kara <jack@suse.cz> Date: Wed, 19 Jan 2011 13:45:04 +0100 Subject: [PATCH] jbd2: Refine commit writeout logic Currently we write out all journal buffers in WRITE_SYNC mode. This improves performance for fsync heavy workloads but hinders performance when writes are mostly asynchronous. So add possibility for callers starting a transaction commit to specify whether they are going to wait for the commit and submit journal writes in WRITE_SYNC mode only in that case. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/fsync.c | 2 +- fs/ext4/super.c | 2 +- fs/jbd2/checkpoint.c | 2 +- fs/jbd2/commit.c | 4 ++-- fs/jbd2/journal.c | 19 ++++++++++--------- fs/jbd2/transaction.c | 13 ++++++------- fs/ocfs2/aops.c | 2 +- fs/ocfs2/super.c | 2 +- include/linux/jbd2.h | 18 ++++++++++-------- 9 files changed, 33 insertions(+), 31 deletions(-) diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 7829b28..19434da 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -198,7 +198,7 @@ int ext4_sync_file(struct file *file, int datasync) return ext4_force_commit(inode->i_sb); commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid; - if (jbd2_log_start_commit(journal, commit_tid)) { + if (jbd2_log_start_commit(journal, commit_tid, true)) { /* * When the journal is on a different device than the * fs data disk, we need to issue the barrier in diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 0f10ccd..eee8486 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4121,7 +4121,7 @@ static int ext4_sync_fs(struct super_block *sb, int wait) trace_ext4_sync_fs(sb, wait); flush_workqueue(sbi->dio_unwritten_wq); - if (jbd2_journal_start_commit(sbi->s_journal, &target)) { + if (jbd2_journal_start_commit(sbi->s_journal, &target, true)) { if (wait) jbd2_log_wait_commit(sbi->s_journal, target); } diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c index 6a79fd0..3436d53 100644 --- a/fs/jbd2/checkpoint.c +++ b/fs/jbd2/checkpoint.c @@ -309,7 +309,7 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh, "Waiting for Godot: block %llu\n", journal->j_devname, (unsigned long long) bh->b_blocknr); - jbd2_log_start_commit(journal, tid); + jbd2_log_start_commit(journal, tid, true); jbd2_log_wait_commit(journal, tid); ret = 1; } else if (!buffer_dirty(bh)) { diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index f3ad159..19973eb 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) int tag_bytes = journal_tag_bytes(journal); struct buffer_head *cbh = NULL; /* For transactional checksums */ __u32 crc32_sum = ~0; - int write_op = WRITE_SYNC; + int write_op = WRITE; /* * First job: lock down the current transaction and wait for @@ -368,7 +368,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) * we unplug the device. We don't do explicit unplugging in here, * instead we rely on sync_buffer() doing the unplug for us. */ - if (commit_transaction->t_synchronous_commit) + if (tid_geq(journal->j_commit_waited, commit_transaction->t_tid)) write_op = WRITE_SYNC_PLUG; trace_jbd2_commit_locking(journal, commit_transaction); stats.run.rs_wait = commit_transaction->t_max_wait; diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c index 9e46869..e278fa2 100644 --- a/fs/jbd2/journal.c +++ b/fs/jbd2/journal.c @@ -475,7 +475,7 @@ int __jbd2_log_space_left(journal_t *journal) /* * Called under j_state_lock. Returns true if a transaction commit was started. */ -int __jbd2_log_start_commit(journal_t *journal, tid_t target) +int __jbd2_log_start_commit(journal_t *journal, tid_t target, bool will_wait) { /* * Are we already doing a recent enough commit? @@ -485,7 +485,8 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target) * We want a new commit: OK, mark the request and wakeup the * commit thread. We do _not_ do the commit ourselves. */ - + if (will_wait && !tid_geq(journal->j_commit_waited, target)) + journal->j_commit_waited = target; journal->j_commit_request = target; jbd_debug(1, "JBD: requesting commit %d/%d\n", journal->j_commit_request, @@ -496,12 +497,12 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target) return 0; } -int jbd2_log_start_commit(journal_t *journal, tid_t tid) +int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait) { int ret; write_lock(&journal->j_state_lock); - ret = __jbd2_log_start_commit(journal, tid); + ret = __jbd2_log_start_commit(journal, tid, will_wait); write_unlock(&journal->j_state_lock); return ret; } @@ -524,7 +525,7 @@ int jbd2_journal_force_commit_nested(journal_t *journal) read_lock(&journal->j_state_lock); if (journal->j_running_transaction && !current->journal_info) { transaction = journal->j_running_transaction; - __jbd2_log_start_commit(journal, transaction->t_tid); + __jbd2_log_start_commit(journal, transaction->t_tid, true); } else if (journal->j_committing_transaction) transaction = journal->j_committing_transaction; @@ -544,7 +545,7 @@ int jbd2_journal_force_commit_nested(journal_t *journal) * if a transaction is going to be committed (or is currently already * committing), and fills its tid in at *ptid */ -int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid) +int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid, bool will_wait) { int ret = 0; @@ -552,7 +553,7 @@ int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid) if (journal->j_running_transaction) { tid_t tid = journal->j_running_transaction->t_tid; - __jbd2_log_start_commit(journal, tid); + __jbd2_log_start_commit(journal, tid, will_wait); /* There's a running transaction and we've just made sure * it's commit has been scheduled. */ if (ptid) @@ -1559,7 +1560,7 @@ int jbd2_journal_flush(journal_t *journal) /* Force everything buffered to the log... */ if (journal->j_running_transaction) { transaction = journal->j_running_transaction; - __jbd2_log_start_commit(journal, transaction->t_tid); + __jbd2_log_start_commit(journal, transaction->t_tid, true); } else if (journal->j_committing_transaction) transaction = journal->j_committing_transaction; @@ -1675,7 +1676,7 @@ void __jbd2_journal_abort_hard(journal_t *journal) journal->j_flags |= JBD2_ABORT; transaction = journal->j_running_transaction; if (transaction) - __jbd2_log_start_commit(journal, transaction->t_tid); + __jbd2_log_start_commit(journal, transaction->t_tid, false); write_unlock(&journal->j_state_lock); } diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index 3948932..3d5215e 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -222,7 +222,7 @@ repeat: atomic_sub(nblocks, &transaction->t_outstanding_credits); prepare_to_wait(&journal->j_wait_transaction_locked, &wait, TASK_UNINTERRUPTIBLE); - __jbd2_log_start_commit(journal, transaction->t_tid); + __jbd2_log_start_commit(journal, transaction->t_tid, false); read_unlock(&journal->j_state_lock); schedule(); finish_wait(&journal->j_wait_transaction_locked, &wait); @@ -465,7 +465,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask) spin_unlock(&transaction->t_handle_lock); jbd_debug(2, "restarting handle %p\n", handle); - __jbd2_log_start_commit(journal, transaction->t_tid); + __jbd2_log_start_commit(journal, transaction->t_tid, false); read_unlock(&journal->j_state_lock); lock_map_release(&handle->h_lockdep_map); @@ -1361,8 +1361,6 @@ int jbd2_journal_stop(handle_t *handle) } } - if (handle->h_sync) - transaction->t_synchronous_commit = 1; current->journal_info = NULL; atomic_sub(handle->h_buffer_credits, &transaction->t_outstanding_credits); @@ -1383,15 +1381,16 @@ int jbd2_journal_stop(handle_t *handle) jbd_debug(2, "transaction too old, requesting commit for " "handle %p\n", handle); - /* This is non-blocking */ - jbd2_log_start_commit(journal, transaction->t_tid); - /* * Special case: JBD2_SYNC synchronous updates require us * to wait for the commit to complete. */ if (handle->h_sync && !(current->flags & PF_MEMALLOC)) wait_for_commit = 1; + + /* This is non-blocking */ + jbd2_log_start_commit(journal, transaction->t_tid, + wait_for_commit); } /* diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c index 1fbb0e2..d493f32 100644 --- a/fs/ocfs2/aops.c +++ b/fs/ocfs2/aops.c @@ -1659,7 +1659,7 @@ static int ocfs2_try_to_free_truncate_log(struct ocfs2_super *osb, goto out; } - if (jbd2_journal_start_commit(osb->journal->j_journal, &target)) { + if (jbd2_journal_start_commit(osb->journal->j_journal, &target, true)) { jbd2_log_wait_commit(osb->journal->j_journal, target); ret = 1; } diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index 31c3ffd..aab448c 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -414,7 +414,7 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait) } if (jbd2_journal_start_commit(OCFS2_SB(sb)->journal->j_journal, - &target)) { + &target, wait)) { if (wait) jbd2_log_wait_commit(OCFS2_SB(sb)->journal->j_journal, target); diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index 27e79c2..46aaf45 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -631,11 +631,6 @@ struct transaction_s */ atomic_t t_handle_count; - /* - * This transaction is being forced and some process is - * waiting for it to finish. - */ - unsigned int t_synchronous_commit:1; unsigned int t_flushed_data_blocks:1; /* @@ -900,6 +895,13 @@ struct journal_s tid_t j_commit_request; /* + * Sequence number of the most recent transaction someone is waiting + * for to commit. + * [j_state_lock] + */ + tid_t j_commit_waited; + + /* * Journal uuid: identifies the object (filesystem, LVM volume etc) * backed by this journal. This will eventually be replaced by an array * of uuids, allowing us to index multiple devices within a single @@ -1200,9 +1202,9 @@ extern void jbd2_journal_switch_revoke_table(journal_t *journal); */ int __jbd2_log_space_left(journal_t *); /* Called with journal locked */ -int jbd2_log_start_commit(journal_t *journal, tid_t tid); -int __jbd2_log_start_commit(journal_t *journal, tid_t tid); -int jbd2_journal_start_commit(journal_t *journal, tid_t *tid); +int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait); +int __jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait); +int jbd2_journal_start_commit(journal_t *journal, tid_t *tid, bool will_wait); int jbd2_journal_force_commit_nested(journal_t *journal); int jbd2_log_wait_commit(journal_t *journal, tid_t tid); int jbd2_log_do_checkpoint(journal_t *journal); -- 1.7.1 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-01-19 12:56 ` Jan Kara @ 2011-01-20 7:52 ` Alex,Shi 0 siblings, 0 replies; 38+ messages in thread From: Alex,Shi @ 2011-01-20 7:52 UTC (permalink / raw) To: Jan Kara Cc: Li, Shaohua, tytso@mit.edu, czoccolo@gmail.com, vgoyal@redhat.com, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C On Wed, 2011-01-19 at 20:56 +0800, Jan Kara wrote: > On Wed 19-01-11 10:03:26, Shaohua Li wrote: > > add Jan and Theodore to the loop. > Thanks. > > > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote: > > > Shaohua and I tested kernel building performance on latest kernel. and > > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file > > > system. We find this performance dropping is due to commit > > > 749ef9f8423054e326f. If we revert this patch or just change the > > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be > > > recovered. > > > > > > iostat report show with the commit, read request merge number increased > > > and write request merge dropped. The total request size increased and > > > queue length dropped. So we tested another patch: only change WRITE_SYNC > > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected. > > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue. > > > > > we didn't test deadline IO mode, just test cfq. seems insert write > > > request into sync queue effect much read performance, but we don't know > > > details. What's your comments of this? > Indeed it seems that the optimization of the case where we wait for the > transaction is negatively impacting the performance when we are not. Does > patch below help for your load? It refines the logic when WRITE_SYNC > is needed (of course, we should also test whether the patch works for fsync > heavy loads as well). > The patch is mostly a proof of concept and only lightly tested so be > careful... > I tested the patch after remove t_synchronous_commit lines in include/trace/events/jbd2.h on 2.6.38-rc1 kernel. but did not find clear improvement on kbuild. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-01-19 2:03 ` Shaohua Li 2011-01-19 12:56 ` Jan Kara @ 2011-01-20 15:16 ` Vivek Goyal 2011-01-21 7:17 ` Shaohua Li 2011-01-26 8:15 ` Shaohua Li 1 sibling, 2 replies; 38+ messages in thread From: Vivek Goyal @ 2011-01-20 15:16 UTC (permalink / raw) To: Shaohua Li Cc: Shi, Alex, jack, tytso, czoccolo@gmail.com, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote: > add Jan and Theodore to the loop. > > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote: > > Shaohua and I tested kernel building performance on latest kernel. and > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file > > system. We find this performance dropping is due to commit > > 749ef9f8423054e326f. If we revert this patch or just change the > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be > > recovered. > > > > iostat report show with the commit, read request merge number increased > > and write request merge dropped. The total request size increased and > > queue length dropped. So we tested another patch: only change WRITE_SYNC > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected. > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue. > Yep, it does sound like reduce write merging. But moving journal commits back to WRITE, then fsync performance will drop as there will be idling introduced between fsync thread and journalling thread. So that does not sound like a good idea either. Secondly, in presence of mixed workload (some other sync read happening) WRITES can get less bandwidth and sync workload much more. So by marking journal commits as WRITES you might increase the delay there in completion in presence of other sync workload. So Jan Kara's approach makes sense that if somebody is waiting on commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why did it not work for you. Is it possible to run some traces and do more debugging that figure out what's happening. Thanks Vivek ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-01-20 15:16 ` Vivek Goyal @ 2011-01-21 7:17 ` Shaohua Li 2011-01-26 8:15 ` Shaohua Li 1 sibling, 0 replies; 38+ messages in thread From: Shaohua Li @ 2011-01-21 7:17 UTC (permalink / raw) To: Vivek Goyal Cc: Shi, Alex, jack@suse.cz, tytso@mit.edu, czoccolo@gmail.com, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C On Thu, 2011-01-20 at 23:16 +0800, Vivek Goyal wrote: > On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote: > > add Jan and Theodore to the loop. > > > > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote: > > > Shaohua and I tested kernel building performance on latest kernel. and > > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file > > > system. We find this performance dropping is due to commit > > > 749ef9f8423054e326f. If we revert this patch or just change the > > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be > > > recovered. > > > > > > iostat report show with the commit, read request merge number increased > > > and write request merge dropped. The total request size increased and > > > queue length dropped. So we tested another patch: only change WRITE_SYNC > > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected. > > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue. > > > > Yep, it does sound like reduce write merging. But moving journal commits > back to WRITE, then fsync performance will drop as there will be idling > introduced between fsync thread and journalling thread. So that does > not sound like a good idea either. > > Secondly, in presence of mixed workload (some other sync read happening) > WRITES can get less bandwidth and sync workload much more. So by > marking journal commits as WRITES you might increase the delay there > in completion in presence of other sync workload. > > So Jan Kara's approach makes sense that if somebody is waiting on > commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why > did it not work for you. Is it possible to run some traces and do > more debugging that figure out what's happening. I'll debug and see if I can find anything. Thanks, Shaohua ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-01-20 15:16 ` Vivek Goyal 2011-01-21 7:17 ` Shaohua Li @ 2011-01-26 8:15 ` Shaohua Li 2011-02-12 9:21 ` Alex,Shi 1 sibling, 1 reply; 38+ messages in thread From: Shaohua Li @ 2011-01-26 8:15 UTC (permalink / raw) To: Vivek Goyal Cc: Shi, Alex, jack@suse.cz, tytso@mit.edu, czoccolo@gmail.com, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C On Thu, Jan 20, 2011 at 11:16:56PM +0800, Vivek Goyal wrote: > On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote: > > add Jan and Theodore to the loop. > > > > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote: > > > Shaohua and I tested kernel building performance on latest kernel. and > > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file > > > system. We find this performance dropping is due to commit > > > 749ef9f8423054e326f. If we revert this patch or just change the > > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be > > > recovered. > > > > > > iostat report show with the commit, read request merge number increased > > > and write request merge dropped. The total request size increased and > > > queue length dropped. So we tested another patch: only change WRITE_SYNC > > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected. > > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue. > > > > Yep, it does sound like reduce write merging. But moving journal commits > back to WRITE, then fsync performance will drop as there will be idling > introduced between fsync thread and journalling thread. So that does > not sound like a good idea either. > > Secondly, in presence of mixed workload (some other sync read happening) > WRITES can get less bandwidth and sync workload much more. So by > marking journal commits as WRITES you might increase the delay there > in completion in presence of other sync workload. > > So Jan Kara's approach makes sense that if somebody is waiting on > commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why > did it not work for you. Is it possible to run some traces and do > more debugging that figure out what's happening. Sorry for the long delay. Looks fedora enables ccache by default. While our kbuild test is on ext4 disk but rootfs is on ext3 where ccache cache files live. Jan's patch only covers ext4, maybe this is the reason. I changed jbd to use WRITE for journal_commit_transaction. With the change and Jan's patch, the test seems fine. Jan, can you send a patch with similar change for ext3? So we can do more tests. Thanks, Shaohua ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-01-26 8:15 ` Shaohua Li @ 2011-02-12 9:21 ` Alex,Shi 2011-02-12 18:25 ` Corrado Zoccolo 0 siblings, 1 reply; 38+ messages in thread From: Alex,Shi @ 2011-02-12 9:21 UTC (permalink / raw) To: Li, Shaohua Cc: Vivek Goyal, jack@suse.cz, tytso@mit.edu, czoccolo@gmail.com, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C On Wed, 2011-01-26 at 16:15 +0800, Li, Shaohua wrote: > On Thu, Jan 20, 2011 at 11:16:56PM +0800, Vivek Goyal wrote: > > On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote: > > > add Jan and Theodore to the loop. > > > > > > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote: > > > > Shaohua and I tested kernel building performance on latest kernel. and > > > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file > > > > system. We find this performance dropping is due to commit > > > > 749ef9f8423054e326f. If we revert this patch or just change the > > > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be > > > > recovered. > > > > > > > > iostat report show with the commit, read request merge number increased > > > > and write request merge dropped. The total request size increased and > > > > queue length dropped. So we tested another patch: only change WRITE_SYNC > > > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected. > > > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue. > > > > > > > Yep, it does sound like reduce write merging. But moving journal commits > > back to WRITE, then fsync performance will drop as there will be idling > > introduced between fsync thread and journalling thread. So that does > > not sound like a good idea either. > > > > Secondly, in presence of mixed workload (some other sync read happening) > > WRITES can get less bandwidth and sync workload much more. So by > > marking journal commits as WRITES you might increase the delay there > > in completion in presence of other sync workload. > > > > So Jan Kara's approach makes sense that if somebody is waiting on > > commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why > > did it not work for you. Is it possible to run some traces and do > > more debugging that figure out what's happening. > Sorry for the long delay. > > Looks fedora enables ccache by default. While our kbuild test is on ext4 disk > but rootfs is on ext3 where ccache cache files live. Jan's patch only covers > ext4, maybe this is the reason. > I changed jbd to use WRITE for journal_commit_transaction. With the change and > Jan's patch, the test seems fine. Let me clarify the bug situation again. With the following scenarios, the regression is clear. 1, ccache_dir setup at rootfs that format is ext3 on /dev/sda1; 2, kbuild on /dev/sdb1 with ext4. but if we disable the ccache, only do kbuild on sdb1 with ext4. There is no regressions whenever with or without Jan's patch. So, problem focus on the ccache scenario, (from fedora 11, ccache is default setting). If we compare the vmstat output with or without ccache, there is too many write when ccache enabled. According the result, it should to do some tunning on ext3 fs. vmstat average output per 10 seconds, without ccache procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 26.8 0.5 0.0 63930192.3 9677.0 96544.9 0.0 0.0 2486.9 337.9 17729.9 4496.4 17.5 2.5 79.8 0.2 0.0 vmstat average output per 10 seconds, with ccache procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 2.4 40.7 0.0 64316231.0 17260.6 119533.8 0.0 0.0 2477.6 1493.1 8606.4 3565.2 2.5 1.1 83.0 13.5 0.0 > > Jan, > can you send a patch with similar change for ext3? So we can do more tests. > > Thanks, > Shaohua ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-02-12 9:21 ` Alex,Shi @ 2011-02-12 18:25 ` Corrado Zoccolo 2011-02-14 2:25 ` Alex,Shi 0 siblings, 1 reply; 38+ messages in thread From: Corrado Zoccolo @ 2011-02-12 18:25 UTC (permalink / raw) To: Alex,Shi Cc: Li, Shaohua, Vivek Goyal, jack@suse.cz, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C On Sat, Feb 12, 2011 at 10:21 AM, Alex,Shi <alex.shi@intel.com> wrote: > On Wed, 2011-01-26 at 16:15 +0800, Li, Shaohua wrote: >> On Thu, Jan 20, 2011 at 11:16:56PM +0800, Vivek Goyal wrote: >> > On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote: >> > > add Jan and Theodore to the loop. >> > > >> > > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote: >> > > > Shaohua and I tested kernel building performance on latest kernel. and >> > > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file >> > > > system. We find this performance dropping is due to commit >> > > > 749ef9f8423054e326f. If we revert this patch or just change the >> > > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be >> > > > recovered. >> > > > >> > > > iostat report show with the commit, read request merge number increased >> > > > and write request merge dropped. The total request size increased and >> > > > queue length dropped. So we tested another patch: only change WRITE_SYNC >> > > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected. >> > > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue. >> > > >> > >> > Yep, it does sound like reduce write merging. But moving journal commits >> > back to WRITE, then fsync performance will drop as there will be idling >> > introduced between fsync thread and journalling thread. So that does >> > not sound like a good idea either. >> > >> > Secondly, in presence of mixed workload (some other sync read happening) >> > WRITES can get less bandwidth and sync workload much more. So by >> > marking journal commits as WRITES you might increase the delay there >> > in completion in presence of other sync workload. >> > >> > So Jan Kara's approach makes sense that if somebody is waiting on >> > commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why >> > did it not work for you. Is it possible to run some traces and do >> > more debugging that figure out what's happening. >> Sorry for the long delay. >> >> Looks fedora enables ccache by default. While our kbuild test is on ext4 disk >> but rootfs is on ext3 where ccache cache files live. Jan's patch only covers >> ext4, maybe this is the reason. >> I changed jbd to use WRITE for journal_commit_transaction. With the change and >> Jan's patch, the test seems fine. > Let me clarify the bug situation again. > With the following scenarios, the regression is clear. > 1, ccache_dir setup at rootfs that format is ext3 on /dev/sda1; 2, > kbuild on /dev/sdb1 with ext4. > but if we disable the ccache, only do kbuild on sdb1 with ext4. There is > no regressions whenever with or without Jan's patch. > So, problem focus on the ccache scenario, (from fedora 11, ccache is > default setting). > > If we compare the vmstat output with or without ccache, there is too > many write when ccache enabled. According the result, it should to do > some tunning on ext3 fs. Is ext3 configured with data ordered or writeback? I think ccache might be performing fsyncs, and this is a bad workload for ext3, especially in ordered mode. It might be that my patch introduced a regression in ext3 fsync performance, but I don't understand how reverting only the change in jbd2 (that is the ext4 specific journaling daemon) could restore it. The two partitions are on different disks, so each one should be isolated from the I/O perspective (do they share a single controller?). The only interaction I see happens at the VM level, since changing performance of any of the two changes the rate at which pages can be cleaned. Corrado > > > vmstat average output per 10 seconds, without ccache > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- > r b swpd free buff cache si so bi bo in cs us sy id wa st > 26.8 0.5 0.0 63930192.3 9677.0 96544.9 0.0 0.0 2486.9 337.9 17729.9 4496.4 17.5 2.5 79.8 0.2 0.0 > > vmstat average output per 10 seconds, with ccache > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- > r b swpd free buff cache si so bi bo in cs us sy id wa st > 2.4 40.7 0.0 64316231.0 17260.6 119533.8 0.0 0.0 2477.6 1493.1 8606.4 3565.2 2.5 1.1 83.0 13.5 0.0 > > >> >> Jan, >> can you send a patch with similar change for ext3? So we can do more tests. >> >> Thanks, >> Shaohua > > > > -- __________________________________________________________________________ dott. Corrado Zoccolo mailto:czoccolo@gmail.com PhD - Department of Computer Science - University of Pisa, Italy -------------------------------------------------------------------------- The self-confidence of a warrior is not the self-confidence of the average man. The average man seeks certainty in the eyes of the onlooker and calls that self-confidence. The warrior seeks impeccability in his own eyes and calls that humbleness. Tales of Power - C. Castaneda ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-02-12 18:25 ` Corrado Zoccolo @ 2011-02-14 2:25 ` Alex,Shi 2011-02-15 1:10 ` Shaohua Li 0 siblings, 1 reply; 38+ messages in thread From: Alex,Shi @ 2011-02-14 2:25 UTC (permalink / raw) To: Corrado Zoccolo Cc: Li, Shaohua, Vivek Goyal, jack@suse.cz, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C On Sun, 2011-02-13 at 02:25 +0800, Corrado Zoccolo wrote: > On Sat, Feb 12, 2011 at 10:21 AM, Alex,Shi <alex.shi@intel.com> wrote: > > On Wed, 2011-01-26 at 16:15 +0800, Li, Shaohua wrote: > >> On Thu, Jan 20, 2011 at 11:16:56PM +0800, Vivek Goyal wrote: > >> > On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote: > >> > > add Jan and Theodore to the loop. > >> > > > >> > > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote: > >> > > > Shaohua and I tested kernel building performance on latest kernel. and > >> > > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file > >> > > > system. We find this performance dropping is due to commit > >> > > > 749ef9f8423054e326f. If we revert this patch or just change the > >> > > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be > >> > > > recovered. > >> > > > > >> > > > iostat report show with the commit, read request merge number increased > >> > > > and write request merge dropped. The total request size increased and > >> > > > queue length dropped. So we tested another patch: only change WRITE_SYNC > >> > > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected. > >> > > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue. > >> > > > >> > > >> > Yep, it does sound like reduce write merging. But moving journal commits > >> > back to WRITE, then fsync performance will drop as there will be idling > >> > introduced between fsync thread and journalling thread. So that does > >> > not sound like a good idea either. > >> > > >> > Secondly, in presence of mixed workload (some other sync read happening) > >> > WRITES can get less bandwidth and sync workload much more. So by > >> > marking journal commits as WRITES you might increase the delay there > >> > in completion in presence of other sync workload. > >> > > >> > So Jan Kara's approach makes sense that if somebody is waiting on > >> > commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why > >> > did it not work for you. Is it possible to run some traces and do > >> > more debugging that figure out what's happening. > >> Sorry for the long delay. > >> > >> Looks fedora enables ccache by default. While our kbuild test is on ext4 disk > >> but rootfs is on ext3 where ccache cache files live. Jan's patch only covers > >> ext4, maybe this is the reason. > >> I changed jbd to use WRITE for journal_commit_transaction. With the change and > >> Jan's patch, the test seems fine. > > Let me clarify the bug situation again. > > With the following scenarios, the regression is clear. > > 1, ccache_dir setup at rootfs that format is ext3 on /dev/sda1; 2, > > kbuild on /dev/sdb1 with ext4. > > but if we disable the ccache, only do kbuild on sdb1 with ext4. There is > > no regressions whenever with or without Jan's patch. > > So, problem focus on the ccache scenario, (from fedora 11, ccache is > > default setting). > > > > If we compare the vmstat output with or without ccache, there is too > > many write when ccache enabled. According the result, it should to do > > some tunning on ext3 fs. > Is ext3 configured with data ordered or writeback? The ext3 on sda and ext4 on sdb are both used 'ordered' mounting mode. > I think ccache might be performing fsyncs, and this is a bad workload > for ext3, especially in ordered mode. > It might be that my patch introduced a regression in ext3 fsync > performance, but I don't understand how reverting only the change in > jbd2 (that is the ext4 specific journaling daemon) could restore it. > The two partitions are on different disks, so each one should be > isolated from the I/O perspective (do they share a single > controller?). No, sda/sdb use separated controller. > The only interaction I see happens at the VM level, > since changing performance of any of the two changes the rate at which > pages can be cleaned. > > Corrado > > > > > > vmstat average output per 10 seconds, without ccache > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- > > r b swpd free buff cache si so bi bo in cs us sy id wa st > > 26.8 0.5 0.0 63930192.3 9677.0 96544.9 0.0 0.0 2486.9 337.9 17729.9 4496.4 17.5 2.5 79.8 0.2 0.0 > > > > vmstat average output per 10 seconds, with ccache > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- > > r b swpd free buff cache si so bi bo in cs us sy id wa st > > 2.4 40.7 0.0 64316231.0 17260.6 119533.8 0.0 0.0 2477.6 1493.1 8606.4 3565.2 2.5 1.1 83.0 13.5 0.0 > > > > > >> > >> Jan, > >> can you send a patch with similar change for ext3? So we can do more tests. > >> > >> Thanks, > >> Shaohua > > > > > > > > > > > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-02-14 2:25 ` Alex,Shi @ 2011-02-15 1:10 ` Shaohua Li 2011-02-21 16:49 ` Jan Kara 0 siblings, 1 reply; 38+ messages in thread From: Shaohua Li @ 2011-02-15 1:10 UTC (permalink / raw) To: Shi, Alex, Jan Kara Cc: Corrado Zoccolo, Vivek Goyal, jack@suse.cz, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C On Mon, 2011-02-14 at 10:25 +0800, Shi, Alex wrote: > On Sun, 2011-02-13 at 02:25 +0800, Corrado Zoccolo wrote: > > On Sat, Feb 12, 2011 at 10:21 AM, Alex,Shi <alex.shi@intel.com> wrote: > > > On Wed, 2011-01-26 at 16:15 +0800, Li, Shaohua wrote: > > >> On Thu, Jan 20, 2011 at 11:16:56PM +0800, Vivek Goyal wrote: > > >> > On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote: > > >> > > add Jan and Theodore to the loop. > > >> > > > > >> > > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote: > > >> > > > Shaohua and I tested kernel building performance on latest kernel. and > > >> > > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file > > >> > > > system. We find this performance dropping is due to commit > > >> > > > 749ef9f8423054e326f. If we revert this patch or just change the > > >> > > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be > > >> > > > recovered. > > >> > > > > > >> > > > iostat report show with the commit, read request merge number increased > > >> > > > and write request merge dropped. The total request size increased and > > >> > > > queue length dropped. So we tested another patch: only change WRITE_SYNC > > >> > > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected. > > >> > > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue. > > >> > > > > >> > > > >> > Yep, it does sound like reduce write merging. But moving journal commits > > >> > back to WRITE, then fsync performance will drop as there will be idling > > >> > introduced between fsync thread and journalling thread. So that does > > >> > not sound like a good idea either. > > >> > > > >> > Secondly, in presence of mixed workload (some other sync read happening) > > >> > WRITES can get less bandwidth and sync workload much more. So by > > >> > marking journal commits as WRITES you might increase the delay there > > >> > in completion in presence of other sync workload. > > >> > > > >> > So Jan Kara's approach makes sense that if somebody is waiting on > > >> > commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why > > >> > did it not work for you. Is it possible to run some traces and do > > >> > more debugging that figure out what's happening. > > >> Sorry for the long delay. > > >> > > >> Looks fedora enables ccache by default. While our kbuild test is on ext4 disk > > >> but rootfs is on ext3 where ccache cache files live. Jan's patch only covers > > >> ext4, maybe this is the reason. > > >> I changed jbd to use WRITE for journal_commit_transaction. With the change and > > >> Jan's patch, the test seems fine. > > > Let me clarify the bug situation again. > > > With the following scenarios, the regression is clear. > > > 1, ccache_dir setup at rootfs that format is ext3 on /dev/sda1; 2, > > > kbuild on /dev/sdb1 with ext4. > > > but if we disable the ccache, only do kbuild on sdb1 with ext4. There is > > > no regressions whenever with or without Jan's patch. > > > So, problem focus on the ccache scenario, (from fedora 11, ccache is > > > default setting). > > > > > > If we compare the vmstat output with or without ccache, there is too > > > many write when ccache enabled. According the result, it should to do > > > some tunning on ext3 fs. > > Is ext3 configured with data ordered or writeback? > > The ext3 on sda and ext4 on sdb are both used 'ordered' mounting mode. > > > I think ccache might be performing fsyncs, and this is a bad workload > > for ext3, especially in ordered mode. > > It might be that my patch introduced a regression in ext3 fsync > > performance, but I don't understand how reverting only the change in > > jbd2 (that is the ext4 specific journaling daemon) could restore it. > > The two partitions are on different disks, so each one should be > > isolated from the I/O perspective (do they share a single > > controller?). > > No, sda/sdb use separated controller. > > > The only interaction I see happens at the VM level, > > since changing performance of any of the two changes the rate at which > > pages can be cleaned. > > > > Corrado > > > > > > > > > vmstat average output per 10 seconds, without ccache > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- > > > r b swpd free buff cache si so bi bo in cs us sy id wa st > > > 26.8 0.5 0.0 63930192.3 9677.0 96544.9 0.0 0.0 2486.9 337.9 17729.9 4496.4 17.5 2.5 79.8 0.2 0.0 > > > > > > vmstat average output per 10 seconds, with ccache > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- > > > r b swpd free buff cache si so bi bo in cs us sy id wa st > > > 2.4 40.7 0.0 64316231.0 17260.6 119533.8 0.0 0.0 2477.6 1493.1 8606.4 3565.2 2.5 1.1 83.0 13.5 0.0 > > > > > > > > >> > > >> Jan, > > >> can you send a patch with similar change for ext3? So we can do more tests. Hi Jan, can you send a patch with both ext3 and ext4 changes? Our test shows your patch has positive effect, but need confirm with the ext3 change. Thanks, Shaohua ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-02-15 1:10 ` Shaohua Li @ 2011-02-21 16:49 ` Jan Kara 2011-02-23 8:24 ` Alex,Shi 0 siblings, 1 reply; 38+ messages in thread From: Jan Kara @ 2011-02-21 16:49 UTC (permalink / raw) To: Shaohua Li Cc: Shi, Alex, Jan Kara, Corrado Zoccolo, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C [-- Attachment #1: Type: text/plain, Size: 5448 bytes --] On Tue 15-02-11 09:10:01, Shaohua Li wrote: > On Mon, 2011-02-14 at 10:25 +0800, Shi, Alex wrote: > > On Sun, 2011-02-13 at 02:25 +0800, Corrado Zoccolo wrote: > > > On Sat, Feb 12, 2011 at 10:21 AM, Alex,Shi <alex.shi@intel.com> wrote: > > > > On Wed, 2011-01-26 at 16:15 +0800, Li, Shaohua wrote: > > > >> On Thu, Jan 20, 2011 at 11:16:56PM +0800, Vivek Goyal wrote: > > > >> > On Wed, Jan 19, 2011 at 10:03:26AM +0800, Shaohua Li wrote: > > > >> > > add Jan and Theodore to the loop. > > > >> > > > > > >> > > On Wed, 2011-01-19 at 09:55 +0800, Shi, Alex wrote: > > > >> > > > Shaohua and I tested kernel building performance on latest kernel. and > > > >> > > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file > > > >> > > > system. We find this performance dropping is due to commit > > > >> > > > 749ef9f8423054e326f. If we revert this patch or just change the > > > >> > > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be > > > >> > > > recovered. > > > >> > > > > > > >> > > > iostat report show with the commit, read request merge number increased > > > >> > > > and write request merge dropped. The total request size increased and > > > >> > > > queue length dropped. So we tested another patch: only change WRITE_SYNC > > > >> > > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected. > > > >> > > since WRITE_SYNC_PLUG doesn't work, this isn't a simple no-write-merge issue. > > > >> > > > > > >> > > > > >> > Yep, it does sound like reduce write merging. But moving journal commits > > > >> > back to WRITE, then fsync performance will drop as there will be idling > > > >> > introduced between fsync thread and journalling thread. So that does > > > >> > not sound like a good idea either. > > > >> > > > > >> > Secondly, in presence of mixed workload (some other sync read happening) > > > >> > WRITES can get less bandwidth and sync workload much more. So by > > > >> > marking journal commits as WRITES you might increase the delay there > > > >> > in completion in presence of other sync workload. > > > >> > > > > >> > So Jan Kara's approach makes sense that if somebody is waiting on > > > >> > commit then make it WRITE_SYNC otherwise make it WRITE. Not sure why > > > >> > did it not work for you. Is it possible to run some traces and do > > > >> > more debugging that figure out what's happening. > > > >> Sorry for the long delay. > > > >> > > > >> Looks fedora enables ccache by default. While our kbuild test is on ext4 disk > > > >> but rootfs is on ext3 where ccache cache files live. Jan's patch only covers > > > >> ext4, maybe this is the reason. > > > >> I changed jbd to use WRITE for journal_commit_transaction. With the change and > > > >> Jan's patch, the test seems fine. > > > > Let me clarify the bug situation again. > > > > With the following scenarios, the regression is clear. > > > > 1, ccache_dir setup at rootfs that format is ext3 on /dev/sda1; 2, > > > > kbuild on /dev/sdb1 with ext4. > > > > but if we disable the ccache, only do kbuild on sdb1 with ext4. There is > > > > no regressions whenever with or without Jan's patch. > > > > So, problem focus on the ccache scenario, (from fedora 11, ccache is > > > > default setting). > > > > > > > > If we compare the vmstat output with or without ccache, there is too > > > > many write when ccache enabled. According the result, it should to do > > > > some tunning on ext3 fs. > > > Is ext3 configured with data ordered or writeback? > > > > The ext3 on sda and ext4 on sdb are both used 'ordered' mounting mode. > > > > > I think ccache might be performing fsyncs, and this is a bad workload > > > for ext3, especially in ordered mode. > > > It might be that my patch introduced a regression in ext3 fsync > > > performance, but I don't understand how reverting only the change in > > > jbd2 (that is the ext4 specific journaling daemon) could restore it. > > > The two partitions are on different disks, so each one should be > > > isolated from the I/O perspective (do they share a single > > > controller?). > > > > No, sda/sdb use separated controller. > > > > > The only interaction I see happens at the VM level, > > > since changing performance of any of the two changes the rate at which > > > pages can be cleaned. > > > > > > Corrado > > > > > > > > > > > > vmstat average output per 10 seconds, without ccache > > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- > > > > r b swpd free buff cache si so bi bo in cs us sy id wa st > > > > 26.8 0.5 0.0 63930192.3 9677.0 96544.9 0.0 0.0 2486.9 337.9 17729.9 4496.4 17.5 2.5 79.8 0.2 0.0 > > > > > > > > vmstat average output per 10 seconds, with ccache > > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- > > > > r b swpd free buff cache si so bi bo in cs us sy id wa st > > > > 2.4 40.7 0.0 64316231.0 17260.6 119533.8 0.0 0.0 2477.6 1493.1 8606.4 3565.2 2.5 1.1 83.0 13.5 0.0 > > > > > > > > > > > >> > > > >> Jan, > > > >> can you send a patch with similar change for ext3? So we can do more tests. > Hi Jan, > can you send a patch with both ext3 and ext4 changes? Our test shows > your patch has positive effect, but need confirm with the ext3 change. Sure. Patches for both ext3 & ext4 are attached. Sorry, it took me a while to get to this. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR [-- Attachment #2: 0001-jbd2-Refine-commit-writeout-logic.patch --] [-- Type: text/x-patch, Size: 11044 bytes --] >From 5674f84e70db8274d5aac56a41439ea3b3bcb46e Mon Sep 17 00:00:00 2001 From: Jan Kara <jack@suse.cz> Date: Wed, 19 Jan 2011 13:45:04 +0100 Subject: [PATCH 1/2] jbd2: Refine commit writeout logic Currently we write out all journal buffers in WRITE_SYNC mode. This improves performance for fsync heavy workloads but hinders performance when writes are mostly asynchronous. So add possibility for callers starting a transaction commit to specify whether they are going to wait for the commit and submit journal writes in WRITE_SYNC mode only in that case. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/fsync.c | 2 +- fs/ext4/super.c | 2 +- fs/jbd2/checkpoint.c | 2 +- fs/jbd2/commit.c | 4 ++-- fs/jbd2/journal.c | 19 ++++++++++--------- fs/jbd2/transaction.c | 13 ++++++------- fs/ocfs2/aops.c | 2 +- fs/ocfs2/super.c | 2 +- include/linux/jbd2.h | 18 ++++++++++-------- 9 files changed, 33 insertions(+), 31 deletions(-) diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 7829b28..19434da 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -198,7 +198,7 @@ int ext4_sync_file(struct file *file, int datasync) return ext4_force_commit(inode->i_sb); commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid; - if (jbd2_log_start_commit(journal, commit_tid)) { + if (jbd2_log_start_commit(journal, commit_tid, true)) { /* * When the journal is on a different device than the * fs data disk, we need to issue the barrier in diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 48ce561..0aeb877 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4113,7 +4113,7 @@ static int ext4_sync_fs(struct super_block *sb, int wait) trace_ext4_sync_fs(sb, wait); flush_workqueue(sbi->dio_unwritten_wq); - if (jbd2_journal_start_commit(sbi->s_journal, &target)) { + if (jbd2_journal_start_commit(sbi->s_journal, &target, true)) { if (wait) jbd2_log_wait_commit(sbi->s_journal, target); } diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c index 6a79fd0..3436d53 100644 --- a/fs/jbd2/checkpoint.c +++ b/fs/jbd2/checkpoint.c @@ -309,7 +309,7 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh, "Waiting for Godot: block %llu\n", journal->j_devname, (unsigned long long) bh->b_blocknr); - jbd2_log_start_commit(journal, tid); + jbd2_log_start_commit(journal, tid, true); jbd2_log_wait_commit(journal, tid); ret = 1; } else if (!buffer_dirty(bh)) { diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index f3ad159..19973eb 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) int tag_bytes = journal_tag_bytes(journal); struct buffer_head *cbh = NULL; /* For transactional checksums */ __u32 crc32_sum = ~0; - int write_op = WRITE_SYNC; + int write_op = WRITE; /* * First job: lock down the current transaction and wait for @@ -368,7 +368,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) * we unplug the device. We don't do explicit unplugging in here, * instead we rely on sync_buffer() doing the unplug for us. */ - if (commit_transaction->t_synchronous_commit) + if (tid_geq(journal->j_commit_waited, commit_transaction->t_tid)) write_op = WRITE_SYNC_PLUG; trace_jbd2_commit_locking(journal, commit_transaction); stats.run.rs_wait = commit_transaction->t_max_wait; diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c index 9e46869..e278fa2 100644 --- a/fs/jbd2/journal.c +++ b/fs/jbd2/journal.c @@ -475,7 +475,7 @@ int __jbd2_log_space_left(journal_t *journal) /* * Called under j_state_lock. Returns true if a transaction commit was started. */ -int __jbd2_log_start_commit(journal_t *journal, tid_t target) +int __jbd2_log_start_commit(journal_t *journal, tid_t target, bool will_wait) { /* * Are we already doing a recent enough commit? @@ -485,7 +485,8 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target) * We want a new commit: OK, mark the request and wakeup the * commit thread. We do _not_ do the commit ourselves. */ - + if (will_wait && !tid_geq(journal->j_commit_waited, target)) + journal->j_commit_waited = target; journal->j_commit_request = target; jbd_debug(1, "JBD: requesting commit %d/%d\n", journal->j_commit_request, @@ -496,12 +497,12 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target) return 0; } -int jbd2_log_start_commit(journal_t *journal, tid_t tid) +int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait) { int ret; write_lock(&journal->j_state_lock); - ret = __jbd2_log_start_commit(journal, tid); + ret = __jbd2_log_start_commit(journal, tid, will_wait); write_unlock(&journal->j_state_lock); return ret; } @@ -524,7 +525,7 @@ int jbd2_journal_force_commit_nested(journal_t *journal) read_lock(&journal->j_state_lock); if (journal->j_running_transaction && !current->journal_info) { transaction = journal->j_running_transaction; - __jbd2_log_start_commit(journal, transaction->t_tid); + __jbd2_log_start_commit(journal, transaction->t_tid, true); } else if (journal->j_committing_transaction) transaction = journal->j_committing_transaction; @@ -544,7 +545,7 @@ int jbd2_journal_force_commit_nested(journal_t *journal) * if a transaction is going to be committed (or is currently already * committing), and fills its tid in at *ptid */ -int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid) +int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid, bool will_wait) { int ret = 0; @@ -552,7 +553,7 @@ int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid) if (journal->j_running_transaction) { tid_t tid = journal->j_running_transaction->t_tid; - __jbd2_log_start_commit(journal, tid); + __jbd2_log_start_commit(journal, tid, will_wait); /* There's a running transaction and we've just made sure * it's commit has been scheduled. */ if (ptid) @@ -1559,7 +1560,7 @@ int jbd2_journal_flush(journal_t *journal) /* Force everything buffered to the log... */ if (journal->j_running_transaction) { transaction = journal->j_running_transaction; - __jbd2_log_start_commit(journal, transaction->t_tid); + __jbd2_log_start_commit(journal, transaction->t_tid, true); } else if (journal->j_committing_transaction) transaction = journal->j_committing_transaction; @@ -1675,7 +1676,7 @@ void __jbd2_journal_abort_hard(journal_t *journal) journal->j_flags |= JBD2_ABORT; transaction = journal->j_running_transaction; if (transaction) - __jbd2_log_start_commit(journal, transaction->t_tid); + __jbd2_log_start_commit(journal, transaction->t_tid, false); write_unlock(&journal->j_state_lock); } diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index faad2bd..c48e6e8 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -222,7 +222,7 @@ repeat: atomic_sub(nblocks, &transaction->t_outstanding_credits); prepare_to_wait(&journal->j_wait_transaction_locked, &wait, TASK_UNINTERRUPTIBLE); - __jbd2_log_start_commit(journal, transaction->t_tid); + __jbd2_log_start_commit(journal, transaction->t_tid, false); read_unlock(&journal->j_state_lock); schedule(); finish_wait(&journal->j_wait_transaction_locked, &wait); @@ -465,7 +465,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask) spin_unlock(&transaction->t_handle_lock); jbd_debug(2, "restarting handle %p\n", handle); - __jbd2_log_start_commit(journal, transaction->t_tid); + __jbd2_log_start_commit(journal, transaction->t_tid, false); read_unlock(&journal->j_state_lock); lock_map_release(&handle->h_lockdep_map); @@ -1361,8 +1361,6 @@ int jbd2_journal_stop(handle_t *handle) } } - if (handle->h_sync) - transaction->t_synchronous_commit = 1; current->journal_info = NULL; atomic_sub(handle->h_buffer_credits, &transaction->t_outstanding_credits); @@ -1383,15 +1381,16 @@ int jbd2_journal_stop(handle_t *handle) jbd_debug(2, "transaction too old, requesting commit for " "handle %p\n", handle); - /* This is non-blocking */ - jbd2_log_start_commit(journal, transaction->t_tid); - /* * Special case: JBD2_SYNC synchronous updates require us * to wait for the commit to complete. */ if (handle->h_sync && !(current->flags & PF_MEMALLOC)) wait_for_commit = 1; + + /* This is non-blocking */ + jbd2_log_start_commit(journal, transaction->t_tid, + wait_for_commit); } /* diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c index 1fbb0e2..d493f32 100644 --- a/fs/ocfs2/aops.c +++ b/fs/ocfs2/aops.c @@ -1659,7 +1659,7 @@ static int ocfs2_try_to_free_truncate_log(struct ocfs2_super *osb, goto out; } - if (jbd2_journal_start_commit(osb->journal->j_journal, &target)) { + if (jbd2_journal_start_commit(osb->journal->j_journal, &target, true)) { jbd2_log_wait_commit(osb->journal->j_journal, target); ret = 1; } diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index 38f986d..45d9f82 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -414,7 +414,7 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait) } if (jbd2_journal_start_commit(OCFS2_SB(sb)->journal->j_journal, - &target)) { + &target, wait)) { if (wait) jbd2_log_wait_commit(OCFS2_SB(sb)->journal->j_journal, target); diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index 27e79c2..46aaf45 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -631,11 +631,6 @@ struct transaction_s */ atomic_t t_handle_count; - /* - * This transaction is being forced and some process is - * waiting for it to finish. - */ - unsigned int t_synchronous_commit:1; unsigned int t_flushed_data_blocks:1; /* @@ -900,6 +895,13 @@ struct journal_s tid_t j_commit_request; /* + * Sequence number of the most recent transaction someone is waiting + * for to commit. + * [j_state_lock] + */ + tid_t j_commit_waited; + + /* * Journal uuid: identifies the object (filesystem, LVM volume etc) * backed by this journal. This will eventually be replaced by an array * of uuids, allowing us to index multiple devices within a single @@ -1200,9 +1202,9 @@ extern void jbd2_journal_switch_revoke_table(journal_t *journal); */ int __jbd2_log_space_left(journal_t *); /* Called with journal locked */ -int jbd2_log_start_commit(journal_t *journal, tid_t tid); -int __jbd2_log_start_commit(journal_t *journal, tid_t tid); -int jbd2_journal_start_commit(journal_t *journal, tid_t *tid); +int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait); +int __jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait); +int jbd2_journal_start_commit(journal_t *journal, tid_t *tid, bool will_wait); int jbd2_journal_force_commit_nested(journal_t *journal); int jbd2_log_wait_commit(journal_t *journal, tid_t tid); int jbd2_log_do_checkpoint(journal_t *journal); -- 1.7.1 [-- Attachment #3: 0002-jbd-Refine-commit-writeout-logic.patch --] [-- Type: text/x-patch, Size: 9843 bytes --] >From ff2be0c11564d1253d9420ee0f805f2ba8f62e9f Mon Sep 17 00:00:00 2001 From: Jan Kara <jack@suse.cz> Date: Mon, 21 Feb 2011 17:25:37 +0100 Subject: [PATCH 2/2] jbd: Refine commit writeout logic Currently we write out all journal buffers in WRITE_SYNC mode. This improves performance for fsync heavy workloads but hinders performance when writes are mostly asynchronous. So add possibility for callers starting a transaction commit to specify whether they are going to wait for the commit and submit journal writes in WRITE_SYNC mode only in that case. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext3/fsync.c | 2 +- fs/ext3/super.c | 2 +- fs/jbd/checkpoint.c | 2 +- fs/jbd/commit.c | 4 ++-- fs/jbd/journal.c | 19 ++++++++++--------- fs/jbd/transaction.c | 9 ++++----- include/linux/jbd.h | 21 ++++++++++++--------- 7 files changed, 31 insertions(+), 28 deletions(-) diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c index 09b13bb..5396dd6 100644 --- a/fs/ext3/fsync.c +++ b/fs/ext3/fsync.c @@ -81,7 +81,7 @@ int ext3_sync_file(struct file *file, int datasync) if (test_opt(inode->i_sb, BARRIER) && !journal_trans_will_send_data_barrier(journal, commit_tid)) needs_barrier = 1; - log_start_commit(journal, commit_tid); + log_start_commit(journal, commit_tid, true); ret = log_wait_commit(journal, commit_tid); /* diff --git a/fs/ext3/super.c b/fs/ext3/super.c index 85c8cc8..58a4424 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -2497,7 +2497,7 @@ static int ext3_sync_fs(struct super_block *sb, int wait) { tid_t target; - if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) { + if (journal_start_commit(EXT3_SB(sb)->s_journal, &target, true)) { if (wait) log_wait_commit(EXT3_SB(sb)->s_journal, target); } diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c index e4b87bc..26ca2c3 100644 --- a/fs/jbd/checkpoint.c +++ b/fs/jbd/checkpoint.c @@ -297,7 +297,7 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh, spin_unlock(&journal->j_list_lock); jbd_unlock_bh_state(bh); - log_start_commit(journal, tid); + log_start_commit(journal, tid, true); log_wait_commit(journal, tid); ret = 1; } else if (!buffer_dirty(bh)) { diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c index 34a4861..ac188cb 100644 --- a/fs/jbd/commit.c +++ b/fs/jbd/commit.c @@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal) int first_tag = 0; int tag_flag; int i; - int write_op = WRITE_SYNC; + int write_op = WRITE; /* * First job: lock down the current transaction and wait for @@ -332,7 +332,7 @@ void journal_commit_transaction(journal_t *journal) * we unplug the device. We don't do explicit unplugging in here, * instead we rely on sync_buffer() doing the unplug for us. */ - if (commit_transaction->t_synchronous_commit) + if (tid_geq(journal->j_commit_waited, commit_transaction->t_tid)) write_op = WRITE_SYNC_PLUG; spin_lock(&commit_transaction->t_handle_lock); while (commit_transaction->t_updates) { diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c index da1b5e4..5743b4c 100644 --- a/fs/jbd/journal.c +++ b/fs/jbd/journal.c @@ -434,7 +434,7 @@ int __log_space_left(journal_t *journal) /* * Called under j_state_lock. Returns true if a transaction commit was started. */ -int __log_start_commit(journal_t *journal, tid_t target) +int __log_start_commit(journal_t *journal, tid_t target, bool will_wait) { /* * Are we already doing a recent enough commit? @@ -444,7 +444,8 @@ int __log_start_commit(journal_t *journal, tid_t target) * We want a new commit: OK, mark the request and wakeup the * commit thread. We do _not_ do the commit ourselves. */ - + if (will_wait && !tid_geq(journal->j_commit_waited, target)) + journal->j_commit_waited = target; journal->j_commit_request = target; jbd_debug(1, "JBD: requesting commit %d/%d\n", journal->j_commit_request, @@ -455,12 +456,12 @@ int __log_start_commit(journal_t *journal, tid_t target) return 0; } -int log_start_commit(journal_t *journal, tid_t tid) +int log_start_commit(journal_t *journal, tid_t tid, bool will_wait) { int ret; spin_lock(&journal->j_state_lock); - ret = __log_start_commit(journal, tid); + ret = __log_start_commit(journal, tid, will_wait); spin_unlock(&journal->j_state_lock); return ret; } @@ -483,7 +484,7 @@ int journal_force_commit_nested(journal_t *journal) spin_lock(&journal->j_state_lock); if (journal->j_running_transaction && !current->journal_info) { transaction = journal->j_running_transaction; - __log_start_commit(journal, transaction->t_tid); + __log_start_commit(journal, transaction->t_tid, true); } else if (journal->j_committing_transaction) transaction = journal->j_committing_transaction; @@ -503,7 +504,7 @@ int journal_force_commit_nested(journal_t *journal) * if a transaction is going to be committed (or is currently already * committing), and fills its tid in at *ptid */ -int journal_start_commit(journal_t *journal, tid_t *ptid) +int journal_start_commit(journal_t *journal, tid_t *ptid, bool will_wait) { int ret = 0; @@ -511,7 +512,7 @@ int journal_start_commit(journal_t *journal, tid_t *ptid) if (journal->j_running_transaction) { tid_t tid = journal->j_running_transaction->t_tid; - __log_start_commit(journal, tid); + __log_start_commit(journal, tid, will_wait); /* There's a running transaction and we've just made sure * it's commit has been scheduled. */ if (ptid) @@ -1439,7 +1440,7 @@ int journal_flush(journal_t *journal) /* Force everything buffered to the log... */ if (journal->j_running_transaction) { transaction = journal->j_running_transaction; - __log_start_commit(journal, transaction->t_tid); + __log_start_commit(journal, transaction->t_tid, true); } else if (journal->j_committing_transaction) transaction = journal->j_committing_transaction; @@ -1573,7 +1574,7 @@ static void __journal_abort_hard(journal_t *journal) journal->j_flags |= JFS_ABORT; transaction = journal->j_running_transaction; if (transaction) - __log_start_commit(journal, transaction->t_tid); + __log_start_commit(journal, transaction->t_tid, false); spin_unlock(&journal->j_state_lock); } diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c index 5b2e4c3..a12c0a3 100644 --- a/fs/jbd/transaction.c +++ b/fs/jbd/transaction.c @@ -178,7 +178,7 @@ repeat_locked: spin_unlock(&transaction->t_handle_lock); prepare_to_wait(&journal->j_wait_transaction_locked, &wait, TASK_UNINTERRUPTIBLE); - __log_start_commit(journal, transaction->t_tid); + __log_start_commit(journal, transaction->t_tid, false); spin_unlock(&journal->j_state_lock); schedule(); finish_wait(&journal->j_wait_transaction_locked, &wait); @@ -411,7 +411,7 @@ int journal_restart(handle_t *handle, int nblocks) spin_unlock(&transaction->t_handle_lock); jbd_debug(2, "restarting handle %p\n", handle); - __log_start_commit(journal, transaction->t_tid); + __log_start_commit(journal, transaction->t_tid, false); spin_unlock(&journal->j_state_lock); lock_map_release(&handle->h_lockdep_map); @@ -1431,8 +1431,6 @@ int journal_stop(handle_t *handle) } } - if (handle->h_sync) - transaction->t_synchronous_commit = 1; current->journal_info = NULL; spin_lock(&journal->j_state_lock); spin_lock(&transaction->t_handle_lock); @@ -1463,7 +1461,8 @@ int journal_stop(handle_t *handle) jbd_debug(2, "transaction too old, requesting commit for " "handle %p\n", handle); /* This is non-blocking */ - __log_start_commit(journal, transaction->t_tid); + __log_start_commit(journal, transaction->t_tid, + handle->h_sync && !(current->flags & PF_MEMALLOC)); spin_unlock(&journal->j_state_lock); /* diff --git a/include/linux/jbd.h b/include/linux/jbd.h index e069650..c38f73e 100644 --- a/include/linux/jbd.h +++ b/include/linux/jbd.h @@ -541,12 +541,6 @@ struct transaction_s * How many handles used this transaction? [t_handle_lock] */ int t_handle_count; - - /* - * This transaction is being forced and some process is - * waiting for it to finish. - */ - unsigned int t_synchronous_commit:1; }; /** @@ -594,6 +588,8 @@ struct transaction_s * transaction * @j_commit_request: Sequence number of the most recent transaction wanting * commit + * @j_commit_waited: Sequence number of the most recent transaction someone + * is waiting for to commit. * @j_uuid: Uuid of client object. * @j_task: Pointer to the current commit thread for this journal * @j_max_transaction_buffers: Maximum number of metadata buffers to allow in a @@ -762,6 +758,13 @@ struct journal_s tid_t j_commit_request; /* + * Sequence number of the most recent transaction someone is waiting + * for to commit. + * [j_state_lock] + */ + tid_t j_commit_waited; + + /* * Journal uuid: identifies the object (filesystem, LVM volume etc) * backed by this journal. This will eventually be replaced by an array * of uuids, allowing us to index multiple devices within a single @@ -985,9 +988,9 @@ extern void journal_switch_revoke_table(journal_t *journal); */ int __log_space_left(journal_t *); /* Called with journal locked */ -int log_start_commit(journal_t *journal, tid_t tid); -int __log_start_commit(journal_t *journal, tid_t tid); -int journal_start_commit(journal_t *journal, tid_t *tid); +int log_start_commit(journal_t *journal, tid_t tid, bool will_wait); +int __log_start_commit(journal_t *journal, tid_t tid, bool will_wait); +int journal_start_commit(journal_t *journal, tid_t *tid, bool will_wait); int journal_force_commit_nested(journal_t *journal); int log_wait_commit(journal_t *journal, tid_t tid); int log_do_checkpoint(journal_t *journal); -- 1.7.1 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-02-21 16:49 ` Jan Kara @ 2011-02-23 8:24 ` Alex,Shi 2011-02-24 12:13 ` Jan Kara 0 siblings, 1 reply; 38+ messages in thread From: Alex,Shi @ 2011-02-23 8:24 UTC (permalink / raw) To: Jan Kara Cc: Li, Shaohua, Corrado Zoccolo, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C Though these patches can not totally recovered the problem, but they are quite helpful with ccache enabled situation. It increase 10% performance on 38-rc1 kernel. I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here: with patches: r b swpd free buff cache si so bi bo in cs us sy id wa st 1.5 24.7 0.0 64316199.8 17240.8 153376.9 0.0 0.0 1777.1 1788.0 6479.0 2605.3 1.7 0.9 89.1 8.2 0.0 original 38-rc1 kernel: 2.4 32.3 0.0 63653302.9 17170.6 153125.3 0.0 0.0 1579.7 1834.1 6016.4 2407.0 1.5 0.7 86.6 10.1 0.0 It reduce write blocks clearly. > > can you send a patch with both ext3 and ext4 changes? Our test shows > > your patch has positive effect, but need confirm with the ext3 change. > Sure. Patches for both ext3 & ext4 are attached. Sorry, it took me a > while to get to this. > > Honza ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-02-23 8:24 ` Alex,Shi @ 2011-02-24 12:13 ` Jan Kara 2011-02-25 0:44 ` Alex Shi 2011-02-26 14:45 ` Corrado Zoccolo 0 siblings, 2 replies; 38+ messages in thread From: Jan Kara @ 2011-02-24 12:13 UTC (permalink / raw) To: Alex,Shi Cc: Jan Kara, Li, Shaohua, Corrado Zoccolo, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C [-- Attachment #1: Type: text/plain, Size: 673 bytes --] On Wed 23-02-11 16:24:47, Alex,Shi wrote: > Though these patches can not totally recovered the problem, but they are > quite helpful with ccache enabled situation. It increase 10% performance > on 38-rc1 kernel. OK and what was the original performance drop with WRITE_SYNC change? > I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here: > with patches: I'm attaching patches rebased on top of latest Linus's tree. Corrado, could you possibly run your fsync-heavy tests so that we see whether there isn't negative impact of my patches on your fsync-heavy workload? Thanks. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR [-- Attachment #2: 0001-jbd2-Refine-commit-writeout-logic.patch --] [-- Type: text/x-patch, Size: 10946 bytes --] >From 555d33d45bd38a8da6b187a15ad5eb9e9c4f1ab9 Mon Sep 17 00:00:00 2001 From: Jan Kara <jack@suse.cz> Date: Wed, 19 Jan 2011 13:45:04 +0100 Subject: [PATCH 1/2] jbd2: Refine commit writeout logic Currently we write out all journal buffers in WRITE_SYNC mode. This improves performance for fsync heavy workloads but hinders performance when writes are mostly asynchronous. So add possibility for callers starting a transaction commit to specify whether they are going to wait for the commit and submit journal writes in WRITE_SYNC mode only in that case. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/fsync.c | 2 +- fs/ext4/super.c | 2 +- fs/jbd2/checkpoint.c | 2 +- fs/jbd2/commit.c | 4 ++-- fs/jbd2/journal.c | 19 ++++++++++--------- fs/jbd2/transaction.c | 15 +++++++++------ fs/ocfs2/aops.c | 2 +- fs/ocfs2/super.c | 2 +- include/linux/jbd2.h | 18 ++++++++++-------- 9 files changed, 36 insertions(+), 30 deletions(-) diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 7829b28..19434da 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -198,7 +198,7 @@ int ext4_sync_file(struct file *file, int datasync) return ext4_force_commit(inode->i_sb); commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid; - if (jbd2_log_start_commit(journal, commit_tid)) { + if (jbd2_log_start_commit(journal, commit_tid, true)) { /* * When the journal is on a different device than the * fs data disk, we need to issue the barrier in diff --git a/fs/ext4/super.c b/fs/ext4/super.c index f6a318f..bfce0d6 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4115,7 +4115,7 @@ static int ext4_sync_fs(struct super_block *sb, int wait) trace_ext4_sync_fs(sb, wait); flush_workqueue(sbi->dio_unwritten_wq); - if (jbd2_journal_start_commit(sbi->s_journal, &target)) { + if (jbd2_journal_start_commit(sbi->s_journal, &target, true)) { if (wait) jbd2_log_wait_commit(sbi->s_journal, target); } diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c index 6a79fd0..3436d53 100644 --- a/fs/jbd2/checkpoint.c +++ b/fs/jbd2/checkpoint.c @@ -309,7 +309,7 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh, "Waiting for Godot: block %llu\n", journal->j_devname, (unsigned long long) bh->b_blocknr); - jbd2_log_start_commit(journal, tid); + jbd2_log_start_commit(journal, tid, true); jbd2_log_wait_commit(journal, tid); ret = 1; } else if (!buffer_dirty(bh)) { diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index f3ad159..19973eb 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) int tag_bytes = journal_tag_bytes(journal); struct buffer_head *cbh = NULL; /* For transactional checksums */ __u32 crc32_sum = ~0; - int write_op = WRITE_SYNC; + int write_op = WRITE; /* * First job: lock down the current transaction and wait for @@ -368,7 +368,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) * we unplug the device. We don't do explicit unplugging in here, * instead we rely on sync_buffer() doing the unplug for us. */ - if (commit_transaction->t_synchronous_commit) + if (tid_geq(journal->j_commit_waited, commit_transaction->t_tid)) write_op = WRITE_SYNC_PLUG; trace_jbd2_commit_locking(journal, commit_transaction); stats.run.rs_wait = commit_transaction->t_max_wait; diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c index 97e7346..04835d6 100644 --- a/fs/jbd2/journal.c +++ b/fs/jbd2/journal.c @@ -476,7 +476,7 @@ int __jbd2_log_space_left(journal_t *journal) * Called with j_state_lock locked for writing. * Returns true if a transaction commit was started. */ -int __jbd2_log_start_commit(journal_t *journal, tid_t target) +int __jbd2_log_start_commit(journal_t *journal, tid_t target, bool will_wait) { /* * Are we already doing a recent enough commit? @@ -486,7 +486,8 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target) * We want a new commit: OK, mark the request and wakeup the * commit thread. We do _not_ do the commit ourselves. */ - + if (will_wait && !tid_geq(journal->j_commit_waited, target)) + journal->j_commit_waited = target; journal->j_commit_request = target; jbd_debug(1, "JBD: requesting commit %d/%d\n", journal->j_commit_request, @@ -497,12 +498,12 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target) return 0; } -int jbd2_log_start_commit(journal_t *journal, tid_t tid) +int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait) { int ret; write_lock(&journal->j_state_lock); - ret = __jbd2_log_start_commit(journal, tid); + ret = __jbd2_log_start_commit(journal, tid, will_wait); write_unlock(&journal->j_state_lock); return ret; } @@ -539,7 +540,7 @@ int jbd2_journal_force_commit_nested(journal_t *journal) tid = transaction->t_tid; read_unlock(&journal->j_state_lock); if (need_to_start) - jbd2_log_start_commit(journal, tid); + jbd2_log_start_commit(journal, tid, true); jbd2_log_wait_commit(journal, tid); return 1; } @@ -549,7 +550,7 @@ int jbd2_journal_force_commit_nested(journal_t *journal) * if a transaction is going to be committed (or is currently already * committing), and fills its tid in at *ptid */ -int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid) +int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid, bool will_wait) { int ret = 0; @@ -557,7 +558,7 @@ int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid) if (journal->j_running_transaction) { tid_t tid = journal->j_running_transaction->t_tid; - __jbd2_log_start_commit(journal, tid); + __jbd2_log_start_commit(journal, tid, will_wait); /* There's a running transaction and we've just made sure * it's commit has been scheduled. */ if (ptid) @@ -1564,7 +1565,7 @@ int jbd2_journal_flush(journal_t *journal) /* Force everything buffered to the log... */ if (journal->j_running_transaction) { transaction = journal->j_running_transaction; - __jbd2_log_start_commit(journal, transaction->t_tid); + __jbd2_log_start_commit(journal, transaction->t_tid, true); } else if (journal->j_committing_transaction) transaction = journal->j_committing_transaction; @@ -1680,7 +1681,7 @@ void __jbd2_journal_abort_hard(journal_t *journal) journal->j_flags |= JBD2_ABORT; transaction = journal->j_running_transaction; if (transaction) - __jbd2_log_start_commit(journal, transaction->t_tid); + __jbd2_log_start_commit(journal, transaction->t_tid, false); write_unlock(&journal->j_state_lock); } diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index 1d11910..340dd5e 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -226,7 +226,7 @@ repeat: need_to_start = !tid_geq(journal->j_commit_request, tid); read_unlock(&journal->j_state_lock); if (need_to_start) - jbd2_log_start_commit(journal, tid); + jbd2_log_start_commit(journal, tid, false); schedule(); finish_wait(&journal->j_wait_transaction_locked, &wait); goto repeat; @@ -469,8 +469,12 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask) spin_unlock(&transaction->t_handle_lock); jbd_debug(2, "restarting handle %p\n", handle); +<<<<<<< HEAD tid = transaction->t_tid; need_to_start = !tid_geq(journal->j_commit_request, tid); +======= + __jbd2_log_start_commit(journal, transaction->t_tid, false); +>>>>>>> jbd2: Refine commit writeout logic read_unlock(&journal->j_state_lock); if (need_to_start) jbd2_log_start_commit(journal, tid); @@ -1368,8 +1372,6 @@ int jbd2_journal_stop(handle_t *handle) } } - if (handle->h_sync) - transaction->t_synchronous_commit = 1; current->journal_info = NULL; atomic_sub(handle->h_buffer_credits, &transaction->t_outstanding_credits); @@ -1390,15 +1392,16 @@ int jbd2_journal_stop(handle_t *handle) jbd_debug(2, "transaction too old, requesting commit for " "handle %p\n", handle); - /* This is non-blocking */ - jbd2_log_start_commit(journal, transaction->t_tid); - /* * Special case: JBD2_SYNC synchronous updates require us * to wait for the commit to complete. */ if (handle->h_sync && !(current->flags & PF_MEMALLOC)) wait_for_commit = 1; + + /* This is non-blocking */ + jbd2_log_start_commit(journal, transaction->t_tid, + wait_for_commit); } /* diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c index 1fbb0e2..d493f32 100644 --- a/fs/ocfs2/aops.c +++ b/fs/ocfs2/aops.c @@ -1659,7 +1659,7 @@ static int ocfs2_try_to_free_truncate_log(struct ocfs2_super *osb, goto out; } - if (jbd2_journal_start_commit(osb->journal->j_journal, &target)) { + if (jbd2_journal_start_commit(osb->journal->j_journal, &target, true)) { jbd2_log_wait_commit(osb->journal->j_journal, target); ret = 1; } diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index 38f986d..45d9f82 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -414,7 +414,7 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait) } if (jbd2_journal_start_commit(OCFS2_SB(sb)->journal->j_journal, - &target)) { + &target, wait)) { if (wait) jbd2_log_wait_commit(OCFS2_SB(sb)->journal->j_journal, target); diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index 27e79c2..46aaf45 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -631,11 +631,6 @@ struct transaction_s */ atomic_t t_handle_count; - /* - * This transaction is being forced and some process is - * waiting for it to finish. - */ - unsigned int t_synchronous_commit:1; unsigned int t_flushed_data_blocks:1; /* @@ -900,6 +895,13 @@ struct journal_s tid_t j_commit_request; /* + * Sequence number of the most recent transaction someone is waiting + * for to commit. + * [j_state_lock] + */ + tid_t j_commit_waited; + + /* * Journal uuid: identifies the object (filesystem, LVM volume etc) * backed by this journal. This will eventually be replaced by an array * of uuids, allowing us to index multiple devices within a single @@ -1200,9 +1202,9 @@ extern void jbd2_journal_switch_revoke_table(journal_t *journal); */ int __jbd2_log_space_left(journal_t *); /* Called with journal locked */ -int jbd2_log_start_commit(journal_t *journal, tid_t tid); -int __jbd2_log_start_commit(journal_t *journal, tid_t tid); -int jbd2_journal_start_commit(journal_t *journal, tid_t *tid); +int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait); +int __jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait); +int jbd2_journal_start_commit(journal_t *journal, tid_t *tid, bool will_wait); int jbd2_journal_force_commit_nested(journal_t *journal); int jbd2_log_wait_commit(journal_t *journal, tid_t tid); int jbd2_log_do_checkpoint(journal_t *journal); -- 1.7.1 [-- Attachment #3: 0002-jbd-Refine-commit-writeout-logic.patch --] [-- Type: text/x-patch, Size: 9843 bytes --] >From 3e4f42155513df7b6d36584e7b8383b86f425467 Mon Sep 17 00:00:00 2001 From: Jan Kara <jack@suse.cz> Date: Mon, 21 Feb 2011 17:25:37 +0100 Subject: [PATCH 2/2] jbd: Refine commit writeout logic Currently we write out all journal buffers in WRITE_SYNC mode. This improves performance for fsync heavy workloads but hinders performance when writes are mostly asynchronous. So add possibility for callers starting a transaction commit to specify whether they are going to wait for the commit and submit journal writes in WRITE_SYNC mode only in that case. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext3/fsync.c | 2 +- fs/ext3/super.c | 2 +- fs/jbd/checkpoint.c | 2 +- fs/jbd/commit.c | 4 ++-- fs/jbd/journal.c | 19 ++++++++++--------- fs/jbd/transaction.c | 9 ++++----- include/linux/jbd.h | 21 ++++++++++++--------- 7 files changed, 31 insertions(+), 28 deletions(-) diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c index 09b13bb..5396dd6 100644 --- a/fs/ext3/fsync.c +++ b/fs/ext3/fsync.c @@ -81,7 +81,7 @@ int ext3_sync_file(struct file *file, int datasync) if (test_opt(inode->i_sb, BARRIER) && !journal_trans_will_send_data_barrier(journal, commit_tid)) needs_barrier = 1; - log_start_commit(journal, commit_tid); + log_start_commit(journal, commit_tid, true); ret = log_wait_commit(journal, commit_tid); /* diff --git a/fs/ext3/super.c b/fs/ext3/super.c index 85c8cc8..58a4424 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -2497,7 +2497,7 @@ static int ext3_sync_fs(struct super_block *sb, int wait) { tid_t target; - if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) { + if (journal_start_commit(EXT3_SB(sb)->s_journal, &target, true)) { if (wait) log_wait_commit(EXT3_SB(sb)->s_journal, target); } diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c index e4b87bc..26ca2c3 100644 --- a/fs/jbd/checkpoint.c +++ b/fs/jbd/checkpoint.c @@ -297,7 +297,7 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh, spin_unlock(&journal->j_list_lock); jbd_unlock_bh_state(bh); - log_start_commit(journal, tid); + log_start_commit(journal, tid, true); log_wait_commit(journal, tid); ret = 1; } else if (!buffer_dirty(bh)) { diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c index 34a4861..ac188cb 100644 --- a/fs/jbd/commit.c +++ b/fs/jbd/commit.c @@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal) int first_tag = 0; int tag_flag; int i; - int write_op = WRITE_SYNC; + int write_op = WRITE; /* * First job: lock down the current transaction and wait for @@ -332,7 +332,7 @@ void journal_commit_transaction(journal_t *journal) * we unplug the device. We don't do explicit unplugging in here, * instead we rely on sync_buffer() doing the unplug for us. */ - if (commit_transaction->t_synchronous_commit) + if (tid_geq(journal->j_commit_waited, commit_transaction->t_tid)) write_op = WRITE_SYNC_PLUG; spin_lock(&commit_transaction->t_handle_lock); while (commit_transaction->t_updates) { diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c index da1b5e4..5743b4c 100644 --- a/fs/jbd/journal.c +++ b/fs/jbd/journal.c @@ -434,7 +434,7 @@ int __log_space_left(journal_t *journal) /* * Called under j_state_lock. Returns true if a transaction commit was started. */ -int __log_start_commit(journal_t *journal, tid_t target) +int __log_start_commit(journal_t *journal, tid_t target, bool will_wait) { /* * Are we already doing a recent enough commit? @@ -444,7 +444,8 @@ int __log_start_commit(journal_t *journal, tid_t target) * We want a new commit: OK, mark the request and wakeup the * commit thread. We do _not_ do the commit ourselves. */ - + if (will_wait && !tid_geq(journal->j_commit_waited, target)) + journal->j_commit_waited = target; journal->j_commit_request = target; jbd_debug(1, "JBD: requesting commit %d/%d\n", journal->j_commit_request, @@ -455,12 +456,12 @@ int __log_start_commit(journal_t *journal, tid_t target) return 0; } -int log_start_commit(journal_t *journal, tid_t tid) +int log_start_commit(journal_t *journal, tid_t tid, bool will_wait) { int ret; spin_lock(&journal->j_state_lock); - ret = __log_start_commit(journal, tid); + ret = __log_start_commit(journal, tid, will_wait); spin_unlock(&journal->j_state_lock); return ret; } @@ -483,7 +484,7 @@ int journal_force_commit_nested(journal_t *journal) spin_lock(&journal->j_state_lock); if (journal->j_running_transaction && !current->journal_info) { transaction = journal->j_running_transaction; - __log_start_commit(journal, transaction->t_tid); + __log_start_commit(journal, transaction->t_tid, true); } else if (journal->j_committing_transaction) transaction = journal->j_committing_transaction; @@ -503,7 +504,7 @@ int journal_force_commit_nested(journal_t *journal) * if a transaction is going to be committed (or is currently already * committing), and fills its tid in at *ptid */ -int journal_start_commit(journal_t *journal, tid_t *ptid) +int journal_start_commit(journal_t *journal, tid_t *ptid, bool will_wait) { int ret = 0; @@ -511,7 +512,7 @@ int journal_start_commit(journal_t *journal, tid_t *ptid) if (journal->j_running_transaction) { tid_t tid = journal->j_running_transaction->t_tid; - __log_start_commit(journal, tid); + __log_start_commit(journal, tid, will_wait); /* There's a running transaction and we've just made sure * it's commit has been scheduled. */ if (ptid) @@ -1439,7 +1440,7 @@ int journal_flush(journal_t *journal) /* Force everything buffered to the log... */ if (journal->j_running_transaction) { transaction = journal->j_running_transaction; - __log_start_commit(journal, transaction->t_tid); + __log_start_commit(journal, transaction->t_tid, true); } else if (journal->j_committing_transaction) transaction = journal->j_committing_transaction; @@ -1573,7 +1574,7 @@ static void __journal_abort_hard(journal_t *journal) journal->j_flags |= JFS_ABORT; transaction = journal->j_running_transaction; if (transaction) - __log_start_commit(journal, transaction->t_tid); + __log_start_commit(journal, transaction->t_tid, false); spin_unlock(&journal->j_state_lock); } diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c index 5b2e4c3..a12c0a3 100644 --- a/fs/jbd/transaction.c +++ b/fs/jbd/transaction.c @@ -178,7 +178,7 @@ repeat_locked: spin_unlock(&transaction->t_handle_lock); prepare_to_wait(&journal->j_wait_transaction_locked, &wait, TASK_UNINTERRUPTIBLE); - __log_start_commit(journal, transaction->t_tid); + __log_start_commit(journal, transaction->t_tid, false); spin_unlock(&journal->j_state_lock); schedule(); finish_wait(&journal->j_wait_transaction_locked, &wait); @@ -411,7 +411,7 @@ int journal_restart(handle_t *handle, int nblocks) spin_unlock(&transaction->t_handle_lock); jbd_debug(2, "restarting handle %p\n", handle); - __log_start_commit(journal, transaction->t_tid); + __log_start_commit(journal, transaction->t_tid, false); spin_unlock(&journal->j_state_lock); lock_map_release(&handle->h_lockdep_map); @@ -1431,8 +1431,6 @@ int journal_stop(handle_t *handle) } } - if (handle->h_sync) - transaction->t_synchronous_commit = 1; current->journal_info = NULL; spin_lock(&journal->j_state_lock); spin_lock(&transaction->t_handle_lock); @@ -1463,7 +1461,8 @@ int journal_stop(handle_t *handle) jbd_debug(2, "transaction too old, requesting commit for " "handle %p\n", handle); /* This is non-blocking */ - __log_start_commit(journal, transaction->t_tid); + __log_start_commit(journal, transaction->t_tid, + handle->h_sync && !(current->flags & PF_MEMALLOC)); spin_unlock(&journal->j_state_lock); /* diff --git a/include/linux/jbd.h b/include/linux/jbd.h index e069650..c38f73e 100644 --- a/include/linux/jbd.h +++ b/include/linux/jbd.h @@ -541,12 +541,6 @@ struct transaction_s * How many handles used this transaction? [t_handle_lock] */ int t_handle_count; - - /* - * This transaction is being forced and some process is - * waiting for it to finish. - */ - unsigned int t_synchronous_commit:1; }; /** @@ -594,6 +588,8 @@ struct transaction_s * transaction * @j_commit_request: Sequence number of the most recent transaction wanting * commit + * @j_commit_waited: Sequence number of the most recent transaction someone + * is waiting for to commit. * @j_uuid: Uuid of client object. * @j_task: Pointer to the current commit thread for this journal * @j_max_transaction_buffers: Maximum number of metadata buffers to allow in a @@ -762,6 +758,13 @@ struct journal_s tid_t j_commit_request; /* + * Sequence number of the most recent transaction someone is waiting + * for to commit. + * [j_state_lock] + */ + tid_t j_commit_waited; + + /* * Journal uuid: identifies the object (filesystem, LVM volume etc) * backed by this journal. This will eventually be replaced by an array * of uuids, allowing us to index multiple devices within a single @@ -985,9 +988,9 @@ extern void journal_switch_revoke_table(journal_t *journal); */ int __log_space_left(journal_t *); /* Called with journal locked */ -int log_start_commit(journal_t *journal, tid_t tid); -int __log_start_commit(journal_t *journal, tid_t tid); -int journal_start_commit(journal_t *journal, tid_t *tid); +int log_start_commit(journal_t *journal, tid_t tid, bool will_wait); +int __log_start_commit(journal_t *journal, tid_t tid, bool will_wait); +int journal_start_commit(journal_t *journal, tid_t *tid, bool will_wait); int journal_force_commit_nested(journal_t *journal); int log_wait_commit(journal_t *journal, tid_t tid); int log_do_checkpoint(journal_t *journal); -- 1.7.1 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-02-24 12:13 ` Jan Kara @ 2011-02-25 0:44 ` Alex Shi 2011-02-26 14:45 ` Corrado Zoccolo 1 sibling, 0 replies; 38+ messages in thread From: Alex Shi @ 2011-02-25 0:44 UTC (permalink / raw) To: Jan Kara Cc: Li, Shaohua, Corrado Zoccolo, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C Jan Kara wrote: > On Wed 23-02-11 16:24:47, Alex,Shi wrote: > >> Though these patches can not totally recovered the problem, but they are >> quite helpful with ccache enabled situation. It increase 10% performance >> on 38-rc1 kernel. >> > OK and what was the original performance drop with WRITE_SYNC change? > The original drop is 30%. > >> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here: >> with patches: >> > I'm attaching patches rebased on top of latest Linus's tree. > Corrado, could you possibly run your fsync-heavy tests so that we see > whether there isn't negative impact of my patches on your fsync-heavy > workload? Thanks. > > Honza > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-02-24 12:13 ` Jan Kara 2011-02-25 0:44 ` Alex Shi @ 2011-02-26 14:45 ` Corrado Zoccolo 2011-03-01 19:56 ` Jeff Moyer 1 sibling, 1 reply; 38+ messages in thread From: Corrado Zoccolo @ 2011-02-26 14:45 UTC (permalink / raw) To: Jan Kara, Jeff Moyer Cc: Alex,Shi, Li, Shaohua, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C On Thu, Feb 24, 2011 at 1:13 PM, Jan Kara <jack@suse.cz> wrote: > On Wed 23-02-11 16:24:47, Alex,Shi wrote: >> Though these patches can not totally recovered the problem, but they are >> quite helpful with ccache enabled situation. It increase 10% performance >> on 38-rc1 kernel. > OK and what was the original performance drop with WRITE_SYNC change? > >> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here: >> with patches: > I'm attaching patches rebased on top of latest Linus's tree. > Corrado, could you possibly run your fsync-heavy tests so that we see > whether there isn't negative impact of my patches on your fsync-heavy > workload? Thanks. The workload was actually Jeff's, and the stalls that my change tried to mitigate showed up on his enterprise class storage. Adding him so he can test it. Corrado > Honza > -- > Jan Kara <jack@suse.cz> > SUSE Labs, CR > -- __________________________________________________________________________ dott. Corrado Zoccolo mailto:czoccolo@gmail.com PhD - Department of Computer Science - University of Pisa, Italy -------------------------------------------------------------------------- The self-confidence of a warrior is not the self-confidence of the average man. The average man seeks certainty in the eyes of the onlooker and calls that self-confidence. The warrior seeks impeccability in his own eyes and calls that humbleness. Tales of Power - C. Castaneda ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-02-26 14:45 ` Corrado Zoccolo @ 2011-03-01 19:56 ` Jeff Moyer 2011-03-02 9:42 ` Jan Kara 0 siblings, 1 reply; 38+ messages in thread From: Jeff Moyer @ 2011-03-01 19:56 UTC (permalink / raw) To: Corrado Zoccolo Cc: Jan Kara, Alex,Shi, Li, Shaohua, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C Corrado Zoccolo <czoccolo@gmail.com> writes: > On Thu, Feb 24, 2011 at 1:13 PM, Jan Kara <jack@suse.cz> wrote: >> On Wed 23-02-11 16:24:47, Alex,Shi wrote: >>> Though these patches can not totally recovered the problem, but they are >>> quite helpful with ccache enabled situation. It increase 10% performance >>> on 38-rc1 kernel. >> OK and what was the original performance drop with WRITE_SYNC change? >> >>> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here: >>> with patches: >> I'm attaching patches rebased on top of latest Linus's tree. >> Corrado, could you possibly run your fsync-heavy tests so that we see >> whether there isn't negative impact of my patches on your fsync-heavy >> workload? Thanks. > The workload was actually Jeff's, and the stalls that my change tried > to mitigate showed up on his enterprise class storage. Adding him so > he can test it. Sorry for the late reply. You can use either fs_mark or iozone to generate an fsync-heavy workload. The test I did was to mix this with a sequential reader. If you can point me at patches, I should be able to test this. Cheers, Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-03-01 19:56 ` Jeff Moyer @ 2011-03-02 9:42 ` Jan Kara 2011-03-02 16:13 ` Jeff Moyer 0 siblings, 1 reply; 38+ messages in thread From: Jan Kara @ 2011-03-02 9:42 UTC (permalink / raw) To: Jeff Moyer Cc: Corrado Zoccolo, Jan Kara, Alex,Shi, Li, Shaohua, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C On Tue 01-03-11 14:56:43, Jeff Moyer wrote: > Corrado Zoccolo <czoccolo@gmail.com> writes: > > > On Thu, Feb 24, 2011 at 1:13 PM, Jan Kara <jack@suse.cz> wrote: > >> On Wed 23-02-11 16:24:47, Alex,Shi wrote: > >>> Though these patches can not totally recovered the problem, but they are > >>> quite helpful with ccache enabled situation. It increase 10% performance > >>> on 38-rc1 kernel. > >> OK and what was the original performance drop with WRITE_SYNC change? > >> > >>> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here: > >>> with patches: > >> I'm attaching patches rebased on top of latest Linus's tree. > >> Corrado, could you possibly run your fsync-heavy tests so that we see > >> whether there isn't negative impact of my patches on your fsync-heavy > >> workload? Thanks. > > The workload was actually Jeff's, and the stalls that my change tried > > to mitigate showed up on his enterprise class storage. Adding him so > > he can test it. > > Sorry for the late reply. You can use either fs_mark or iozone to > generate an fsync-heavy workload. The test I did was to mix this with a > sequential reader. If you can point me at patches, I should be able to > test this. The latest version of patches is attached to: https://lkml.org/lkml/2011/2/24/125 Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-03-02 9:42 ` Jan Kara @ 2011-03-02 16:13 ` Jeff Moyer 2011-03-02 21:17 ` Jan Kara 0 siblings, 1 reply; 38+ messages in thread From: Jeff Moyer @ 2011-03-02 16:13 UTC (permalink / raw) To: Jan Kara Cc: Corrado Zoccolo, Alex,Shi, Li, Shaohua, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C Jan Kara <jack@suse.cz> writes: > On Tue 01-03-11 14:56:43, Jeff Moyer wrote: >> Corrado Zoccolo <czoccolo@gmail.com> writes: >> >> > On Thu, Feb 24, 2011 at 1:13 PM, Jan Kara <jack@suse.cz> wrote: >> >> On Wed 23-02-11 16:24:47, Alex,Shi wrote: >> >>> Though these patches can not totally recovered the problem, but they are >> >>> quite helpful with ccache enabled situation. It increase 10% performance >> >>> on 38-rc1 kernel. >> >> OK and what was the original performance drop with WRITE_SYNC change? >> >> >> >>> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here: >> >>> with patches: >> >> I'm attaching patches rebased on top of latest Linus's tree. >> >> Corrado, could you possibly run your fsync-heavy tests so that we see >> >> whether there isn't negative impact of my patches on your fsync-heavy >> >> workload? Thanks. >> > The workload was actually Jeff's, and the stalls that my change tried >> > to mitigate showed up on his enterprise class storage. Adding him so >> > he can test it. >> >> Sorry for the late reply. You can use either fs_mark or iozone to >> generate an fsync-heavy workload. The test I did was to mix this with a >> sequential reader. If you can point me at patches, I should be able to >> test this. > The latest version of patches is attached to: > https://lkml.org/lkml/2011/2/24/125 Perhaps you should fix up the merge conflicts, first? ;-) +<<<<<<< HEAD tid = transaction->t_tid; need_to_start = !tid_geq(journal->j_commit_request, tid); +======= + __jbd2_log_start_commit(journal, transaction->t_tid, false); +>>>>>>> jbd2: Refine commit writeout logic ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-03-02 16:13 ` Jeff Moyer @ 2011-03-02 21:17 ` Jan Kara 2011-03-02 21:20 ` Jeff Moyer 2011-03-03 1:14 ` Jeff Moyer 0 siblings, 2 replies; 38+ messages in thread From: Jan Kara @ 2011-03-02 21:17 UTC (permalink / raw) To: Jeff Moyer Cc: Jan Kara, Corrado Zoccolo, Alex,Shi, Li, Shaohua, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C [-- Attachment #1: Type: text/plain, Size: 1933 bytes --] On Wed 02-03-11 11:13:53, Jeff Moyer wrote: > Jan Kara <jack@suse.cz> writes: > > On Tue 01-03-11 14:56:43, Jeff Moyer wrote: > >> Corrado Zoccolo <czoccolo@gmail.com> writes: > >> > >> > On Thu, Feb 24, 2011 at 1:13 PM, Jan Kara <jack@suse.cz> wrote: > >> >> On Wed 23-02-11 16:24:47, Alex,Shi wrote: > >> >>> Though these patches can not totally recovered the problem, but they are > >> >>> quite helpful with ccache enabled situation. It increase 10% performance > >> >>> on 38-rc1 kernel. > >> >> OK and what was the original performance drop with WRITE_SYNC change? > >> >> > >> >>> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here: > >> >>> with patches: > >> >> I'm attaching patches rebased on top of latest Linus's tree. > >> >> Corrado, could you possibly run your fsync-heavy tests so that we see > >> >> whether there isn't negative impact of my patches on your fsync-heavy > >> >> workload? Thanks. > >> > The workload was actually Jeff's, and the stalls that my change tried > >> > to mitigate showed up on his enterprise class storage. Adding him so > >> > he can test it. > >> > >> Sorry for the late reply. You can use either fs_mark or iozone to > >> generate an fsync-heavy workload. The test I did was to mix this with a > >> sequential reader. If you can point me at patches, I should be able to > >> test this. > > The latest version of patches is attached to: > > https://lkml.org/lkml/2011/2/24/125 > > Perhaps you should fix up the merge conflicts, first? ;-) > > +<<<<<<< HEAD > tid = transaction->t_tid; > need_to_start = !tid_geq(journal->j_commit_request, tid); > +======= > + __jbd2_log_start_commit(journal, transaction->t_tid, false); > +>>>>>>> jbd2: Refine commit writeout logic Doh, how embarrassing ;). Attached is a new version which compiles and seems to run OK. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR [-- Attachment #2: 0001-jbd2-Refine-commit-writeout-logic.patch --] [-- Type: text/x-patch, Size: 10820 bytes --] >From 2d4018f7fe647a440f488d71cf4ceffccaecb02e Mon Sep 17 00:00:00 2001 From: Jan Kara <jack@suse.cz> Date: Wed, 19 Jan 2011 13:45:04 +0100 Subject: [PATCH 1/2] jbd2: Refine commit writeout logic Currently we write out all journal buffers in WRITE_SYNC mode. This improves performance for fsync heavy workloads but hinders performance when writes are mostly asynchronous. So add possibility for callers starting a transaction commit to specify whether they are going to wait for the commit and submit journal writes in WRITE_SYNC mode only in that case. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/fsync.c | 2 +- fs/ext4/super.c | 2 +- fs/jbd2/checkpoint.c | 2 +- fs/jbd2/commit.c | 4 ++-- fs/jbd2/journal.c | 19 ++++++++++--------- fs/jbd2/transaction.c | 13 ++++++------- fs/ocfs2/aops.c | 2 +- fs/ocfs2/super.c | 2 +- include/linux/jbd2.h | 18 ++++++++++-------- 9 files changed, 33 insertions(+), 31 deletions(-) diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 7829b28..19434da 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -198,7 +198,7 @@ int ext4_sync_file(struct file *file, int datasync) return ext4_force_commit(inode->i_sb); commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid; - if (jbd2_log_start_commit(journal, commit_tid)) { + if (jbd2_log_start_commit(journal, commit_tid, true)) { /* * When the journal is on a different device than the * fs data disk, we need to issue the barrier in diff --git a/fs/ext4/super.c b/fs/ext4/super.c index f6a318f..bfce0d6 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4115,7 +4115,7 @@ static int ext4_sync_fs(struct super_block *sb, int wait) trace_ext4_sync_fs(sb, wait); flush_workqueue(sbi->dio_unwritten_wq); - if (jbd2_journal_start_commit(sbi->s_journal, &target)) { + if (jbd2_journal_start_commit(sbi->s_journal, &target, true)) { if (wait) jbd2_log_wait_commit(sbi->s_journal, target); } diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c index 6a79fd0..3436d53 100644 --- a/fs/jbd2/checkpoint.c +++ b/fs/jbd2/checkpoint.c @@ -309,7 +309,7 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh, "Waiting for Godot: block %llu\n", journal->j_devname, (unsigned long long) bh->b_blocknr); - jbd2_log_start_commit(journal, tid); + jbd2_log_start_commit(journal, tid, true); jbd2_log_wait_commit(journal, tid); ret = 1; } else if (!buffer_dirty(bh)) { diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index f3ad159..19973eb 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) int tag_bytes = journal_tag_bytes(journal); struct buffer_head *cbh = NULL; /* For transactional checksums */ __u32 crc32_sum = ~0; - int write_op = WRITE_SYNC; + int write_op = WRITE; /* * First job: lock down the current transaction and wait for @@ -368,7 +368,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) * we unplug the device. We don't do explicit unplugging in here, * instead we rely on sync_buffer() doing the unplug for us. */ - if (commit_transaction->t_synchronous_commit) + if (tid_geq(journal->j_commit_waited, commit_transaction->t_tid)) write_op = WRITE_SYNC_PLUG; trace_jbd2_commit_locking(journal, commit_transaction); stats.run.rs_wait = commit_transaction->t_max_wait; diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c index 97e7346..04835d6 100644 --- a/fs/jbd2/journal.c +++ b/fs/jbd2/journal.c @@ -476,7 +476,7 @@ int __jbd2_log_space_left(journal_t *journal) * Called with j_state_lock locked for writing. * Returns true if a transaction commit was started. */ -int __jbd2_log_start_commit(journal_t *journal, tid_t target) +int __jbd2_log_start_commit(journal_t *journal, tid_t target, bool will_wait) { /* * Are we already doing a recent enough commit? @@ -486,7 +486,8 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target) * We want a new commit: OK, mark the request and wakeup the * commit thread. We do _not_ do the commit ourselves. */ - + if (will_wait && !tid_geq(journal->j_commit_waited, target)) + journal->j_commit_waited = target; journal->j_commit_request = target; jbd_debug(1, "JBD: requesting commit %d/%d\n", journal->j_commit_request, @@ -497,12 +498,12 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target) return 0; } -int jbd2_log_start_commit(journal_t *journal, tid_t tid) +int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait) { int ret; write_lock(&journal->j_state_lock); - ret = __jbd2_log_start_commit(journal, tid); + ret = __jbd2_log_start_commit(journal, tid, will_wait); write_unlock(&journal->j_state_lock); return ret; } @@ -539,7 +540,7 @@ int jbd2_journal_force_commit_nested(journal_t *journal) tid = transaction->t_tid; read_unlock(&journal->j_state_lock); if (need_to_start) - jbd2_log_start_commit(journal, tid); + jbd2_log_start_commit(journal, tid, true); jbd2_log_wait_commit(journal, tid); return 1; } @@ -549,7 +550,7 @@ int jbd2_journal_force_commit_nested(journal_t *journal) * if a transaction is going to be committed (or is currently already * committing), and fills its tid in at *ptid */ -int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid) +int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid, bool will_wait) { int ret = 0; @@ -557,7 +558,7 @@ int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid) if (journal->j_running_transaction) { tid_t tid = journal->j_running_transaction->t_tid; - __jbd2_log_start_commit(journal, tid); + __jbd2_log_start_commit(journal, tid, will_wait); /* There's a running transaction and we've just made sure * it's commit has been scheduled. */ if (ptid) @@ -1564,7 +1565,7 @@ int jbd2_journal_flush(journal_t *journal) /* Force everything buffered to the log... */ if (journal->j_running_transaction) { transaction = journal->j_running_transaction; - __jbd2_log_start_commit(journal, transaction->t_tid); + __jbd2_log_start_commit(journal, transaction->t_tid, true); } else if (journal->j_committing_transaction) transaction = journal->j_committing_transaction; @@ -1680,7 +1681,7 @@ void __jbd2_journal_abort_hard(journal_t *journal) journal->j_flags |= JBD2_ABORT; transaction = journal->j_running_transaction; if (transaction) - __jbd2_log_start_commit(journal, transaction->t_tid); + __jbd2_log_start_commit(journal, transaction->t_tid, false); write_unlock(&journal->j_state_lock); } diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index 1d11910..2211dae 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -226,7 +226,7 @@ repeat: need_to_start = !tid_geq(journal->j_commit_request, tid); read_unlock(&journal->j_state_lock); if (need_to_start) - jbd2_log_start_commit(journal, tid); + jbd2_log_start_commit(journal, tid, false); schedule(); finish_wait(&journal->j_wait_transaction_locked, &wait); goto repeat; @@ -473,7 +473,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask) need_to_start = !tid_geq(journal->j_commit_request, tid); read_unlock(&journal->j_state_lock); if (need_to_start) - jbd2_log_start_commit(journal, tid); + jbd2_log_start_commit(journal, tid, false); lock_map_release(&handle->h_lockdep_map); handle->h_buffer_credits = nblocks; @@ -1368,8 +1368,6 @@ int jbd2_journal_stop(handle_t *handle) } } - if (handle->h_sync) - transaction->t_synchronous_commit = 1; current->journal_info = NULL; atomic_sub(handle->h_buffer_credits, &transaction->t_outstanding_credits); @@ -1390,15 +1388,16 @@ int jbd2_journal_stop(handle_t *handle) jbd_debug(2, "transaction too old, requesting commit for " "handle %p\n", handle); - /* This is non-blocking */ - jbd2_log_start_commit(journal, transaction->t_tid); - /* * Special case: JBD2_SYNC synchronous updates require us * to wait for the commit to complete. */ if (handle->h_sync && !(current->flags & PF_MEMALLOC)) wait_for_commit = 1; + + /* This is non-blocking */ + jbd2_log_start_commit(journal, transaction->t_tid, + wait_for_commit); } /* diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c index 1fbb0e2..d493f32 100644 --- a/fs/ocfs2/aops.c +++ b/fs/ocfs2/aops.c @@ -1659,7 +1659,7 @@ static int ocfs2_try_to_free_truncate_log(struct ocfs2_super *osb, goto out; } - if (jbd2_journal_start_commit(osb->journal->j_journal, &target)) { + if (jbd2_journal_start_commit(osb->journal->j_journal, &target, true)) { jbd2_log_wait_commit(osb->journal->j_journal, target); ret = 1; } diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index 38f986d..45d9f82 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -414,7 +414,7 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait) } if (jbd2_journal_start_commit(OCFS2_SB(sb)->journal->j_journal, - &target)) { + &target, wait)) { if (wait) jbd2_log_wait_commit(OCFS2_SB(sb)->journal->j_journal, target); diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index 27e79c2..46aaf45 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -631,11 +631,6 @@ struct transaction_s */ atomic_t t_handle_count; - /* - * This transaction is being forced and some process is - * waiting for it to finish. - */ - unsigned int t_synchronous_commit:1; unsigned int t_flushed_data_blocks:1; /* @@ -900,6 +895,13 @@ struct journal_s tid_t j_commit_request; /* + * Sequence number of the most recent transaction someone is waiting + * for to commit. + * [j_state_lock] + */ + tid_t j_commit_waited; + + /* * Journal uuid: identifies the object (filesystem, LVM volume etc) * backed by this journal. This will eventually be replaced by an array * of uuids, allowing us to index multiple devices within a single @@ -1200,9 +1202,9 @@ extern void jbd2_journal_switch_revoke_table(journal_t *journal); */ int __jbd2_log_space_left(journal_t *); /* Called with journal locked */ -int jbd2_log_start_commit(journal_t *journal, tid_t tid); -int __jbd2_log_start_commit(journal_t *journal, tid_t tid); -int jbd2_journal_start_commit(journal_t *journal, tid_t *tid); +int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait); +int __jbd2_log_start_commit(journal_t *journal, tid_t tid, bool will_wait); +int jbd2_journal_start_commit(journal_t *journal, tid_t *tid, bool will_wait); int jbd2_journal_force_commit_nested(journal_t *journal); int jbd2_log_wait_commit(journal_t *journal, tid_t tid); int jbd2_log_do_checkpoint(journal_t *journal); -- 1.7.1 [-- Attachment #3: 0002-jbd-Refine-commit-writeout-logic.patch --] [-- Type: text/x-patch, Size: 9843 bytes --] >From 4334f23305e74c6b050c701d97803790dc04cbe0 Mon Sep 17 00:00:00 2001 From: Jan Kara <jack@suse.cz> Date: Mon, 21 Feb 2011 17:25:37 +0100 Subject: [PATCH 2/2] jbd: Refine commit writeout logic Currently we write out all journal buffers in WRITE_SYNC mode. This improves performance for fsync heavy workloads but hinders performance when writes are mostly asynchronous. So add possibility for callers starting a transaction commit to specify whether they are going to wait for the commit and submit journal writes in WRITE_SYNC mode only in that case. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext3/fsync.c | 2 +- fs/ext3/super.c | 2 +- fs/jbd/checkpoint.c | 2 +- fs/jbd/commit.c | 4 ++-- fs/jbd/journal.c | 19 ++++++++++--------- fs/jbd/transaction.c | 9 ++++----- include/linux/jbd.h | 21 ++++++++++++--------- 7 files changed, 31 insertions(+), 28 deletions(-) diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c index 09b13bb..5396dd6 100644 --- a/fs/ext3/fsync.c +++ b/fs/ext3/fsync.c @@ -81,7 +81,7 @@ int ext3_sync_file(struct file *file, int datasync) if (test_opt(inode->i_sb, BARRIER) && !journal_trans_will_send_data_barrier(journal, commit_tid)) needs_barrier = 1; - log_start_commit(journal, commit_tid); + log_start_commit(journal, commit_tid, true); ret = log_wait_commit(journal, commit_tid); /* diff --git a/fs/ext3/super.c b/fs/ext3/super.c index 85c8cc8..58a4424 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -2497,7 +2497,7 @@ static int ext3_sync_fs(struct super_block *sb, int wait) { tid_t target; - if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) { + if (journal_start_commit(EXT3_SB(sb)->s_journal, &target, true)) { if (wait) log_wait_commit(EXT3_SB(sb)->s_journal, target); } diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c index e4b87bc..26ca2c3 100644 --- a/fs/jbd/checkpoint.c +++ b/fs/jbd/checkpoint.c @@ -297,7 +297,7 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh, spin_unlock(&journal->j_list_lock); jbd_unlock_bh_state(bh); - log_start_commit(journal, tid); + log_start_commit(journal, tid, true); log_wait_commit(journal, tid); ret = 1; } else if (!buffer_dirty(bh)) { diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c index 34a4861..ac188cb 100644 --- a/fs/jbd/commit.c +++ b/fs/jbd/commit.c @@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal) int first_tag = 0; int tag_flag; int i; - int write_op = WRITE_SYNC; + int write_op = WRITE; /* * First job: lock down the current transaction and wait for @@ -332,7 +332,7 @@ void journal_commit_transaction(journal_t *journal) * we unplug the device. We don't do explicit unplugging in here, * instead we rely on sync_buffer() doing the unplug for us. */ - if (commit_transaction->t_synchronous_commit) + if (tid_geq(journal->j_commit_waited, commit_transaction->t_tid)) write_op = WRITE_SYNC_PLUG; spin_lock(&commit_transaction->t_handle_lock); while (commit_transaction->t_updates) { diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c index da1b5e4..5743b4c 100644 --- a/fs/jbd/journal.c +++ b/fs/jbd/journal.c @@ -434,7 +434,7 @@ int __log_space_left(journal_t *journal) /* * Called under j_state_lock. Returns true if a transaction commit was started. */ -int __log_start_commit(journal_t *journal, tid_t target) +int __log_start_commit(journal_t *journal, tid_t target, bool will_wait) { /* * Are we already doing a recent enough commit? @@ -444,7 +444,8 @@ int __log_start_commit(journal_t *journal, tid_t target) * We want a new commit: OK, mark the request and wakeup the * commit thread. We do _not_ do the commit ourselves. */ - + if (will_wait && !tid_geq(journal->j_commit_waited, target)) + journal->j_commit_waited = target; journal->j_commit_request = target; jbd_debug(1, "JBD: requesting commit %d/%d\n", journal->j_commit_request, @@ -455,12 +456,12 @@ int __log_start_commit(journal_t *journal, tid_t target) return 0; } -int log_start_commit(journal_t *journal, tid_t tid) +int log_start_commit(journal_t *journal, tid_t tid, bool will_wait) { int ret; spin_lock(&journal->j_state_lock); - ret = __log_start_commit(journal, tid); + ret = __log_start_commit(journal, tid, will_wait); spin_unlock(&journal->j_state_lock); return ret; } @@ -483,7 +484,7 @@ int journal_force_commit_nested(journal_t *journal) spin_lock(&journal->j_state_lock); if (journal->j_running_transaction && !current->journal_info) { transaction = journal->j_running_transaction; - __log_start_commit(journal, transaction->t_tid); + __log_start_commit(journal, transaction->t_tid, true); } else if (journal->j_committing_transaction) transaction = journal->j_committing_transaction; @@ -503,7 +504,7 @@ int journal_force_commit_nested(journal_t *journal) * if a transaction is going to be committed (or is currently already * committing), and fills its tid in at *ptid */ -int journal_start_commit(journal_t *journal, tid_t *ptid) +int journal_start_commit(journal_t *journal, tid_t *ptid, bool will_wait) { int ret = 0; @@ -511,7 +512,7 @@ int journal_start_commit(journal_t *journal, tid_t *ptid) if (journal->j_running_transaction) { tid_t tid = journal->j_running_transaction->t_tid; - __log_start_commit(journal, tid); + __log_start_commit(journal, tid, will_wait); /* There's a running transaction and we've just made sure * it's commit has been scheduled. */ if (ptid) @@ -1439,7 +1440,7 @@ int journal_flush(journal_t *journal) /* Force everything buffered to the log... */ if (journal->j_running_transaction) { transaction = journal->j_running_transaction; - __log_start_commit(journal, transaction->t_tid); + __log_start_commit(journal, transaction->t_tid, true); } else if (journal->j_committing_transaction) transaction = journal->j_committing_transaction; @@ -1573,7 +1574,7 @@ static void __journal_abort_hard(journal_t *journal) journal->j_flags |= JFS_ABORT; transaction = journal->j_running_transaction; if (transaction) - __log_start_commit(journal, transaction->t_tid); + __log_start_commit(journal, transaction->t_tid, false); spin_unlock(&journal->j_state_lock); } diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c index 5b2e4c3..a12c0a3 100644 --- a/fs/jbd/transaction.c +++ b/fs/jbd/transaction.c @@ -178,7 +178,7 @@ repeat_locked: spin_unlock(&transaction->t_handle_lock); prepare_to_wait(&journal->j_wait_transaction_locked, &wait, TASK_UNINTERRUPTIBLE); - __log_start_commit(journal, transaction->t_tid); + __log_start_commit(journal, transaction->t_tid, false); spin_unlock(&journal->j_state_lock); schedule(); finish_wait(&journal->j_wait_transaction_locked, &wait); @@ -411,7 +411,7 @@ int journal_restart(handle_t *handle, int nblocks) spin_unlock(&transaction->t_handle_lock); jbd_debug(2, "restarting handle %p\n", handle); - __log_start_commit(journal, transaction->t_tid); + __log_start_commit(journal, transaction->t_tid, false); spin_unlock(&journal->j_state_lock); lock_map_release(&handle->h_lockdep_map); @@ -1431,8 +1431,6 @@ int journal_stop(handle_t *handle) } } - if (handle->h_sync) - transaction->t_synchronous_commit = 1; current->journal_info = NULL; spin_lock(&journal->j_state_lock); spin_lock(&transaction->t_handle_lock); @@ -1463,7 +1461,8 @@ int journal_stop(handle_t *handle) jbd_debug(2, "transaction too old, requesting commit for " "handle %p\n", handle); /* This is non-blocking */ - __log_start_commit(journal, transaction->t_tid); + __log_start_commit(journal, transaction->t_tid, + handle->h_sync && !(current->flags & PF_MEMALLOC)); spin_unlock(&journal->j_state_lock); /* diff --git a/include/linux/jbd.h b/include/linux/jbd.h index e069650..c38f73e 100644 --- a/include/linux/jbd.h +++ b/include/linux/jbd.h @@ -541,12 +541,6 @@ struct transaction_s * How many handles used this transaction? [t_handle_lock] */ int t_handle_count; - - /* - * This transaction is being forced and some process is - * waiting for it to finish. - */ - unsigned int t_synchronous_commit:1; }; /** @@ -594,6 +588,8 @@ struct transaction_s * transaction * @j_commit_request: Sequence number of the most recent transaction wanting * commit + * @j_commit_waited: Sequence number of the most recent transaction someone + * is waiting for to commit. * @j_uuid: Uuid of client object. * @j_task: Pointer to the current commit thread for this journal * @j_max_transaction_buffers: Maximum number of metadata buffers to allow in a @@ -762,6 +758,13 @@ struct journal_s tid_t j_commit_request; /* + * Sequence number of the most recent transaction someone is waiting + * for to commit. + * [j_state_lock] + */ + tid_t j_commit_waited; + + /* * Journal uuid: identifies the object (filesystem, LVM volume etc) * backed by this journal. This will eventually be replaced by an array * of uuids, allowing us to index multiple devices within a single @@ -985,9 +988,9 @@ extern void journal_switch_revoke_table(journal_t *journal); */ int __log_space_left(journal_t *); /* Called with journal locked */ -int log_start_commit(journal_t *journal, tid_t tid); -int __log_start_commit(journal_t *journal, tid_t tid); -int journal_start_commit(journal_t *journal, tid_t *tid); +int log_start_commit(journal_t *journal, tid_t tid, bool will_wait); +int __log_start_commit(journal_t *journal, tid_t tid, bool will_wait); +int journal_start_commit(journal_t *journal, tid_t *tid, bool will_wait); int journal_force_commit_nested(journal_t *journal); int log_wait_commit(journal_t *journal, tid_t tid); int log_do_checkpoint(journal_t *journal); -- 1.7.1 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-03-02 21:17 ` Jan Kara @ 2011-03-02 21:20 ` Jeff Moyer 2011-03-03 1:14 ` Jeff Moyer 1 sibling, 0 replies; 38+ messages in thread From: Jeff Moyer @ 2011-03-02 21:20 UTC (permalink / raw) To: Jan Kara Cc: Corrado Zoccolo, Alex,Shi, Li, Shaohua, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C Jan Kara <jack@suse.cz> writes: > On Wed 02-03-11 11:13:53, Jeff Moyer wrote: >> Jan Kara <jack@suse.cz> writes: >> > On Tue 01-03-11 14:56:43, Jeff Moyer wrote: >> >> Corrado Zoccolo <czoccolo@gmail.com> writes: >> >> >> >> > On Thu, Feb 24, 2011 at 1:13 PM, Jan Kara <jack@suse.cz> wrote: >> >> >> On Wed 23-02-11 16:24:47, Alex,Shi wrote: >> >> >>> Though these patches can not totally recovered the problem, but they are >> >> >>> quite helpful with ccache enabled situation. It increase 10% performance >> >> >>> on 38-rc1 kernel. >> >> >> OK and what was the original performance drop with WRITE_SYNC change? >> >> >> >> >> >>> I have tried to enabled they to latest rc6 kernel but failed. the vmstat output is here: >> >> >>> with patches: >> >> >> I'm attaching patches rebased on top of latest Linus's tree. >> >> >> Corrado, could you possibly run your fsync-heavy tests so that we see >> >> >> whether there isn't negative impact of my patches on your fsync-heavy >> >> >> workload? Thanks. >> >> > The workload was actually Jeff's, and the stalls that my change tried >> >> > to mitigate showed up on his enterprise class storage. Adding him so >> >> > he can test it. >> >> >> >> Sorry for the late reply. You can use either fs_mark or iozone to >> >> generate an fsync-heavy workload. The test I did was to mix this with a >> >> sequential reader. If you can point me at patches, I should be able to >> >> test this. >> > The latest version of patches is attached to: >> > https://lkml.org/lkml/2011/2/24/125 >> >> Perhaps you should fix up the merge conflicts, first? ;-) >> >> +<<<<<<< HEAD >> tid = transaction->t_tid; >> need_to_start = !tid_geq(journal->j_commit_request, tid); >> +======= >> + __jbd2_log_start_commit(journal, transaction->t_tid, false); >> +>>>>>>> jbd2: Refine commit writeout logic > Doh, how embarrassing ;). Attached is a new version which compiles and > seems to run OK. > > Honza Thanks, Jan. I should have results for you tomorrow. Cheers, Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-03-02 21:17 ` Jan Kara 2011-03-02 21:20 ` Jeff Moyer @ 2011-03-03 1:14 ` Jeff Moyer 2011-03-04 15:32 ` Jan Kara 1 sibling, 1 reply; 38+ messages in thread From: Jeff Moyer @ 2011-03-03 1:14 UTC (permalink / raw) To: Jan Kara Cc: Corrado Zoccolo, Alex,Shi, Li, Shaohua, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C Hi, Jan, So, the results are in. The test workload is an fs_mark process writing out 64k files and fsyncing each file after it's written. Concurrently with this is a fio job running a buffered sequential reader (bsr). Each data point is the average of 10 runs, after throwing out the first run. File system mount options are left at their defaults, which means that barriers are on. The storage is an HP EVA, connected to the host via a single 4Gb FC path. ext3 looks marginally better with your patches. We get better files/sec AND better throughput from the buffered reader. For ext4, the results are less encouraging. We see a drop in files/sec, and an increase in throughput for the sequential reader. So, the fsync-ing is being starved a bit more than before. || ext3 || ext4 || || fs_mark | fio bsr || fs_mark | fio bsr || --------++---------+---------++---------+---------|| vanilla || 517.535 | 178187 || 408.547 | 277130 || patched || 540.34 | 182312 || 342.813 | 294655 || ==================================================== %diff || +4.4% | +2.3% || -16.1% | +6.3% || I'm tired right now, but I'll have a look at your ext4 patch in the morning and see if I can come up with a good reason for this drop. Let me know if you have any questions. Cheers, Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-03-03 1:14 ` Jeff Moyer @ 2011-03-04 15:32 ` Jan Kara 2011-03-04 15:40 ` Jeff Moyer 2011-03-04 15:50 ` Jeff Moyer 0 siblings, 2 replies; 38+ messages in thread From: Jan Kara @ 2011-03-04 15:32 UTC (permalink / raw) To: Jeff Moyer Cc: Jan Kara, Corrado Zoccolo, Alex,Shi, Li, Shaohua, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C Hi Jeff, On Wed 02-03-11 20:14:13, Jeff Moyer wrote: > So, the results are in. The test workload is an fs_mark process writing > out 64k files and fsyncing each file after it's written. Concurrently > with this is a fio job running a buffered sequential reader (bsr). Each > data point is the average of 10 runs, after throwing out the first run. > File system mount options are left at their defaults, which means that > barriers are on. The storage is an HP EVA, connected to the host via a > single 4Gb FC path. Thanks a lot for testing! BTW: fs_mark runs in a single thread or do you use more threads? > ext3 looks marginally better with your patches. We get better files/sec > AND better throughput from the buffered reader. For ext4, the results > are less encouraging. We see a drop in files/sec, and an increase in > throughput for the sequential reader. So, the fsync-ing is being > starved a bit more than before. > > || ext3 || ext4 || > || fs_mark | fio bsr || fs_mark | fio bsr || > --------++---------+---------++---------+---------|| > vanilla || 517.535 | 178187 || 408.547 | 277130 || > patched || 540.34 | 182312 || 342.813 | 294655 || > ==================================================== > %diff || +4.4% | +2.3% || -16.1% | +6.3% || Interesting. I'm surprised ext3 and ext4 results differ this much. I'm more than happy with ext3 results since I just wanted to verify that fsync load doesn't degrade too much with the improved logic preferring non-fsync load more than we used to. I'm not so happy with ext4 results. The difference between ext3 and ext4 might be that amount of data written by kjournald in ext3 is considerably larger if it ends up pushing out data (because of data=ordered mode) as well. With ext4, all data are written by filemap_fdatawrite() from fsync because of delayed allocation. And thus maybe for ext4 WRITE_SYNC_PLUG is hurting us with your fast storage and small amount of written data? With WRITE_SYNC, data would be already on it's way to storage before we get to wait for them... Or it could be that we really send more data in WRITE mode rather than in WRITE_SYNC mode with the patch on ext4 (that should be verifiable with blktrace). But I wonder how that could happen... Bye Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-03-04 15:32 ` Jan Kara @ 2011-03-04 15:40 ` Jeff Moyer 2011-03-04 15:50 ` Jeff Moyer 1 sibling, 0 replies; 38+ messages in thread From: Jeff Moyer @ 2011-03-04 15:40 UTC (permalink / raw) To: Jan Kara Cc: Corrado Zoccolo, Alex,Shi, Li, Shaohua, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C Jan Kara <jack@suse.cz> writes: > Hi Jeff, > On Wed 02-03-11 20:14:13, Jeff Moyer wrote: >> So, the results are in. The test workload is an fs_mark process writing >> out 64k files and fsyncing each file after it's written. Concurrently >> with this is a fio job running a buffered sequential reader (bsr). Each >> data point is the average of 10 runs, after throwing out the first run. >> File system mount options are left at their defaults, which means that >> barriers are on. The storage is an HP EVA, connected to the host via a >> single 4Gb FC path. > Thanks a lot for testing! BTW: fs_mark runs in a single thread or do you > use more threads? I use a single fs_mark thread. FWIW, I also tested just fs_mark, and those numbers look good. >> ext3 looks marginally better with your patches. We get better files/sec >> AND better throughput from the buffered reader. For ext4, the results >> are less encouraging. We see a drop in files/sec, and an increase in >> throughput for the sequential reader. So, the fsync-ing is being >> starved a bit more than before. >> >> || ext3 || ext4 || >> || fs_mark | fio bsr || fs_mark | fio bsr || >> --------++---------+---------++---------+---------|| >> vanilla || 517.535 | 178187 || 408.547 | 277130 || >> patched || 540.34 | 182312 || 342.813 | 294655 || >> ==================================================== >> %diff || +4.4% | +2.3% || -16.1% | +6.3% || > Interesting. I'm surprised ext3 and ext4 results differ this much. I'm more > than happy with ext3 results since I just wanted to verify that fsync load > doesn't degrade too much with the improved logic preferring non-fsync load > more than we used to. > > I'm not so happy with ext4 results. The difference between ext3 and ext4 > might be that amount of data written by kjournald in ext3 is considerably > larger if it ends up pushing out data (because of data=ordered mode) as > well. With ext4, all data are written by filemap_fdatawrite() from fsync > because of delayed allocation. And thus maybe for ext4 WRITE_SYNC_PLUG > is hurting us with your fast storage and small amount of written data? With > WRITE_SYNC, data would be already on it's way to storage before we get to > wait for them... > > Or it could be that we really send more data in WRITE mode rather than in > WRITE_SYNC mode with the patch on ext4 (that should be verifiable with > blktrace). But I wonder how that could happen... Yeah, I've collected blktrace data and I'll get to evaluating that. Sorry, I ran out of time yesterday. Cheers, Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-03-04 15:32 ` Jan Kara 2011-03-04 15:40 ` Jeff Moyer @ 2011-03-04 15:50 ` Jeff Moyer 2011-03-04 18:27 ` Jeff Moyer 1 sibling, 1 reply; 38+ messages in thread From: Jeff Moyer @ 2011-03-04 15:50 UTC (permalink / raw) To: Jan Kara Cc: Corrado Zoccolo, Alex,Shi, Li, Shaohua, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C Jan Kara <jack@suse.cz> writes: > I'm not so happy with ext4 results. The difference between ext3 and ext4 > might be that amount of data written by kjournald in ext3 is considerably > larger if it ends up pushing out data (because of data=ordered mode) as > well. With ext4, all data are written by filemap_fdatawrite() from fsync > because of delayed allocation. And thus maybe for ext4 WRITE_SYNC_PLUG > is hurting us with your fast storage and small amount of written data? With > WRITE_SYNC, data would be already on it's way to storage before we get to > wait for them... > Or it could be that we really send more data in WRITE mode rather than in > WRITE_SYNC mode with the patch on ext4 (that should be verifiable with > blktrace). But I wonder how that could happen... It looks like this is the case, the I/O isn't coming down as synchronous. I'm seeing a lot of writes, very few write sync's, which means that the write stream will be preempted by the incoming reads. Time to audit that fsync path and make sure it's marked properly, I guess. Cheers, Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-03-04 15:50 ` Jeff Moyer @ 2011-03-04 18:27 ` Jeff Moyer 2011-03-22 7:38 ` Alex,Shi 0 siblings, 1 reply; 38+ messages in thread From: Jeff Moyer @ 2011-03-04 18:27 UTC (permalink / raw) To: Jan Kara Cc: Corrado Zoccolo, Alex,Shi, Li, Shaohua, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C Jeff Moyer <jmoyer@redhat.com> writes: > Jan Kara <jack@suse.cz> writes: > >> I'm not so happy with ext4 results. The difference between ext3 and ext4 >> might be that amount of data written by kjournald in ext3 is considerably >> larger if it ends up pushing out data (because of data=ordered mode) as >> well. With ext4, all data are written by filemap_fdatawrite() from fsync >> because of delayed allocation. And thus maybe for ext4 WRITE_SYNC_PLUG >> is hurting us with your fast storage and small amount of written data? With >> WRITE_SYNC, data would be already on it's way to storage before we get to >> wait for them... > >> Or it could be that we really send more data in WRITE mode rather than in >> WRITE_SYNC mode with the patch on ext4 (that should be verifiable with >> blktrace). But I wonder how that could happen... > > It looks like this is the case, the I/O isn't coming down as > synchronous. I'm seeing a lot of writes, very few write sync's, which > means that the write stream will be preempted by the incoming reads. > > Time to audit that fsync path and make sure it's marked properly, I > guess. OK, I spoke too soon. Here's the blktrace summary information (I re-ran the tests using 3 samples, the blktrace is from the last run of the three in each case): Vanilla ------- fs_mark: 307.288 files/sec fio: 286509 KB/s Total (sde): Reads Queued: 341,558, 84,994MiB Writes Queued: 1,561K, 6,244MiB Read Dispatches: 341,493, 84,994MiB Write Dispatches: 648,046, 6,244MiB Reads Requeued: 0 Writes Requeued: 27 Reads Completed: 341,491, 84,994MiB Writes Completed: 648,021, 6,244MiB Read Merges: 65, 2,780KiB Write Merges: 913,076, 3,652MiB IO unplugs: 578,102 Timer unplugs: 0 Throughput (R/W): 282,797KiB/s / 20,776KiB/s Events (sde): 16,724,303 entries Patched ------- fs_mark: 278.587 files/sec fio: 298007 KB/s Total (sde): Reads Queued: 345,407, 86,834MiB Writes Queued: 1,566K, 6,264MiB Read Dispatches: 345,391, 86,834MiB Write Dispatches: 327,404, 6,264MiB Reads Requeued: 0 Writes Requeued: 33 Reads Completed: 345,391, 86,834MiB Writes Completed: 327,371, 6,264MiB Read Merges: 16, 1,576KiB Write Merges: 1,238K, 4,954MiB IO unplugs: 580,308 Timer unplugs: 0 Throughput (R/W): 288,771KiB/s / 20,832KiB/s Events (sde): 14,030,610 entries So, it appears we flush out writes much more aggressively without the patch in place. I'm not sure why the write bandwidth looks to be higher in the patched case... odd. Cheers, Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-03-04 18:27 ` Jeff Moyer @ 2011-03-22 7:38 ` Alex,Shi 2011-03-22 16:14 ` Jan Kara 0 siblings, 1 reply; 38+ messages in thread From: Alex,Shi @ 2011-03-22 7:38 UTC (permalink / raw) To: Jeff Moyer Cc: Jan Kara, Corrado Zoccolo, Li, Shaohua, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C On Sat, 2011-03-05 at 02:27 +0800, Jeff Moyer wrote: > Jeff Moyer <jmoyer@redhat.com> writes: > > > Jan Kara <jack@suse.cz> writes: > > > >> I'm not so happy with ext4 results. The difference between ext3 and ext4 > >> might be that amount of data written by kjournald in ext3 is considerably > >> larger if it ends up pushing out data (because of data=ordered mode) as > >> well. With ext4, all data are written by filemap_fdatawrite() from fsync > >> because of delayed allocation. And thus maybe for ext4 WRITE_SYNC_PLUG > >> is hurting us with your fast storage and small amount of written data? With > >> WRITE_SYNC, data would be already on it's way to storage before we get to > >> wait for them... > > > >> Or it could be that we really send more data in WRITE mode rather than in > >> WRITE_SYNC mode with the patch on ext4 (that should be verifiable with > >> blktrace). But I wonder how that could happen... > > > > It looks like this is the case, the I/O isn't coming down as > > synchronous. I'm seeing a lot of writes, very few write sync's, which > > means that the write stream will be preempted by the incoming reads. > > > > Time to audit that fsync path and make sure it's marked properly, I > > guess. > > OK, I spoke too soon. Here's the blktrace summary information (I re-ran > the tests using 3 samples, the blktrace is from the last run of the > three in each case): > > Vanilla > ------- > fs_mark: 307.288 files/sec > fio: 286509 KB/s > > Total (sde): > Reads Queued: 341,558, 84,994MiB Writes Queued: 1,561K, 6,244MiB > Read Dispatches: 341,493, 84,994MiB Write Dispatches: 648,046, 6,244MiB > Reads Requeued: 0 Writes Requeued: 27 > Reads Completed: 341,491, 84,994MiB Writes Completed: 648,021, 6,244MiB > Read Merges: 65, 2,780KiB Write Merges: 913,076, 3,652MiB > IO unplugs: 578,102 Timer unplugs: 0 > > Throughput (R/W): 282,797KiB/s / 20,776KiB/s > Events (sde): 16,724,303 entries > > Patched > ------- > fs_mark: 278.587 files/sec > fio: 298007 KB/s > > Total (sde): > Reads Queued: 345,407, 86,834MiB Writes Queued: 1,566K, 6,264MiB > Read Dispatches: 345,391, 86,834MiB Write Dispatches: 327,404, 6,264MiB > Reads Requeued: 0 Writes Requeued: 33 > Reads Completed: 345,391, 86,834MiB Writes Completed: 327,371, 6,264MiB > Read Merges: 16, 1,576KiB Write Merges: 1,238K, 4,954MiB > IO unplugs: 580,308 Timer unplugs: 0 > > Throughput (R/W): 288,771KiB/s / 20,832KiB/s > Events (sde): 14,030,610 entries > > So, it appears we flush out writes much more aggressively without the > patch in place. I'm not sure why the write bandwidth looks to be higher > in the patched case... odd. > Jan: Do you have new idea on this? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-03-22 7:38 ` Alex,Shi @ 2011-03-22 16:14 ` Jan Kara 2011-03-22 17:46 ` Jeff Moyer 0 siblings, 1 reply; 38+ messages in thread From: Jan Kara @ 2011-03-22 16:14 UTC (permalink / raw) To: Alex,Shi Cc: Jeff Moyer, Jan Kara, Corrado Zoccolo, Li, Shaohua, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C [-- Attachment #1: Type: text/plain, Size: 4560 bytes --] On Tue 22-03-11 15:38:19, Alex,Shi wrote: > On Sat, 2011-03-05 at 02:27 +0800, Jeff Moyer wrote: > > Jeff Moyer <jmoyer@redhat.com> writes: > > > > > Jan Kara <jack@suse.cz> writes: > > > > > >> I'm not so happy with ext4 results. The difference between ext3 and ext4 > > >> might be that amount of data written by kjournald in ext3 is considerably > > >> larger if it ends up pushing out data (because of data=ordered mode) as > > >> well. With ext4, all data are written by filemap_fdatawrite() from fsync > > >> because of delayed allocation. And thus maybe for ext4 WRITE_SYNC_PLUG > > >> is hurting us with your fast storage and small amount of written data? With > > >> WRITE_SYNC, data would be already on it's way to storage before we get to > > >> wait for them... > > > > > >> Or it could be that we really send more data in WRITE mode rather than in > > >> WRITE_SYNC mode with the patch on ext4 (that should be verifiable with > > >> blktrace). But I wonder how that could happen... > > > > > > It looks like this is the case, the I/O isn't coming down as > > > synchronous. I'm seeing a lot of writes, very few write sync's, which > > > means that the write stream will be preempted by the incoming reads. > > > > > > Time to audit that fsync path and make sure it's marked properly, I > > > guess. > > > > OK, I spoke too soon. Here's the blktrace summary information (I re-ran > > the tests using 3 samples, the blktrace is from the last run of the > > three in each case): > > > > Vanilla > > ------- > > fs_mark: 307.288 files/sec > > fio: 286509 KB/s > > > > Total (sde): > > Reads Queued: 341,558, 84,994MiB Writes Queued: 1,561K, 6,244MiB > > Read Dispatches: 341,493, 84,994MiB Write Dispatches: 648,046, 6,244MiB > > Reads Requeued: 0 Writes Requeued: 27 > > Reads Completed: 341,491, 84,994MiB Writes Completed: 648,021, 6,244MiB > > Read Merges: 65, 2,780KiB Write Merges: 913,076, 3,652MiB > > IO unplugs: 578,102 Timer unplugs: 0 > > > > Throughput (R/W): 282,797KiB/s / 20,776KiB/s > > Events (sde): 16,724,303 entries > > > > Patched > > ------- > > fs_mark: 278.587 files/sec > > fio: 298007 KB/s > > > > Total (sde): > > Reads Queued: 345,407, 86,834MiB Writes Queued: 1,566K, 6,264MiB > > Read Dispatches: 345,391, 86,834MiB Write Dispatches: 327,404, 6,264MiB > > Reads Requeued: 0 Writes Requeued: 33 > > Reads Completed: 345,391, 86,834MiB Writes Completed: 327,371, 6,264MiB > > Read Merges: 16, 1,576KiB Write Merges: 1,238K, 4,954MiB > > IO unplugs: 580,308 Timer unplugs: 0 > > > > Throughput (R/W): 288,771KiB/s / 20,832KiB/s > > Events (sde): 14,030,610 entries > > > > So, it appears we flush out writes much more aggressively without the > > patch in place. I'm not sure why the write bandwidth looks to be higher > > in the patched case... odd. > > Jan: > Do you have new idea on this? I was looking at the block traces for quite some time but I couldn't find the reason why fs_mark is slower with my patch. Actually, looking at the data now, I don't even understand how fs_mark can report lower files/sec values. Both block traces were taken for 300 seconds. From the above stats, we see that on both kernels, we wrote 6.2 GB over that time. Looking at more detailed stats I made, fs_mark processes wrote 4094 MB on vanilla kernel and 4107 MB on the patched kernel. Given that they just sequentially create and fsync 64 KB files, files/sec ratio should be about the same with both kernels. So I'm really missing how fs_mark arrives at different files/sec values or how with such different values it happens that the amount written is actually the same. Anyone has any idea? Looking at how fs_mark works and at differences in trace files - could it be that the difference is caused by a difference in how log files each fs_mark thread is writing are flushed? Or possibly by IO caused unlinking of created files somehow leaking to the time of the next measured fs_mark run in one case and not the other one? Jeff, I suppose the log files of fs_mark processes are on the same device as the test directory, aren't they - that might explain the flusher thread doing IO? The patch below should limit the interactions. If you have time to test fs_mark with this patch applied - does it make any difference? Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR [-- Attachment #2: fs_mark.c.diff --] [-- Type: text/x-patch, Size: 462 bytes --] --- a/fs_mark.c.orig 2011-03-22 17:04:52.194716633 +0100 +++ b/fs_mark.c 2011-03-22 17:06:11.910716645 +0100 @@ -1353,6 +1353,11 @@ int main(int argc, char **argv, char **e print_iteration_stats(log_file_fp, &iteration_stats, files_written); loops_done++; + /* + * Flush dirty data to avoid interaction between unlink / log + * file handling and the next iteration + */ + sync(); } while (do_fill_fs || (loop_count > loops_done)); ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-03-22 16:14 ` Jan Kara @ 2011-03-22 17:46 ` Jeff Moyer 2011-03-24 6:45 ` Alex,Shi 2011-03-28 19:48 ` Jan Kara 0 siblings, 2 replies; 38+ messages in thread From: Jeff Moyer @ 2011-03-22 17:46 UTC (permalink / raw) To: Jan Kara Cc: Alex,Shi, Corrado Zoccolo, Li, Shaohua, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C Jan Kara <jack@suse.cz> writes: > Jeff, I suppose the log files of fs_mark processes are on the same > device as the test directory, aren't they - that might explain the > flusher thread doing IO? The patch below No. My test environments are sane. The file system is on a completely separate disk and it is unmounted and remounted in between each run. That should take care of any lost flushes. Did you receive my last couple of private emails commenting on the trace data? Cheers, Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-03-22 17:46 ` Jeff Moyer @ 2011-03-24 6:45 ` Alex,Shi 2011-03-28 19:48 ` Jan Kara 1 sibling, 0 replies; 38+ messages in thread From: Alex,Shi @ 2011-03-24 6:45 UTC (permalink / raw) To: Jeff Moyer Cc: Jan Kara, Corrado Zoccolo, Li, Shaohua, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C On Wed, 2011-03-23 at 01:46 +0800, Jeff Moyer wrote: > Jan Kara <jack@suse.cz> writes: > > > Jeff, I suppose the log files of fs_mark processes are on the same > > device as the test directory, aren't they - that might explain the > > flusher thread doing IO? The patch below > > No. My test environments are sane. The file system is on a completely > separate disk and it is unmounted and remounted in between each run. > That should take care of any lost flushes. Jeff, why not give a try with Jan's fs_mark patch? Maybe the result show something change. :) > > Did you receive my last couple of private emails commenting on the trace > data? > > Cheers, > Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-03-22 17:46 ` Jeff Moyer 2011-03-24 6:45 ` Alex,Shi @ 2011-03-28 19:48 ` Jan Kara 1 sibling, 0 replies; 38+ messages in thread From: Jan Kara @ 2011-03-28 19:48 UTC (permalink / raw) To: Jeff Moyer Cc: Jan Kara, Alex,Shi, Corrado Zoccolo, Li, Shaohua, Vivek Goyal, tytso@mit.edu, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C On Tue 22-03-11 13:46:06, Jeff Moyer wrote: > Jan Kara <jack@suse.cz> writes: > > > Jeff, I suppose the log files of fs_mark processes are on the same > > device as the test directory, aren't they - that might explain the > > flusher thread doing IO? The patch below > > No. My test environments are sane. The file system is on a completely > separate disk and it is unmounted and remounted in between each run. > That should take care of any lost flushes. OK, I understand but let me clear out one thing. What fs_mark seems to be doing is: for given number of iterations { (-L option, default 1) fork each thread { start timer for i = 1 .. 10000 { (could be changed by option -n) write file i } stop timer for i = 1 .. 10000 { unlink file i } write statistics to log file exit } read all log files, write combined results to another log file } I see from blktrace data that indeed you are running more than one iteration of the fs_mark process so the problem I was wondering about is whether unlinking or writing of log files cannot interefere with the IO happening while the timer is running. Now log files are rather small so I don't really think they cause any problem but they could be the data flusher thread writes - they are stored in the current directory (unless you use -l option). So I my question was aiming at where these files are stored - do you specify -l option and if not, what is the current directory of fs_mark process in your setup? I think unlinks could possibly cause problems and that's why I suggested we add sync() to the parent process before it runs the next iteration... > Did you receive my last couple of private emails commenting on the trace > data? Yes, I did get them. I'll reply to them in a while. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-01-19 1:55 [performance bug] kernel building regression on 64 LCPUs machine Alex,Shi 2011-01-19 2:03 ` Shaohua Li @ 2011-01-19 14:32 ` Ted Ts'o 2011-01-20 2:12 ` Shaohua Li 2011-01-21 7:23 ` Corrado Zoccolo 2 siblings, 1 reply; 38+ messages in thread From: Ted Ts'o @ 2011-01-19 14:32 UTC (permalink / raw) To: Alex,Shi Cc: czoccolo, vgoyal, jaxboe, linux-kernel@vger.kernel.org, Li, Shaohua, Chen, Tim C On Wed, Jan 19, 2011 at 09:55:48AM +0800, Alex,Shi wrote: > Shaohua and I tested kernel building performance on latest kernel. and > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file > system. What kerenl version were you comparing against? And what were you using to benchmark this performance drop? I'm curious what your workload was; I'm guesisng it was some kind of mixed read/write workload? Were there any fsync()'s involved? Thanks, regards, - Ted ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-01-19 14:32 ` Ted Ts'o @ 2011-01-20 2:12 ` Shaohua Li 0 siblings, 0 replies; 38+ messages in thread From: Shaohua Li @ 2011-01-20 2:12 UTC (permalink / raw) To: Ted Ts'o Cc: Shi, Alex, czoccolo@gmail.com, vgoyal@redhat.com, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C On Wed, 2011-01-19 at 22:32 +0800, Ted Ts'o wrote: > On Wed, Jan 19, 2011 at 09:55:48AM +0800, Alex,Shi wrote: > > Shaohua and I tested kernel building performance on latest kernel. and > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file > > system. > > What kerenl version were you comparing against? And what were you > using to benchmark this performance drop? I'm curious what your > workload was; I'm guesisng it was some kind of mixed read/write > workload? Were there any fsync()'s involved? 2.6.36 vs 2.6.37. this exists in 2.6.37-rc1. it's a kbuild, no a mixed read/write workload. kbuild hasn't fsync involved Thanks, Shaohua ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-01-19 1:55 [performance bug] kernel building regression on 64 LCPUs machine Alex,Shi 2011-01-19 2:03 ` Shaohua Li 2011-01-19 14:32 ` Ted Ts'o @ 2011-01-21 7:23 ` Corrado Zoccolo 2011-01-21 7:47 ` Alex,Shi 2011-01-21 8:20 ` Shaohua Li 2 siblings, 2 replies; 38+ messages in thread From: Corrado Zoccolo @ 2011-01-21 7:23 UTC (permalink / raw) To: Alex,Shi Cc: vgoyal, jaxboe, linux-kernel@vger.kernel.org, Li, Shaohua, Chen, Tim C On Wed, Jan 19, 2011 at 2:55 AM, Alex,Shi <alex.shi@intel.com> wrote: > Shaohua and I tested kernel building performance on latest kernel. and > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file > system. We find this performance dropping is due to commit > 749ef9f8423054e326f. If we revert this patch or just change the > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be > recovered. > > iostat report show with the commit, read request merge number increased > and write request merge dropped. The total request size increased and > queue length dropped. So we tested another patch: only change WRITE_SYNC > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected. > > we didn't test deadline IO mode, just test cfq. seems insert write > request into sync queue effect much read performance, but we don't know > details. What's your comments of this? > > iostat of .37 kernel: > rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util > 22.5772 96.46 92.3742 14.747 1.0048 0.439474 34.8557 0.18078 3.8076 0.30447 2.94302 > iostat of commit reverted .37: > 26.6223 80.21 107.875 6.03538 1.51415 0 41.3275 0.153385 3.80569 0.377231 3.22323 >From these numbers, it seems to me that with the patch reverted, the write bandwidth is really low, and probably you are keeping most written files in the buffer cache during the whole compile, while the non-reverted kernel is making progress in writing out the files. So the 'improved' read bandwidth is due to unfairness w.r.t. writes. Does the result change if you add a final sync and time that as well, in order to see the full time to make it on disk? I think that in a more extreme test where you end up filling all the buffer cache with written data, you will see much longer stalls with the revert than without. Thanks, Corrado > > vmstat report show, read bandwidth dropping: > vmstat of .37: > r b swpd free buff cache si so bi bo in cs us sy id wa st > 3.4 52.6 0.0 64303937.0 16466.7 121544.5 0.0 0.0 2102.7 1914.6 7414.1 3185.7 2.0 1.0 80.3 16.7 0.0 > vmstat of revert all from .37 > 2.2 35.8 0.0 64306767.4 17265.6 126101.2 0.0 0.0 2415.8 1619.1 8532.2 3556.2 2.5 1.1 83.0 13.3 0.0 > > Regards > Alex > > === > diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c > index 34a4861..27ac2f3 100644 > --- a/fs/jbd/commit.c > +++ b/fs/jbd/commit.c > @@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal) > int first_tag = 0; > int tag_flag; > int i; > - int write_op = WRITE_SYNC; > + int write_op = WRITE; > > /* > * First job: lock down the current transaction and wait for > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c > index f3ad159..69ff08e 100644 > --- a/fs/jbd2/commit.c > +++ b/fs/jbd2/commit.c > @@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) > int tag_bytes = journal_tag_bytes(journal); > struct buffer_head *cbh = NULL; /* For transactional checksums */ > __u32 crc32_sum = ~0; > - int write_op = WRITE_SYNC; > + int write_op = WRITE; > > /* > * First job: lock down the current transaction and wait for > > > -- __________________________________________________________________________ dott. Corrado Zoccolo mailto:czoccolo@gmail.com PhD - Department of Computer Science - University of Pisa, Italy -------------------------------------------------------------------------- The self-confidence of a warrior is not the self-confidence of the average man. The average man seeks certainty in the eyes of the onlooker and calls that self-confidence. The warrior seeks impeccability in his own eyes and calls that humbleness. Tales of Power - C. Castaneda ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-01-21 7:23 ` Corrado Zoccolo @ 2011-01-21 7:47 ` Alex,Shi 2011-01-21 7:52 ` Alex,Shi 2011-01-21 8:20 ` Shaohua Li 1 sibling, 1 reply; 38+ messages in thread From: Alex,Shi @ 2011-01-21 7:47 UTC (permalink / raw) To: Corrado Zoccolo Cc: vgoyal@redhat.com, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Li, Shaohua, Chen, Tim C On Fri, 2011-01-21 at 15:23 +0800, Corrado Zoccolo wrote: > On Wed, Jan 19, 2011 at 2:55 AM, Alex,Shi <alex.shi@intel.com> wrote: > > Shaohua and I tested kernel building performance on latest kernel. and > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file > > system. We find this performance dropping is due to commit > > 749ef9f8423054e326f. If we revert this patch or just change the > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be > > recovered. > > > > iostat report show with the commit, read request merge number increased > > and write request merge dropped. The total request size increased and > > queue length dropped. So we tested another patch: only change WRITE_SYNC > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected. > > > > we didn't test deadline IO mode, just test cfq. seems insert write > > request into sync queue effect much read performance, but we don't know > > details. What's your comments of this? > > > > iostat of .37 kernel: > > rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util > > 22.5772 96.46 92.3742 14.747 1.0048 0.439474 34.8557 0.18078 3.8076 0.30447 2.94302 > > iostat of commit reverted .37: > > 26.6223 80.21 107.875 6.03538 1.51415 0 41.3275 0.153385 3.80569 0.377231 3.22323 > > From these numbers, it seems to me that with the patch reverted, the > write bandwidth is really low, and probably you are keeping most > written files in the buffer cache during the whole compile, while the > non-reverted kernel is making progress in writing out the files. So > the 'improved' read bandwidth is due to unfairness w.r.t. writes. > Does the result change if you add a final sync and time that as well, > in order to see the full time to make it on disk? Agree with your guess, but kbuild is such kind of benchmark, we can not change its behavior. :( > > I think that in a more extreme test where you end up filling all the > buffer cache with written data, you will see much longer stalls with > the revert than without. Have to do this? and if so, it is not kbuild. :) BTW, the Jan's patch has a little improvement on kbuild. In many time testing, it seems about 3% improving. The average iostat output of Jan's patch: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util 23.419726 87.450685 96.164521 6.748493 1.046438 0.370137 45.182192 0.200685 6.848767 0.394110 3.072192 > > Thanks, > Corrado > > > > > vmstat report show, read bandwidth dropping: > > vmstat of .37: > > r b swpd free buff cache si so bi bo in cs us sy id wa st > > 3.4 52.6 0.0 64303937.0 16466.7 121544.5 0.0 0.0 2102.7 1914.6 7414.1 3185.7 2.0 1.0 80.3 16.7 0.0 > > vmstat of revert all from .37 > > 2.2 35.8 0.0 64306767.4 17265.6 126101.2 0.0 0.0 2415.8 1619.1 8532.2 3556.2 2.5 1.1 83.0 13.3 0.0 > > > > Regards > > Alex > > > > === > > diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c > > index 34a4861..27ac2f3 100644 > > --- a/fs/jbd/commit.c > > +++ b/fs/jbd/commit.c > > @@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal) > > int first_tag = 0; > > int tag_flag; > > int i; > > - int write_op = WRITE_SYNC; > > + int write_op = WRITE; > > > > /* > > * First job: lock down the current transaction and wait for > > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c > > index f3ad159..69ff08e 100644 > > --- a/fs/jbd2/commit.c > > +++ b/fs/jbd2/commit.c > > @@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) > > int tag_bytes = journal_tag_bytes(journal); > > struct buffer_head *cbh = NULL; /* For transactional checksums */ > > __u32 crc32_sum = ~0; > > - int write_op = WRITE_SYNC; > > + int write_op = WRITE; > > > > /* > > * First job: lock down the current transaction and wait for > > > > > > > > > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-01-21 7:47 ` Alex,Shi @ 2011-01-21 7:52 ` Alex,Shi 2011-01-21 8:13 ` Corrado Zoccolo 0 siblings, 1 reply; 38+ messages in thread From: Alex,Shi @ 2011-01-21 7:52 UTC (permalink / raw) To: Corrado Zoccolo, Jan Kara, tytso@mit.edu Cc: vgoyal@redhat.com, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Li, Shaohua, Chen, Tim C Sorry for forgetting Jan and tytso. added again. On Fri, 2011-01-21 at 15:47 +0800, Alex,Shi wrote: > On Fri, 2011-01-21 at 15:23 +0800, Corrado Zoccolo wrote: > > On Wed, Jan 19, 2011 at 2:55 AM, Alex,Shi <alex.shi@intel.com> wrote: > > > Shaohua and I tested kernel building performance on latest kernel. and > > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file > > > system. We find this performance dropping is due to commit > > > 749ef9f8423054e326f. If we revert this patch or just change the > > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be > > > recovered. > > > > > > iostat report show with the commit, read request merge number increased > > > and write request merge dropped. The total request size increased and > > > queue length dropped. So we tested another patch: only change WRITE_SYNC > > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected. > > > > > > we didn't test deadline IO mode, just test cfq. seems insert write > > > request into sync queue effect much read performance, but we don't know > > > details. What's your comments of this? > > > > > > iostat of .37 kernel: > > > rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util > > > 22.5772 96.46 92.3742 14.747 1.0048 0.439474 34.8557 0.18078 3.8076 0.30447 2.94302 > > > iostat of commit reverted .37: > > > 26.6223 80.21 107.875 6.03538 1.51415 0 41.3275 0.153385 3.80569 0.377231 3.22323 > > > > From these numbers, it seems to me that with the patch reverted, the > > write bandwidth is really low, and probably you are keeping most > > written files in the buffer cache during the whole compile, while the > > non-reverted kernel is making progress in writing out the files. So > > the 'improved' read bandwidth is due to unfairness w.r.t. writes. > > Does the result change if you add a final sync and time that as well, > > in order to see the full time to make it on disk? > > Agree with your guess, but kbuild is such kind of benchmark, we can not > change its behavior. :( > > > > > > I think that in a more extreme test where you end up filling all the > > buffer cache with written data, you will see much longer stalls with > > the revert than without. > > Have to do this? and if so, it is not kbuild. :) > > BTW, the Jan's patch has a little improvement on kbuild. In many time > testing, it seems about 3% improving. > > The average iostat output of Jan's patch: > rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util > 23.419726 87.450685 96.164521 6.748493 1.046438 0.370137 45.182192 0.200685 6.848767 0.394110 3.072192 > > > > > Thanks, > > Corrado > > > > > > > > vmstat report show, read bandwidth dropping: > > > vmstat of .37: > > > r b swpd free buff cache si so bi bo in cs us sy id wa st > > > 3.4 52.6 0.0 64303937.0 16466.7 121544.5 0.0 0.0 2102.7 1914.6 7414.1 3185.7 2.0 1.0 80.3 16.7 0.0 > > > vmstat of revert all from .37 > > > 2.2 35.8 0.0 64306767.4 17265.6 126101.2 0.0 0.0 2415.8 1619.1 8532.2 3556.2 2.5 1.1 83.0 13.3 0.0 > > > > > > Regards > > > Alex > > > > > > === > > > diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c > > > index 34a4861..27ac2f3 100644 > > > --- a/fs/jbd/commit.c > > > +++ b/fs/jbd/commit.c > > > @@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal) > > > int first_tag = 0; > > > int tag_flag; > > > int i; > > > - int write_op = WRITE_SYNC; > > > + int write_op = WRITE; > > > > > > /* > > > * First job: lock down the current transaction and wait for > > > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c > > > index f3ad159..69ff08e 100644 > > > --- a/fs/jbd2/commit.c > > > +++ b/fs/jbd2/commit.c > > > @@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) > > > int tag_bytes = journal_tag_bytes(journal); > > > struct buffer_head *cbh = NULL; /* For transactional checksums */ > > > __u32 crc32_sum = ~0; > > > - int write_op = WRITE_SYNC; > > > + int write_op = WRITE; > > > > > > /* > > > * First job: lock down the current transaction and wait for > > > > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-01-21 7:52 ` Alex,Shi @ 2011-01-21 8:13 ` Corrado Zoccolo 0 siblings, 0 replies; 38+ messages in thread From: Corrado Zoccolo @ 2011-01-21 8:13 UTC (permalink / raw) To: Alex,Shi Cc: Jan Kara, tytso@mit.edu, vgoyal@redhat.com, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Li, Shaohua, Chen, Tim C On Fri, Jan 21, 2011 at 8:52 AM, Alex,Shi <alex.shi@intel.com> wrote: > Sorry for forgetting Jan and tytso. added again. > > On Fri, 2011-01-21 at 15:47 +0800, Alex,Shi wrote: >> On Fri, 2011-01-21 at 15:23 +0800, Corrado Zoccolo wrote: >> > On Wed, Jan 19, 2011 at 2:55 AM, Alex,Shi <alex.shi@intel.com> wrote: >> > > Shaohua and I tested kernel building performance on latest kernel. and >> > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file >> > > system. We find this performance dropping is due to commit >> > > 749ef9f8423054e326f. If we revert this patch or just change the >> > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be >> > > recovered. >> > > >> > > iostat report show with the commit, read request merge number increased >> > > and write request merge dropped. The total request size increased and >> > > queue length dropped. So we tested another patch: only change WRITE_SYNC >> > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected. >> > > >> > > we didn't test deadline IO mode, just test cfq. seems insert write >> > > request into sync queue effect much read performance, but we don't know >> > > details. What's your comments of this? >> > > >> > > iostat of .37 kernel: >> > > rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util >> > > 22.5772 96.46 92.3742 14.747 1.0048 0.439474 34.8557 0.18078 3.8076 0.30447 2.94302 >> > > iostat of commit reverted .37: >> > > 26.6223 80.21 107.875 6.03538 1.51415 0 41.3275 0.153385 3.80569 0.377231 3.22323 >> > >> > From these numbers, it seems to me that with the patch reverted, the >> > write bandwidth is really low, and probably you are keeping most >> > written files in the buffer cache during the whole compile, while the >> > non-reverted kernel is making progress in writing out the files. So >> > the 'improved' read bandwidth is due to unfairness w.r.t. writes. >> > Does the result change if you add a final sync and time that as well, >> > in order to see the full time to make it on disk? >> >> Agree with your guess, but kbuild is such kind of benchmark, we can not >> change its behavior. :( Why not? Benchmarks should try to model real workloads, and after I build a kernel, I run lilo, that performs a sync (and typically takes several seconds). If I used grub, I would pay the sync while unmounting the filesystem, but I would experience the multi-second delay anyway. >> >> >> > >> > I think that in a more extreme test where you end up filling all the >> > buffer cache with written data, you will see much longer stalls with >> > the revert than without. >> >> Have to do this? and if so, it is not kbuild. :) It might be, on a machine with fewer ram. Can you try running with just 1G of ram? >> >> BTW, the Jan's patch has a little improvement on kbuild. In many time >> testing, it seems about 3% improving. The fact is, does it improve (or at least is a noop) also on constrained memory systems? >> The average iostat output of Jan's patch: >> rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util >> 23.419726 87.450685 96.164521 6.748493 1.046438 0.370137 45.182192 0.200685 6.848767 0.394110 3.072192 >> >> > >> > Thanks, >> > Corrado >> > >> > > >> > > vmstat report show, read bandwidth dropping: >> > > vmstat of .37: >> > > r b swpd free buff cache si so bi bo in cs us sy id wa st >> > > 3.4 52.6 0.0 64303937.0 16466.7 121544.5 0.0 0.0 2102.7 1914.6 7414.1 3185.7 2.0 1.0 80.3 16.7 0.0 >> > > vmstat of revert all from .37 >> > > 2.2 35.8 0.0 64306767.4 17265.6 126101.2 0.0 0.0 2415.8 1619.1 8532.2 3556.2 2.5 1.1 83.0 13.3 0.0 >> > > >> > > Regards >> > > Alex >> > > >> > > === >> > > diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c >> > > index 34a4861..27ac2f3 100644 >> > > --- a/fs/jbd/commit.c >> > > +++ b/fs/jbd/commit.c >> > > @@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal) >> > > int first_tag = 0; >> > > int tag_flag; >> > > int i; >> > > - int write_op = WRITE_SYNC; >> > > + int write_op = WRITE; >> > > >> > > /* >> > > * First job: lock down the current transaction and wait for >> > > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c >> > > index f3ad159..69ff08e 100644 >> > > --- a/fs/jbd2/commit.c >> > > +++ b/fs/jbd2/commit.c >> > > @@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) >> > > int tag_bytes = journal_tag_bytes(journal); >> > > struct buffer_head *cbh = NULL; /* For transactional checksums */ >> > > __u32 crc32_sum = ~0; >> > > - int write_op = WRITE_SYNC; >> > > + int write_op = WRITE; >> > > >> > > /* >> > > * First job: lock down the current transaction and wait for >> > > >> > > >> > > >> > >> > >> > >> > > > -- __________________________________________________________________________ dott. Corrado Zoccolo mailto:czoccolo@gmail.com PhD - Department of Computer Science - University of Pisa, Italy -------------------------------------------------------------------------- The self-confidence of a warrior is not the self-confidence of the average man. The average man seeks certainty in the eyes of the onlooker and calls that self-confidence. The warrior seeks impeccability in his own eyes and calls that humbleness. Tales of Power - C. Castaneda ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [performance bug] kernel building regression on 64 LCPUs machine 2011-01-21 7:23 ` Corrado Zoccolo 2011-01-21 7:47 ` Alex,Shi @ 2011-01-21 8:20 ` Shaohua Li 1 sibling, 0 replies; 38+ messages in thread From: Shaohua Li @ 2011-01-21 8:20 UTC (permalink / raw) To: Corrado Zoccolo, Jan Kara, Ted Ts'o Cc: Shi, Alex, vgoyal@redhat.com, jaxboe@fusionio.com, linux-kernel@vger.kernel.org, Chen, Tim C On Fri, 2011-01-21 at 15:23 +0800, Corrado Zoccolo wrote: > On Wed, Jan 19, 2011 at 2:55 AM, Alex,Shi <alex.shi@intel.com> wrote: > > Shaohua and I tested kernel building performance on latest kernel. and > > found it is drop about 15% on our 64 LCPUs NHM-EX machine on ext4 file > > system. We find this performance dropping is due to commit > > 749ef9f8423054e326f. If we revert this patch or just change the > > WRITE_SYNC back to WRITE in jbd2/commit.c file. the performance can be > > recovered. > > > > iostat report show with the commit, read request merge number increased > > and write request merge dropped. The total request size increased and > > queue length dropped. So we tested another patch: only change WRITE_SYNC > > to WRITE_SYNC_PLUG in jbd2/commit.c, but nothing effected. > > > > we didn't test deadline IO mode, just test cfq. seems insert write > > request into sync queue effect much read performance, but we don't know > > details. What's your comments of this? > > > > iostat of .37 kernel: > > rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util > > 22.5772 96.46 92.3742 14.747 1.0048 0.439474 34.8557 0.18078 3.8076 0.30447 2.94302 > > iostat of commit reverted .37: > > 26.6223 80.21 107.875 6.03538 1.51415 0 41.3275 0.153385 3.80569 0.377231 3.22323 > > From these numbers, it seems to me that with the patch reverted, the > write bandwidth is really low, and probably you are keeping most > written files in the buffer cache during the whole compile, while the > non-reverted kernel is making progress in writing out the files. So > the 'improved' read bandwidth is due to unfairness w.r.t. writes. > Does the result change if you add a final sync and time that as well, > in order to see the full time to make it on disk? > > I think that in a more extreme test where you end up filling all the > buffer cache with written data, you will see much longer stalls with > the revert than without. Not just bandwidth issue. cpu might have more idle. so I bet even the 'kbuild + sync' workload still has regression And the no-write-merge of WRITE_SYNC isn't good, at least we should use WRITE_SYNC_PLUG. Thanks, Shaohua > > > > vmstat report show, read bandwidth dropping: > > vmstat of .37: > > r b swpd free buff cache si so bi bo in cs us sy id wa st > > 3.4 52.6 0.0 64303937.0 16466.7 121544.5 0.0 0.0 2102.7 1914.6 7414.1 3185.7 2.0 1.0 80.3 16.7 0.0 > > vmstat of revert all from .37 > > 2.2 35.8 0.0 64306767.4 17265.6 126101.2 0.0 0.0 2415.8 1619.1 8532.2 3556.2 2.5 1.1 83.0 13.3 0.0 > > > > Regards > > Alex > > > > === > > diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c > > index 34a4861..27ac2f3 100644 > > --- a/fs/jbd/commit.c > > +++ b/fs/jbd/commit.c > > @@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal) > > int first_tag = 0; > > int tag_flag; > > int i; > > - int write_op = WRITE_SYNC; > > + int write_op = WRITE; > > > > /* > > * First job: lock down the current transaction and wait for > > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c > > index f3ad159..69ff08e 100644 > > --- a/fs/jbd2/commit.c > > +++ b/fs/jbd2/commit.c > > @@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) > > int tag_bytes = journal_tag_bytes(journal); > > struct buffer_head *cbh = NULL; /* For transactional checksums */ > > __u32 crc32_sum = ~0; > > - int write_op = WRITE_SYNC; > > + int write_op = WRITE; > > > > /* > > * First job: lock down the current transaction and wait for > > > > > > > > > ^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2011-03-28 19:49 UTC | newest] Thread overview: 38+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-01-19 1:55 [performance bug] kernel building regression on 64 LCPUs machine Alex,Shi 2011-01-19 2:03 ` Shaohua Li 2011-01-19 12:56 ` Jan Kara 2011-01-20 7:52 ` Alex,Shi 2011-01-20 15:16 ` Vivek Goyal 2011-01-21 7:17 ` Shaohua Li 2011-01-26 8:15 ` Shaohua Li 2011-02-12 9:21 ` Alex,Shi 2011-02-12 18:25 ` Corrado Zoccolo 2011-02-14 2:25 ` Alex,Shi 2011-02-15 1:10 ` Shaohua Li 2011-02-21 16:49 ` Jan Kara 2011-02-23 8:24 ` Alex,Shi 2011-02-24 12:13 ` Jan Kara 2011-02-25 0:44 ` Alex Shi 2011-02-26 14:45 ` Corrado Zoccolo 2011-03-01 19:56 ` Jeff Moyer 2011-03-02 9:42 ` Jan Kara 2011-03-02 16:13 ` Jeff Moyer 2011-03-02 21:17 ` Jan Kara 2011-03-02 21:20 ` Jeff Moyer 2011-03-03 1:14 ` Jeff Moyer 2011-03-04 15:32 ` Jan Kara 2011-03-04 15:40 ` Jeff Moyer 2011-03-04 15:50 ` Jeff Moyer 2011-03-04 18:27 ` Jeff Moyer 2011-03-22 7:38 ` Alex,Shi 2011-03-22 16:14 ` Jan Kara 2011-03-22 17:46 ` Jeff Moyer 2011-03-24 6:45 ` Alex,Shi 2011-03-28 19:48 ` Jan Kara 2011-01-19 14:32 ` Ted Ts'o 2011-01-20 2:12 ` Shaohua Li 2011-01-21 7:23 ` Corrado Zoccolo 2011-01-21 7:47 ` Alex,Shi 2011-01-21 7:52 ` Alex,Shi 2011-01-21 8:13 ` Corrado Zoccolo 2011-01-21 8:20 ` Shaohua Li
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).