* [PATCH, RFC 0/3] *** SUBJECT HERE ***
@ 2010-08-03 16:01 ` Theodore Ts'o
0 siblings, 0 replies; 22+ messages in thread
From: Theodore Ts'o @ 2010-08-03 16:01 UTC (permalink / raw)
To: Ext4 Developers List, ocfs2-devel
Cc: John Stultz, Keith Maanthey, Eric Whitney, Theodore Ts'o
The first patch in this patch series hasn't changed since when I had
last posted it, but I'm including it again for the benefit of the folks
on ocfs2-dev.
Thanks to some work done by Eric Whitney, when he accidentally ran the
command "mkfs.ext4 -t xfs", and created a ext4 file system without a
journal, it appears that main scalability bottleneck for ext4 is in the
jbd2 layer. In fact, his testing on a 48-core system shows that on some
workloads, ext4 is roughly comparable with XFS!
The lockstat results indicate that the main bottlenecks are in the
j_state_lock and t_handle_lock, especially in start_this_handle() in
fs/jbd2/transaction.c. A previous patch, which removed an unneeded
grabbing of j_state_lock jbd2_journal_stop() relieved pressure on that
lock and was noted to make a significant difference for dbench on a
kernel with CONFIG_PREEMPT_RT enabled, as well as on a 48-core AMD
system from HP. This patch is already in 2.6.35, and the benchmark
results can be found here: http://free.linux.hp.com/~enw/ext4/2.6.34/
This patch series removes all exclusive spinlocks when starting and
stopping jbd2 handles, which should improve things even more. Since
OCFS2 uses the jbd2 layer, and the second patch in this patch series
touches ocfs2 a wee bit, I'd appreciate it if you could take a look and
let me know what you think. Hopefully, this should also improve OCFS2's
scalability.
Best regards,
- Ted
Theodore Ts'o (3):
jbd2: Use atomic variables to avoid taking t_handle_lock in
jbd2_journal_stop
jbd2: Change j_state_lock to be a rwlock_t
jbd2: Remove t_handle_lock from start_this_handle()
fs/ext4/inode.c | 4 +-
fs/ext4/super.c | 4 +-
fs/jbd2/checkpoint.c | 18 +++---
fs/jbd2/commit.c | 42 +++++++-------
fs/jbd2/journal.c | 94 +++++++++++++++----------------
fs/jbd2/transaction.c | 149 ++++++++++++++++++++++++++++---------------------
fs/ocfs2/journal.c | 4 +-
include/linux/jbd2.h | 12 ++--
8 files changed, 174 insertions(+), 153 deletions(-)
^ permalink raw reply [flat|nested] 22+ messages in thread* [Ocfs2-devel] [PATCH, RFC 0/3] *** SUBJECT HERE *** @ 2010-08-03 16:01 ` Theodore Ts'o 0 siblings, 0 replies; 22+ messages in thread From: Theodore Ts'o @ 2010-08-03 16:01 UTC (permalink / raw) To: Ext4 Developers List, ocfs2-devel Cc: John Stultz, Keith Maanthey, Eric Whitney, Theodore Ts'o The first patch in this patch series hasn't changed since when I had last posted it, but I'm including it again for the benefit of the folks on ocfs2-dev. Thanks to some work done by Eric Whitney, when he accidentally ran the command "mkfs.ext4 -t xfs", and created a ext4 file system without a journal, it appears that main scalability bottleneck for ext4 is in the jbd2 layer. In fact, his testing on a 48-core system shows that on some workloads, ext4 is roughly comparable with XFS! The lockstat results indicate that the main bottlenecks are in the j_state_lock and t_handle_lock, especially in start_this_handle() in fs/jbd2/transaction.c. A previous patch, which removed an unneeded grabbing of j_state_lock jbd2_journal_stop() relieved pressure on that lock and was noted to make a significant difference for dbench on a kernel with CONFIG_PREEMPT_RT enabled, as well as on a 48-core AMD system from HP. This patch is already in 2.6.35, and the benchmark results can be found here: http://free.linux.hp.com/~enw/ext4/2.6.34/ This patch series removes all exclusive spinlocks when starting and stopping jbd2 handles, which should improve things even more. Since OCFS2 uses the jbd2 layer, and the second patch in this patch series touches ocfs2 a wee bit, I'd appreciate it if you could take a look and let me know what you think. Hopefully, this should also improve OCFS2's scalability. Best regards, - Ted Theodore Ts'o (3): jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop jbd2: Change j_state_lock to be a rwlock_t jbd2: Remove t_handle_lock from start_this_handle() fs/ext4/inode.c | 4 +- fs/ext4/super.c | 4 +- fs/jbd2/checkpoint.c | 18 +++--- fs/jbd2/commit.c | 42 +++++++------- fs/jbd2/journal.c | 94 +++++++++++++++---------------- fs/jbd2/transaction.c | 149 ++++++++++++++++++++++++++++--------------------- fs/ocfs2/journal.c | 4 +- include/linux/jbd2.h | 12 ++-- 8 files changed, 174 insertions(+), 153 deletions(-) ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH, RFC 1/3] jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop 2010-08-03 16:01 ` [Ocfs2-devel] " Theodore Ts'o @ 2010-08-03 16:01 ` Theodore Ts'o -1 siblings, 0 replies; 22+ messages in thread From: Theodore Ts'o @ 2010-08-03 16:01 UTC (permalink / raw) To: Ext4 Developers List, ocfs2-devel Cc: John Stultz, Keith Maanthey, Eric Whitney, Theodore Ts'o By using an atomic_t for t_updates and t_outstanding credits, this should allow us to not need to take transaction t_handle_lock in jbd2_journal_stop(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> --- fs/jbd2/checkpoint.c | 2 +- fs/jbd2/commit.c | 13 +++++---- fs/jbd2/transaction.c | 69 +++++++++++++++++++++++++++--------------------- include/linux/jbd2.h | 8 +++--- 4 files changed, 51 insertions(+), 41 deletions(-) diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c index 076d1cc..f8cdc02 100644 --- a/fs/jbd2/checkpoint.c +++ b/fs/jbd2/checkpoint.c @@ -775,7 +775,7 @@ void __jbd2_journal_drop_transaction(journal_t *journal, transaction_t *transact J_ASSERT(transaction->t_log_list == NULL); J_ASSERT(transaction->t_checkpoint_list == NULL); J_ASSERT(transaction->t_checkpoint_io_list == NULL); - J_ASSERT(transaction->t_updates == 0); + J_ASSERT(atomic_read(&transaction->t_updates) == 0); J_ASSERT(journal->j_committing_transaction != transaction); J_ASSERT(journal->j_running_transaction != transaction); diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index af05681..fbd2c56 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -417,12 +417,12 @@ void jbd2_journal_commit_transaction(journal_t *journal) stats.run.rs_locked); spin_lock(&commit_transaction->t_handle_lock); - while (commit_transaction->t_updates) { + while (atomic_read(&commit_transaction->t_updates)) { DEFINE_WAIT(wait); prepare_to_wait(&journal->j_wait_updates, &wait, TASK_UNINTERRUPTIBLE); - if (commit_transaction->t_updates) { + if (atomic_read(&commit_transaction->t_updates)) { spin_unlock(&commit_transaction->t_handle_lock); spin_unlock(&journal->j_state_lock); schedule(); @@ -433,7 +433,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) } spin_unlock(&commit_transaction->t_handle_lock); - J_ASSERT (commit_transaction->t_outstanding_credits <= + J_ASSERT (atomic_read(&commit_transaction->t_outstanding_credits) <= journal->j_max_transaction_buffers); /* @@ -527,11 +527,12 @@ void jbd2_journal_commit_transaction(journal_t *journal) stats.run.rs_logging = jiffies; stats.run.rs_flushing = jbd2_time_diff(stats.run.rs_flushing, stats.run.rs_logging); - stats.run.rs_blocks = commit_transaction->t_outstanding_credits; + stats.run.rs_blocks = + atomic_read(&commit_transaction->t_outstanding_credits); stats.run.rs_blocks_logged = 0; J_ASSERT(commit_transaction->t_nr_buffers <= - commit_transaction->t_outstanding_credits); + atomic_read(&commit_transaction->t_outstanding_credits)); err = 0; descriptor = NULL; @@ -616,7 +617,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) * the free space in the log, but this counter is changed * by jbd2_journal_next_log_block() also. */ - commit_transaction->t_outstanding_credits--; + atomic_dec(&commit_transaction->t_outstanding_credits); /* Bump b_count to prevent truncate from stumbling over the shadowed buffer! @@@ This can go if we ever get diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index 001e95f..9c64c7e 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -55,6 +55,8 @@ jbd2_get_transaction(journal_t *journal, transaction_t *transaction) transaction->t_tid = journal->j_transaction_sequence++; transaction->t_expires = jiffies + journal->j_commit_interval; spin_lock_init(&transaction->t_handle_lock); + atomic_set(&transaction->t_updates, 0); + atomic_set(&transaction->t_outstanding_credits, 0); INIT_LIST_HEAD(&transaction->t_inode_list); INIT_LIST_HEAD(&transaction->t_private_list); @@ -177,7 +179,7 @@ repeat_locked: * checkpoint to free some more log space. */ spin_lock(&transaction->t_handle_lock); - needed = transaction->t_outstanding_credits + nblocks; + needed = atomic_read(&transaction->t_outstanding_credits) + nblocks; if (needed > journal->j_max_transaction_buffers) { /* @@ -240,11 +242,12 @@ repeat_locked: } handle->h_transaction = transaction; - transaction->t_outstanding_credits += nblocks; - transaction->t_updates++; + atomic_add(nblocks, &transaction->t_outstanding_credits); + atomic_inc(&transaction->t_updates); transaction->t_handle_count++; jbd_debug(4, "Handle %p given %d credits (total %d, free %d)\n", - handle, nblocks, transaction->t_outstanding_credits, + handle, nblocks, + atomic_read(&transaction->t_outstanding_credits), __jbd2_log_space_left(journal)); spin_unlock(&transaction->t_handle_lock); spin_unlock(&journal->j_state_lock); @@ -369,7 +372,7 @@ int jbd2_journal_extend(handle_t *handle, int nblocks) } spin_lock(&transaction->t_handle_lock); - wanted = transaction->t_outstanding_credits + nblocks; + wanted = atomic_read(&transaction->t_outstanding_credits) + nblocks; if (wanted > journal->j_max_transaction_buffers) { jbd_debug(3, "denied handle %p %d blocks: " @@ -384,7 +387,7 @@ int jbd2_journal_extend(handle_t *handle, int nblocks) } handle->h_buffer_credits += nblocks; - transaction->t_outstanding_credits += nblocks; + atomic_add(nblocks, &transaction->t_outstanding_credits); result = 0; jbd_debug(3, "extended handle %p by %d\n", handle, nblocks); @@ -426,15 +429,14 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask) * First unlink the handle from its current transaction, and start the * commit on that. */ - J_ASSERT(transaction->t_updates > 0); + J_ASSERT(atomic_read(&transaction->t_updates) > 0); J_ASSERT(journal_current_handle() == handle); spin_lock(&journal->j_state_lock); spin_lock(&transaction->t_handle_lock); - transaction->t_outstanding_credits -= handle->h_buffer_credits; - transaction->t_updates--; - - if (!transaction->t_updates) + atomic_sub(handle->h_buffer_credits, + &transaction->t_outstanding_credits); + if (atomic_dec_and_test(&transaction->t_updates)) wake_up(&journal->j_wait_updates); spin_unlock(&transaction->t_handle_lock); @@ -481,7 +483,7 @@ void jbd2_journal_lock_updates(journal_t *journal) break; spin_lock(&transaction->t_handle_lock); - if (!transaction->t_updates) { + if (!atomic_read(&transaction->t_updates)) { spin_unlock(&transaction->t_handle_lock); break; } @@ -1258,7 +1260,8 @@ int jbd2_journal_stop(handle_t *handle) { transaction_t *transaction = handle->h_transaction; journal_t *journal = transaction->t_journal; - int err; + int err, wait_for_commit = 0; + tid_t tid; pid_t pid; J_ASSERT(journal_current_handle() == handle); @@ -1266,7 +1269,7 @@ int jbd2_journal_stop(handle_t *handle) if (is_handle_aborted(handle)) err = -EIO; else { - J_ASSERT(transaction->t_updates > 0); + J_ASSERT(atomic_read(&transaction->t_updates) > 0); err = 0; } @@ -1334,14 +1337,8 @@ int jbd2_journal_stop(handle_t *handle) if (handle->h_sync) transaction->t_synchronous_commit = 1; current->journal_info = NULL; - spin_lock(&transaction->t_handle_lock); - transaction->t_outstanding_credits -= handle->h_buffer_credits; - transaction->t_updates--; - if (!transaction->t_updates) { - wake_up(&journal->j_wait_updates); - if (journal->j_barrier_count) - wake_up(&journal->j_wait_transaction_locked); - } + atomic_sub(handle->h_buffer_credits, + &transaction->t_outstanding_credits); /* * If the handle is marked SYNC, we need to set another commit @@ -1350,15 +1347,13 @@ int jbd2_journal_stop(handle_t *handle) * transaction is too old now. */ if (handle->h_sync || - transaction->t_outstanding_credits > - journal->j_max_transaction_buffers || - time_after_eq(jiffies, transaction->t_expires)) { + (atomic_read(&transaction->t_outstanding_credits) > + journal->j_max_transaction_buffers) || + time_after_eq(jiffies, transaction->t_expires)) { /* Do this even for aborted journals: an abort still * completes the commit thread, it just doesn't write * anything to disk. */ - tid_t tid = transaction->t_tid; - spin_unlock(&transaction->t_handle_lock); jbd_debug(2, "transaction too old, requesting commit for " "handle %p\n", handle); /* This is non-blocking */ @@ -1369,11 +1364,25 @@ int jbd2_journal_stop(handle_t *handle) * to wait for the commit to complete. */ if (handle->h_sync && !(current->flags & PF_MEMALLOC)) - err = jbd2_log_wait_commit(journal, tid); - } else { - spin_unlock(&transaction->t_handle_lock); + wait_for_commit = 1; } + /* + * Once we drop t_updates, if it goes to zero the transaction + * could start commiting on us and eventually disappear. So + * once we do this, we must not dereference transaction + * pointer again. + */ + tid = transaction->t_tid; + if (atomic_dec_and_test(&transaction->t_updates)) { + wake_up(&journal->j_wait_updates); + if (journal->j_barrier_count) + wake_up(&journal->j_wait_transaction_locked); + } + + if (wait_for_commit) + err = jbd2_log_wait_commit(journal, tid); + lock_map_release(&handle->h_lockdep_map); jbd2_free_handle(handle); diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index 5a72bc7..a72ce21 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -601,13 +601,13 @@ struct transaction_s * Number of outstanding updates running on this transaction * [t_handle_lock] */ - int t_updates; + atomic_t t_updates; /* * Number of buffers reserved for use by all handles in this transaction * handle but not yet modified. [t_handle_lock] */ - int t_outstanding_credits; + atomic_t t_outstanding_credits; /* * Forward and backward links for the circular list of all transactions @@ -1258,8 +1258,8 @@ static inline int jbd_space_needed(journal_t *journal) { int nblocks = journal->j_max_transaction_buffers; if (journal->j_committing_transaction) - nblocks += journal->j_committing_transaction-> - t_outstanding_credits; + nblocks += atomic_read(&journal->j_committing_transaction-> + t_outstanding_credits); return nblocks; } -- 1.7.0.4 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* [Ocfs2-devel] [PATCH, RFC 1/3] jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop @ 2010-08-03 16:01 ` Theodore Ts'o 0 siblings, 0 replies; 22+ messages in thread From: Theodore Ts'o @ 2010-08-03 16:01 UTC (permalink / raw) To: Ext4 Developers List, ocfs2-devel Cc: John Stultz, Keith Maanthey, Eric Whitney, Theodore Ts'o By using an atomic_t for t_updates and t_outstanding credits, this should allow us to not need to take transaction t_handle_lock in jbd2_journal_stop(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> --- fs/jbd2/checkpoint.c | 2 +- fs/jbd2/commit.c | 13 +++++---- fs/jbd2/transaction.c | 69 +++++++++++++++++++++++++++--------------------- include/linux/jbd2.h | 8 +++--- 4 files changed, 51 insertions(+), 41 deletions(-) diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c index 076d1cc..f8cdc02 100644 --- a/fs/jbd2/checkpoint.c +++ b/fs/jbd2/checkpoint.c @@ -775,7 +775,7 @@ void __jbd2_journal_drop_transaction(journal_t *journal, transaction_t *transact J_ASSERT(transaction->t_log_list == NULL); J_ASSERT(transaction->t_checkpoint_list == NULL); J_ASSERT(transaction->t_checkpoint_io_list == NULL); - J_ASSERT(transaction->t_updates == 0); + J_ASSERT(atomic_read(&transaction->t_updates) == 0); J_ASSERT(journal->j_committing_transaction != transaction); J_ASSERT(journal->j_running_transaction != transaction); diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index af05681..fbd2c56 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -417,12 +417,12 @@ void jbd2_journal_commit_transaction(journal_t *journal) stats.run.rs_locked); spin_lock(&commit_transaction->t_handle_lock); - while (commit_transaction->t_updates) { + while (atomic_read(&commit_transaction->t_updates)) { DEFINE_WAIT(wait); prepare_to_wait(&journal->j_wait_updates, &wait, TASK_UNINTERRUPTIBLE); - if (commit_transaction->t_updates) { + if (atomic_read(&commit_transaction->t_updates)) { spin_unlock(&commit_transaction->t_handle_lock); spin_unlock(&journal->j_state_lock); schedule(); @@ -433,7 +433,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) } spin_unlock(&commit_transaction->t_handle_lock); - J_ASSERT (commit_transaction->t_outstanding_credits <= + J_ASSERT (atomic_read(&commit_transaction->t_outstanding_credits) <= journal->j_max_transaction_buffers); /* @@ -527,11 +527,12 @@ void jbd2_journal_commit_transaction(journal_t *journal) stats.run.rs_logging = jiffies; stats.run.rs_flushing = jbd2_time_diff(stats.run.rs_flushing, stats.run.rs_logging); - stats.run.rs_blocks = commit_transaction->t_outstanding_credits; + stats.run.rs_blocks = + atomic_read(&commit_transaction->t_outstanding_credits); stats.run.rs_blocks_logged = 0; J_ASSERT(commit_transaction->t_nr_buffers <= - commit_transaction->t_outstanding_credits); + atomic_read(&commit_transaction->t_outstanding_credits)); err = 0; descriptor = NULL; @@ -616,7 +617,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) * the free space in the log, but this counter is changed * by jbd2_journal_next_log_block() also. */ - commit_transaction->t_outstanding_credits--; + atomic_dec(&commit_transaction->t_outstanding_credits); /* Bump b_count to prevent truncate from stumbling over the shadowed buffer! @@@ This can go if we ever get diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index 001e95f..9c64c7e 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -55,6 +55,8 @@ jbd2_get_transaction(journal_t *journal, transaction_t *transaction) transaction->t_tid = journal->j_transaction_sequence++; transaction->t_expires = jiffies + journal->j_commit_interval; spin_lock_init(&transaction->t_handle_lock); + atomic_set(&transaction->t_updates, 0); + atomic_set(&transaction->t_outstanding_credits, 0); INIT_LIST_HEAD(&transaction->t_inode_list); INIT_LIST_HEAD(&transaction->t_private_list); @@ -177,7 +179,7 @@ repeat_locked: * checkpoint to free some more log space. */ spin_lock(&transaction->t_handle_lock); - needed = transaction->t_outstanding_credits + nblocks; + needed = atomic_read(&transaction->t_outstanding_credits) + nblocks; if (needed > journal->j_max_transaction_buffers) { /* @@ -240,11 +242,12 @@ repeat_locked: } handle->h_transaction = transaction; - transaction->t_outstanding_credits += nblocks; - transaction->t_updates++; + atomic_add(nblocks, &transaction->t_outstanding_credits); + atomic_inc(&transaction->t_updates); transaction->t_handle_count++; jbd_debug(4, "Handle %p given %d credits (total %d, free %d)\n", - handle, nblocks, transaction->t_outstanding_credits, + handle, nblocks, + atomic_read(&transaction->t_outstanding_credits), __jbd2_log_space_left(journal)); spin_unlock(&transaction->t_handle_lock); spin_unlock(&journal->j_state_lock); @@ -369,7 +372,7 @@ int jbd2_journal_extend(handle_t *handle, int nblocks) } spin_lock(&transaction->t_handle_lock); - wanted = transaction->t_outstanding_credits + nblocks; + wanted = atomic_read(&transaction->t_outstanding_credits) + nblocks; if (wanted > journal->j_max_transaction_buffers) { jbd_debug(3, "denied handle %p %d blocks: " @@ -384,7 +387,7 @@ int jbd2_journal_extend(handle_t *handle, int nblocks) } handle->h_buffer_credits += nblocks; - transaction->t_outstanding_credits += nblocks; + atomic_add(nblocks, &transaction->t_outstanding_credits); result = 0; jbd_debug(3, "extended handle %p by %d\n", handle, nblocks); @@ -426,15 +429,14 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask) * First unlink the handle from its current transaction, and start the * commit on that. */ - J_ASSERT(transaction->t_updates > 0); + J_ASSERT(atomic_read(&transaction->t_updates) > 0); J_ASSERT(journal_current_handle() == handle); spin_lock(&journal->j_state_lock); spin_lock(&transaction->t_handle_lock); - transaction->t_outstanding_credits -= handle->h_buffer_credits; - transaction->t_updates--; - - if (!transaction->t_updates) + atomic_sub(handle->h_buffer_credits, + &transaction->t_outstanding_credits); + if (atomic_dec_and_test(&transaction->t_updates)) wake_up(&journal->j_wait_updates); spin_unlock(&transaction->t_handle_lock); @@ -481,7 +483,7 @@ void jbd2_journal_lock_updates(journal_t *journal) break; spin_lock(&transaction->t_handle_lock); - if (!transaction->t_updates) { + if (!atomic_read(&transaction->t_updates)) { spin_unlock(&transaction->t_handle_lock); break; } @@ -1258,7 +1260,8 @@ int jbd2_journal_stop(handle_t *handle) { transaction_t *transaction = handle->h_transaction; journal_t *journal = transaction->t_journal; - int err; + int err, wait_for_commit = 0; + tid_t tid; pid_t pid; J_ASSERT(journal_current_handle() == handle); @@ -1266,7 +1269,7 @@ int jbd2_journal_stop(handle_t *handle) if (is_handle_aborted(handle)) err = -EIO; else { - J_ASSERT(transaction->t_updates > 0); + J_ASSERT(atomic_read(&transaction->t_updates) > 0); err = 0; } @@ -1334,14 +1337,8 @@ int jbd2_journal_stop(handle_t *handle) if (handle->h_sync) transaction->t_synchronous_commit = 1; current->journal_info = NULL; - spin_lock(&transaction->t_handle_lock); - transaction->t_outstanding_credits -= handle->h_buffer_credits; - transaction->t_updates--; - if (!transaction->t_updates) { - wake_up(&journal->j_wait_updates); - if (journal->j_barrier_count) - wake_up(&journal->j_wait_transaction_locked); - } + atomic_sub(handle->h_buffer_credits, + &transaction->t_outstanding_credits); /* * If the handle is marked SYNC, we need to set another commit @@ -1350,15 +1347,13 @@ int jbd2_journal_stop(handle_t *handle) * transaction is too old now. */ if (handle->h_sync || - transaction->t_outstanding_credits > - journal->j_max_transaction_buffers || - time_after_eq(jiffies, transaction->t_expires)) { + (atomic_read(&transaction->t_outstanding_credits) > + journal->j_max_transaction_buffers) || + time_after_eq(jiffies, transaction->t_expires)) { /* Do this even for aborted journals: an abort still * completes the commit thread, it just doesn't write * anything to disk. */ - tid_t tid = transaction->t_tid; - spin_unlock(&transaction->t_handle_lock); jbd_debug(2, "transaction too old, requesting commit for " "handle %p\n", handle); /* This is non-blocking */ @@ -1369,11 +1364,25 @@ int jbd2_journal_stop(handle_t *handle) * to wait for the commit to complete. */ if (handle->h_sync && !(current->flags & PF_MEMALLOC)) - err = jbd2_log_wait_commit(journal, tid); - } else { - spin_unlock(&transaction->t_handle_lock); + wait_for_commit = 1; } + /* + * Once we drop t_updates, if it goes to zero the transaction + * could start commiting on us and eventually disappear. So + * once we do this, we must not dereference transaction + * pointer again. + */ + tid = transaction->t_tid; + if (atomic_dec_and_test(&transaction->t_updates)) { + wake_up(&journal->j_wait_updates); + if (journal->j_barrier_count) + wake_up(&journal->j_wait_transaction_locked); + } + + if (wait_for_commit) + err = jbd2_log_wait_commit(journal, tid); + lock_map_release(&handle->h_lockdep_map); jbd2_free_handle(handle); diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index 5a72bc7..a72ce21 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -601,13 +601,13 @@ struct transaction_s * Number of outstanding updates running on this transaction * [t_handle_lock] */ - int t_updates; + atomic_t t_updates; /* * Number of buffers reserved for use by all handles in this transaction * handle but not yet modified. [t_handle_lock] */ - int t_outstanding_credits; + atomic_t t_outstanding_credits; /* * Forward and backward links for the circular list of all transactions @@ -1258,8 +1258,8 @@ static inline int jbd_space_needed(journal_t *journal) { int nblocks = journal->j_max_transaction_buffers; if (journal->j_committing_transaction) - nblocks += journal->j_committing_transaction-> - t_outstanding_credits; + nblocks += atomic_read(&journal->j_committing_transaction-> + t_outstanding_credits); return nblocks; } -- 1.7.0.4 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH, RFC 2/3] jbd2: Change j_state_lock to be a rwlock_t 2010-08-03 16:01 ` [Ocfs2-devel] " Theodore Ts'o @ 2010-08-03 16:01 ` Theodore Ts'o -1 siblings, 0 replies; 22+ messages in thread From: Theodore Ts'o @ 2010-08-03 16:01 UTC (permalink / raw) To: Ext4 Developers List, ocfs2-devel Cc: John Stultz, Keith Maanthey, Eric Whitney, Theodore Ts'o Lockstat reports have shown that j_state_lock is a major source of lock contention, especially on systems with more than 4 CPU cores. So change it to be a read/write spinlock. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> --- fs/ext4/inode.c | 4 +- fs/ext4/super.c | 4 +- fs/jbd2/checkpoint.c | 16 ++++---- fs/jbd2/commit.c | 26 +++++++------- fs/jbd2/journal.c | 94 ++++++++++++++++++++++++------------------------- fs/jbd2/transaction.c | 52 +++++++++++++------------- fs/ocfs2/journal.c | 4 +- include/linux/jbd2.h | 2 +- 8 files changed, 100 insertions(+), 102 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 533b607..ab2247d 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5066,7 +5066,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) transaction_t *transaction; tid_t tid; - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); if (journal->j_running_transaction) transaction = journal->j_running_transaction; else @@ -5075,7 +5075,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) tid = transaction->t_tid; else tid = journal->j_commit_sequence; - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); ei->i_sync_tid = tid; ei->i_datasync_tid = tid; } diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 3fd65eb..81cb3fc 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -3232,7 +3232,7 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal) journal->j_min_batch_time = sbi->s_min_batch_time; journal->j_max_batch_time = sbi->s_max_batch_time; - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); if (test_opt(sb, BARRIER)) journal->j_flags |= JBD2_BARRIER; else @@ -3241,7 +3241,7 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal) journal->j_flags |= JBD2_ABORT_ON_SYNCDATA_ERR; else journal->j_flags &= ~JBD2_ABORT_ON_SYNCDATA_ERR; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); } static journal_t *ext4_get_journal(struct super_block *sb, diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c index f8cdc02..1c23a0f 100644 --- a/fs/jbd2/checkpoint.c +++ b/fs/jbd2/checkpoint.c @@ -118,13 +118,13 @@ static int __try_to_free_cp_buf(struct journal_head *jh) void __jbd2_log_wait_for_space(journal_t *journal) { int nblocks, space_left; - assert_spin_locked(&journal->j_state_lock); + /* assert_spin_locked(&journal->j_state_lock); */ nblocks = jbd_space_needed(journal); while (__jbd2_log_space_left(journal) < nblocks) { if (journal->j_flags & JBD2_ABORT) return; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); mutex_lock(&journal->j_checkpoint_mutex); /* @@ -138,7 +138,7 @@ void __jbd2_log_wait_for_space(journal_t *journal) * filesystem, so abort the journal and leave a stack * trace for forensic evidence. */ - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); spin_lock(&journal->j_list_lock); nblocks = jbd_space_needed(journal); space_left = __jbd2_log_space_left(journal); @@ -149,7 +149,7 @@ void __jbd2_log_wait_for_space(journal_t *journal) if (journal->j_committing_transaction) tid = journal->j_committing_transaction->t_tid; spin_unlock(&journal->j_list_lock); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); if (chkpt) { jbd2_log_do_checkpoint(journal); } else if (jbd2_cleanup_journal_tail(journal) == 0) { @@ -167,7 +167,7 @@ void __jbd2_log_wait_for_space(journal_t *journal) WARN_ON(1); jbd2_journal_abort(journal, 0); } - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); } else { spin_unlock(&journal->j_list_lock); } @@ -474,7 +474,7 @@ int jbd2_cleanup_journal_tail(journal_t *journal) * next transaction ID we will write, and where it will * start. */ - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); spin_lock(&journal->j_list_lock); transaction = journal->j_checkpoint_transactions; if (transaction) { @@ -496,7 +496,7 @@ int jbd2_cleanup_journal_tail(journal_t *journal) /* If the oldest pinned transaction is at the tail of the log already then there's not much we can do right now. */ if (journal->j_tail_sequence == first_tid) { - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); return 1; } @@ -516,7 +516,7 @@ int jbd2_cleanup_journal_tail(journal_t *journal) journal->j_free += freed; journal->j_tail_sequence = first_tid; journal->j_tail = blocknr; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); /* * If there is an external journal, we need to make sure that diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index fbd2c56..67bb0a2 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -152,9 +152,9 @@ static int journal_submit_commit_record(journal_t *journal, printk(KERN_WARNING "JBD2: Disabling barriers on %s, " "not supported by device\n", journal->j_devname); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); journal->j_flags &= ~JBD2_BARRIER; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); /* And try again, without the barrier */ lock_buffer(bh); @@ -182,9 +182,9 @@ retry: printk(KERN_WARNING "JBD2: %s: disabling barries on %s - not supported " "by device\n", __func__, journal->j_devname); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); journal->j_flags &= ~JBD2_BARRIER; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); lock_buffer(bh); clear_buffer_dirty(bh); @@ -400,7 +400,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) jbd_debug(1, "JBD: starting commit of transaction %d\n", commit_transaction->t_tid); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); commit_transaction->t_state = T_LOCKED; /* @@ -424,9 +424,9 @@ void jbd2_journal_commit_transaction(journal_t *journal) TASK_UNINTERRUPTIBLE); if (atomic_read(&commit_transaction->t_updates)) { spin_unlock(&commit_transaction->t_handle_lock); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); schedule(); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); spin_lock(&commit_transaction->t_handle_lock); } finish_wait(&journal->j_wait_updates, &wait); @@ -497,7 +497,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) start_time = ktime_get(); commit_transaction->t_log_start = journal->j_head; wake_up(&journal->j_wait_transaction_locked); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); jbd_debug (3, "JBD: commit phase 2\n"); @@ -519,9 +519,9 @@ void jbd2_journal_commit_transaction(journal_t *journal) * transaction! Now comes the tricky part: we need to write out * metadata. Loop over the transaction's entire buffer list: */ - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); commit_transaction->t_state = T_COMMIT; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); trace_jbd2_commit_logging(journal, commit_transaction); stats.run.rs_logging = jiffies; @@ -978,7 +978,7 @@ restart_loop: * __jbd2_journal_drop_transaction(). Otherwise we could race with * other checkpointing code processing the transaction... */ - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); spin_lock(&journal->j_list_lock); /* * Now recheck if some buffers did not get attached to the transaction @@ -986,7 +986,7 @@ restart_loop: */ if (commit_transaction->t_forget) { spin_unlock(&journal->j_list_lock); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); goto restart_loop; } @@ -1038,7 +1038,7 @@ restart_loop: journal->j_average_commit_time*3) / 4; else journal->j_average_commit_time = commit_time; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); if (commit_transaction->t_checkpoint_list == NULL && commit_transaction->t_checkpoint_io_list == NULL) { diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c index a79d334..e7bf0fd 100644 --- a/fs/jbd2/journal.c +++ b/fs/jbd2/journal.c @@ -142,7 +142,7 @@ static int kjournald2(void *arg) /* * And now, wait forever for commit wakeup events. */ - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); loop: if (journal->j_flags & JBD2_UNMOUNT) @@ -153,10 +153,10 @@ loop: if (journal->j_commit_sequence != journal->j_commit_request) { jbd_debug(1, "OK, requests differ\n"); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); del_timer_sync(&journal->j_commit_timer); jbd2_journal_commit_transaction(journal); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); goto loop; } @@ -168,9 +168,9 @@ loop: * be already stopped. */ jbd_debug(1, "Now suspending kjournald2\n"); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); refrigerator(); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); } else { /* * We assume on resume that commits are already there, @@ -190,9 +190,9 @@ loop: if (journal->j_flags & JBD2_UNMOUNT) should_sleep = 0; if (should_sleep) { - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); schedule(); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); } finish_wait(&journal->j_wait_commit, &wait); } @@ -210,7 +210,7 @@ loop: goto loop; end_loop: - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); del_timer_sync(&journal->j_commit_timer); journal->j_task = NULL; wake_up(&journal->j_wait_done_commit); @@ -233,16 +233,16 @@ static int jbd2_journal_start_thread(journal_t *journal) static void journal_kill_thread(journal_t *journal) { - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); journal->j_flags |= JBD2_UNMOUNT; while (journal->j_task) { wake_up(&journal->j_wait_commit); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); wait_event(journal->j_wait_done_commit, journal->j_task == NULL); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); } - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); } /* @@ -452,7 +452,7 @@ int __jbd2_log_space_left(journal_t *journal) { int left = journal->j_free; - assert_spin_locked(&journal->j_state_lock); + /* assert_spin_locked(&journal->j_state_lock); */ /* * Be pessimistic here about the number of those free blocks which @@ -497,9 +497,9 @@ int jbd2_log_start_commit(journal_t *journal, tid_t tid) { int ret; - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); ret = __jbd2_log_start_commit(journal, tid); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); return ret; } @@ -518,7 +518,7 @@ int jbd2_journal_force_commit_nested(journal_t *journal) transaction_t *transaction = NULL; tid_t tid; - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); if (journal->j_running_transaction && !current->journal_info) { transaction = journal->j_running_transaction; __jbd2_log_start_commit(journal, transaction->t_tid); @@ -526,12 +526,12 @@ int jbd2_journal_force_commit_nested(journal_t *journal) transaction = journal->j_committing_transaction; if (!transaction) { - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); return 0; /* Nothing to retry */ } tid = transaction->t_tid; - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); jbd2_log_wait_commit(journal, tid); return 1; } @@ -545,7 +545,7 @@ int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid) { int ret = 0; - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); if (journal->j_running_transaction) { tid_t tid = journal->j_running_transaction->t_tid; @@ -564,7 +564,7 @@ int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid) *ptid = journal->j_committing_transaction->t_tid; ret = 1; } - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); return ret; } @@ -576,26 +576,24 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid) { int err = 0; + read_lock(&journal->j_state_lock); #ifdef CONFIG_JBD2_DEBUG - spin_lock(&journal->j_state_lock); if (!tid_geq(journal->j_commit_request, tid)) { printk(KERN_EMERG "%s: error: j_commit_request=%d, tid=%d\n", __func__, journal->j_commit_request, tid); } - spin_unlock(&journal->j_state_lock); #endif - spin_lock(&journal->j_state_lock); while (tid_gt(tid, journal->j_commit_sequence)) { jbd_debug(1, "JBD: want %d, j_commit_sequence=%d\n", tid, journal->j_commit_sequence); wake_up(&journal->j_wait_commit); - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); wait_event(journal->j_wait_done_commit, !tid_gt(tid, journal->j_commit_sequence)); - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); } - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); if (unlikely(is_journal_aborted(journal))) { printk(KERN_EMERG "journal commit I/O error\n"); @@ -612,7 +610,7 @@ int jbd2_journal_next_log_block(journal_t *journal, unsigned long long *retp) { unsigned long blocknr; - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); J_ASSERT(journal->j_free > 1); blocknr = journal->j_head; @@ -620,7 +618,7 @@ int jbd2_journal_next_log_block(journal_t *journal, unsigned long long *retp) journal->j_free--; if (journal->j_head == journal->j_last) journal->j_head = journal->j_first; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); return jbd2_journal_bmap(journal, blocknr, retp); } @@ -840,7 +838,7 @@ static journal_t * journal_init_common (void) mutex_init(&journal->j_checkpoint_mutex); spin_lock_init(&journal->j_revoke_lock); spin_lock_init(&journal->j_list_lock); - spin_lock_init(&journal->j_state_lock); + rwlock_init(&journal->j_state_lock); journal->j_commit_interval = (HZ * JBD2_DEFAULT_MAX_COMMIT_AGE); journal->j_min_batch_time = 0; @@ -1106,14 +1104,14 @@ void jbd2_journal_update_superblock(journal_t *journal, int wait) set_buffer_uptodate(bh); } - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); jbd_debug(1,"JBD: updating superblock (start %ld, seq %d, errno %d)\n", journal->j_tail, journal->j_tail_sequence, journal->j_errno); sb->s_sequence = cpu_to_be32(journal->j_tail_sequence); sb->s_start = cpu_to_be32(journal->j_tail); sb->s_errno = cpu_to_be32(journal->j_errno); - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); BUFFER_TRACE(bh, "marking dirty"); mark_buffer_dirty(bh); @@ -1134,12 +1132,12 @@ out: * any future commit will have to be careful to update the * superblock again to re-record the true start of the log. */ - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); if (sb->s_start) journal->j_flags &= ~JBD2_FLUSHED; else journal->j_flags |= JBD2_FLUSHED; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); } /* @@ -1551,7 +1549,7 @@ int jbd2_journal_flush(journal_t *journal) transaction_t *transaction = NULL; unsigned long old_tail; - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); /* Force everything buffered to the log... */ if (journal->j_running_transaction) { @@ -1564,10 +1562,10 @@ int jbd2_journal_flush(journal_t *journal) if (transaction) { tid_t tid = transaction->t_tid; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); jbd2_log_wait_commit(journal, tid); } else { - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); } /* ...and flush everything in the log out to disk. */ @@ -1591,12 +1589,12 @@ int jbd2_journal_flush(journal_t *journal) * the magic code for a fully-recovered superblock. Any future * commits of data to the journal will restore the current * s_start value. */ - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); old_tail = journal->j_tail; journal->j_tail = 0; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); jbd2_journal_update_superblock(journal, 1); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); journal->j_tail = old_tail; J_ASSERT(!journal->j_running_transaction); @@ -1604,7 +1602,7 @@ int jbd2_journal_flush(journal_t *journal) J_ASSERT(!journal->j_checkpoint_transactions); J_ASSERT(journal->j_head == journal->j_tail); J_ASSERT(journal->j_tail_sequence == journal->j_transaction_sequence); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); return 0; } @@ -1668,12 +1666,12 @@ void __jbd2_journal_abort_hard(journal_t *journal) printk(KERN_ERR "Aborting journal on device %s.\n", journal->j_devname); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); journal->j_flags |= JBD2_ABORT; transaction = journal->j_running_transaction; if (transaction) __jbd2_log_start_commit(journal, transaction->t_tid); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); } /* Soft abort: record the abort error status in the journal superblock, @@ -1758,12 +1756,12 @@ int jbd2_journal_errno(journal_t *journal) { int err; - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); if (journal->j_flags & JBD2_ABORT) err = -EROFS; else err = journal->j_errno; - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); return err; } @@ -1778,12 +1776,12 @@ int jbd2_journal_clear_err(journal_t *journal) { int err = 0; - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); if (journal->j_flags & JBD2_ABORT) err = -EROFS; else journal->j_errno = 0; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); return err; } @@ -1796,10 +1794,10 @@ int jbd2_journal_clear_err(journal_t *journal) */ void jbd2_journal_ack_err(journal_t *journal) { - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); if (journal->j_errno) journal->j_flags |= JBD2_ACK_ERR; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); } int jbd2_journal_blocks_per_page(struct inode *inode) diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index 9c64c7e..d66d00b 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -130,18 +130,18 @@ repeat: * We need to hold j_state_lock until t_updates has been incremented, * for proper journal barrier handling */ - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); repeat_locked: if (is_journal_aborted(journal) || (journal->j_errno != 0 && !(journal->j_flags & JBD2_ACK_ERR))) { - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); kfree(new_transaction); return -EROFS; } /* Wait on the journal's transaction barrier if necessary */ if (journal->j_barrier_count) { - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); wait_event(journal->j_wait_transaction_locked, journal->j_barrier_count == 0); goto repeat; @@ -149,7 +149,7 @@ repeat_locked: if (!journal->j_running_transaction) { if (!new_transaction) { - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); goto alloc_transaction; } jbd2_get_transaction(journal, new_transaction); @@ -167,7 +167,7 @@ repeat_locked: prepare_to_wait(&journal->j_wait_transaction_locked, &wait, TASK_UNINTERRUPTIBLE); - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); schedule(); finish_wait(&journal->j_wait_transaction_locked, &wait); goto repeat; @@ -194,7 +194,7 @@ repeat_locked: prepare_to_wait(&journal->j_wait_transaction_locked, &wait, TASK_UNINTERRUPTIBLE); __jbd2_log_start_commit(journal, transaction->t_tid); - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); schedule(); finish_wait(&journal->j_wait_transaction_locked, &wait); goto repeat; @@ -250,7 +250,7 @@ repeat_locked: atomic_read(&transaction->t_outstanding_credits), __jbd2_log_space_left(journal)); spin_unlock(&transaction->t_handle_lock); - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); lock_map_acquire(&handle->h_lockdep_map); kfree(new_transaction); @@ -362,7 +362,7 @@ int jbd2_journal_extend(handle_t *handle, int nblocks) result = 1; - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); /* Don't extend a locked-down transaction! */ if (handle->h_transaction->t_state != T_RUNNING) { @@ -394,7 +394,7 @@ int jbd2_journal_extend(handle_t *handle, int nblocks) unlock: spin_unlock(&transaction->t_handle_lock); error_out: - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); out: return result; } @@ -432,7 +432,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask) J_ASSERT(atomic_read(&transaction->t_updates) > 0); J_ASSERT(journal_current_handle() == handle); - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); spin_lock(&transaction->t_handle_lock); atomic_sub(handle->h_buffer_credits, &transaction->t_outstanding_credits); @@ -442,7 +442,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask) jbd_debug(2, "restarting handle %p\n", handle); __jbd2_log_start_commit(journal, transaction->t_tid); - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); lock_map_release(&handle->h_lockdep_map); handle->h_buffer_credits = nblocks; @@ -472,7 +472,7 @@ void jbd2_journal_lock_updates(journal_t *journal) { DEFINE_WAIT(wait); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); ++journal->j_barrier_count; /* Wait until there are no running updates */ @@ -490,12 +490,12 @@ void jbd2_journal_lock_updates(journal_t *journal) prepare_to_wait(&journal->j_wait_updates, &wait, TASK_UNINTERRUPTIBLE); spin_unlock(&transaction->t_handle_lock); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); schedule(); finish_wait(&journal->j_wait_updates, &wait); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); } - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); /* * We have now established a barrier against other normal updates, but @@ -519,9 +519,9 @@ void jbd2_journal_unlock_updates (journal_t *journal) J_ASSERT(journal->j_barrier_count != 0); mutex_unlock(&journal->j_barrier); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); --journal->j_barrier_count; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); wake_up(&journal->j_wait_transaction_locked); } @@ -1314,9 +1314,9 @@ int jbd2_journal_stop(handle_t *handle) journal->j_last_sync_writer = pid; - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); commit_time = journal->j_average_commit_time; - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); trans_time = ktime_to_ns(ktime_sub(ktime_get(), transaction->t_start_time)); @@ -1748,7 +1748,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh) goto zap_buffer_unlocked; /* OK, we have data buffer in journaled mode */ - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); jbd_lock_bh_state(bh); spin_lock(&journal->j_list_lock); @@ -1801,7 +1801,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh) jbd2_journal_put_journal_head(jh); spin_unlock(&journal->j_list_lock); jbd_unlock_bh_state(bh); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); return ret; } else { /* There is no currently-running transaction. So the @@ -1815,7 +1815,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh) jbd2_journal_put_journal_head(jh); spin_unlock(&journal->j_list_lock); jbd_unlock_bh_state(bh); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); return ret; } else { /* The orphan record's transaction has @@ -1839,7 +1839,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh) jbd2_journal_put_journal_head(jh); spin_unlock(&journal->j_list_lock); jbd_unlock_bh_state(bh); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); return 0; } else { /* Good, the buffer belongs to the running transaction. @@ -1858,7 +1858,7 @@ zap_buffer: zap_buffer_no_jh: spin_unlock(&journal->j_list_lock); jbd_unlock_bh_state(bh); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); zap_buffer_unlocked: clear_buffer_dirty(bh); J_ASSERT_BH(bh, !buffer_jbddirty(bh)); @@ -2165,9 +2165,9 @@ int jbd2_journal_begin_ordered_truncate(journal_t *journal, /* Locks are here just to force reading of recent values, it is * enough that the transaction was not committing before we started * a transaction adding the inode to orphan list */ - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); commit_trans = journal->j_committing_transaction; - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); spin_lock(&journal->j_list_lock); inode_trans = jinode->i_transaction; spin_unlock(&journal->j_list_lock); diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c index 47878cf..9c1b92e 100644 --- a/fs/ocfs2/journal.c +++ b/fs/ocfs2/journal.c @@ -760,13 +760,13 @@ void ocfs2_set_journal_params(struct ocfs2_super *osb) if (osb->osb_commit_interval) commit_interval = osb->osb_commit_interval; - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); journal->j_commit_interval = commit_interval; if (osb->s_mount_opt & OCFS2_MOUNT_BARRIER) journal->j_flags |= JBD2_BARRIER; else journal->j_flags &= ~JBD2_BARRIER; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); } int ocfs2_journal_init(struct ocfs2_journal *journal, int *dirty) diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index a72ce21..15d5743 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -764,7 +764,7 @@ struct journal_s /* * Protect the various scalars in the journal */ - spinlock_t j_state_lock; + rwlock_t j_state_lock; /* * Number of processes waiting to create a barrier lock [j_state_lock] -- 1.7.0.4 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* [Ocfs2-devel] [PATCH, RFC 2/3] jbd2: Change j_state_lock to be a rwlock_t @ 2010-08-03 16:01 ` Theodore Ts'o 0 siblings, 0 replies; 22+ messages in thread From: Theodore Ts'o @ 2010-08-03 16:01 UTC (permalink / raw) To: Ext4 Developers List, ocfs2-devel Cc: John Stultz, Keith Maanthey, Eric Whitney, Theodore Ts'o Lockstat reports have shown that j_state_lock is a major source of lock contention, especially on systems with more than 4 CPU cores. So change it to be a read/write spinlock. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> --- fs/ext4/inode.c | 4 +- fs/ext4/super.c | 4 +- fs/jbd2/checkpoint.c | 16 ++++---- fs/jbd2/commit.c | 26 +++++++------- fs/jbd2/journal.c | 94 ++++++++++++++++++++++++------------------------- fs/jbd2/transaction.c | 52 +++++++++++++------------- fs/ocfs2/journal.c | 4 +- include/linux/jbd2.h | 2 +- 8 files changed, 100 insertions(+), 102 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 533b607..ab2247d 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5066,7 +5066,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) transaction_t *transaction; tid_t tid; - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); if (journal->j_running_transaction) transaction = journal->j_running_transaction; else @@ -5075,7 +5075,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) tid = transaction->t_tid; else tid = journal->j_commit_sequence; - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); ei->i_sync_tid = tid; ei->i_datasync_tid = tid; } diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 3fd65eb..81cb3fc 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -3232,7 +3232,7 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal) journal->j_min_batch_time = sbi->s_min_batch_time; journal->j_max_batch_time = sbi->s_max_batch_time; - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); if (test_opt(sb, BARRIER)) journal->j_flags |= JBD2_BARRIER; else @@ -3241,7 +3241,7 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal) journal->j_flags |= JBD2_ABORT_ON_SYNCDATA_ERR; else journal->j_flags &= ~JBD2_ABORT_ON_SYNCDATA_ERR; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); } static journal_t *ext4_get_journal(struct super_block *sb, diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c index f8cdc02..1c23a0f 100644 --- a/fs/jbd2/checkpoint.c +++ b/fs/jbd2/checkpoint.c @@ -118,13 +118,13 @@ static int __try_to_free_cp_buf(struct journal_head *jh) void __jbd2_log_wait_for_space(journal_t *journal) { int nblocks, space_left; - assert_spin_locked(&journal->j_state_lock); + /* assert_spin_locked(&journal->j_state_lock); */ nblocks = jbd_space_needed(journal); while (__jbd2_log_space_left(journal) < nblocks) { if (journal->j_flags & JBD2_ABORT) return; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); mutex_lock(&journal->j_checkpoint_mutex); /* @@ -138,7 +138,7 @@ void __jbd2_log_wait_for_space(journal_t *journal) * filesystem, so abort the journal and leave a stack * trace for forensic evidence. */ - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); spin_lock(&journal->j_list_lock); nblocks = jbd_space_needed(journal); space_left = __jbd2_log_space_left(journal); @@ -149,7 +149,7 @@ void __jbd2_log_wait_for_space(journal_t *journal) if (journal->j_committing_transaction) tid = journal->j_committing_transaction->t_tid; spin_unlock(&journal->j_list_lock); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); if (chkpt) { jbd2_log_do_checkpoint(journal); } else if (jbd2_cleanup_journal_tail(journal) == 0) { @@ -167,7 +167,7 @@ void __jbd2_log_wait_for_space(journal_t *journal) WARN_ON(1); jbd2_journal_abort(journal, 0); } - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); } else { spin_unlock(&journal->j_list_lock); } @@ -474,7 +474,7 @@ int jbd2_cleanup_journal_tail(journal_t *journal) * next transaction ID we will write, and where it will * start. */ - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); spin_lock(&journal->j_list_lock); transaction = journal->j_checkpoint_transactions; if (transaction) { @@ -496,7 +496,7 @@ int jbd2_cleanup_journal_tail(journal_t *journal) /* If the oldest pinned transaction is at the tail of the log already then there's not much we can do right now. */ if (journal->j_tail_sequence == first_tid) { - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); return 1; } @@ -516,7 +516,7 @@ int jbd2_cleanup_journal_tail(journal_t *journal) journal->j_free += freed; journal->j_tail_sequence = first_tid; journal->j_tail = blocknr; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); /* * If there is an external journal, we need to make sure that diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index fbd2c56..67bb0a2 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -152,9 +152,9 @@ static int journal_submit_commit_record(journal_t *journal, printk(KERN_WARNING "JBD2: Disabling barriers on %s, " "not supported by device\n", journal->j_devname); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); journal->j_flags &= ~JBD2_BARRIER; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); /* And try again, without the barrier */ lock_buffer(bh); @@ -182,9 +182,9 @@ retry: printk(KERN_WARNING "JBD2: %s: disabling barries on %s - not supported " "by device\n", __func__, journal->j_devname); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); journal->j_flags &= ~JBD2_BARRIER; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); lock_buffer(bh); clear_buffer_dirty(bh); @@ -400,7 +400,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) jbd_debug(1, "JBD: starting commit of transaction %d\n", commit_transaction->t_tid); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); commit_transaction->t_state = T_LOCKED; /* @@ -424,9 +424,9 @@ void jbd2_journal_commit_transaction(journal_t *journal) TASK_UNINTERRUPTIBLE); if (atomic_read(&commit_transaction->t_updates)) { spin_unlock(&commit_transaction->t_handle_lock); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); schedule(); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); spin_lock(&commit_transaction->t_handle_lock); } finish_wait(&journal->j_wait_updates, &wait); @@ -497,7 +497,7 @@ void jbd2_journal_commit_transaction(journal_t *journal) start_time = ktime_get(); commit_transaction->t_log_start = journal->j_head; wake_up(&journal->j_wait_transaction_locked); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); jbd_debug (3, "JBD: commit phase 2\n"); @@ -519,9 +519,9 @@ void jbd2_journal_commit_transaction(journal_t *journal) * transaction! Now comes the tricky part: we need to write out * metadata. Loop over the transaction's entire buffer list: */ - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); commit_transaction->t_state = T_COMMIT; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); trace_jbd2_commit_logging(journal, commit_transaction); stats.run.rs_logging = jiffies; @@ -978,7 +978,7 @@ restart_loop: * __jbd2_journal_drop_transaction(). Otherwise we could race with * other checkpointing code processing the transaction... */ - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); spin_lock(&journal->j_list_lock); /* * Now recheck if some buffers did not get attached to the transaction @@ -986,7 +986,7 @@ restart_loop: */ if (commit_transaction->t_forget) { spin_unlock(&journal->j_list_lock); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); goto restart_loop; } @@ -1038,7 +1038,7 @@ restart_loop: journal->j_average_commit_time*3) / 4; else journal->j_average_commit_time = commit_time; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); if (commit_transaction->t_checkpoint_list == NULL && commit_transaction->t_checkpoint_io_list == NULL) { diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c index a79d334..e7bf0fd 100644 --- a/fs/jbd2/journal.c +++ b/fs/jbd2/journal.c @@ -142,7 +142,7 @@ static int kjournald2(void *arg) /* * And now, wait forever for commit wakeup events. */ - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); loop: if (journal->j_flags & JBD2_UNMOUNT) @@ -153,10 +153,10 @@ loop: if (journal->j_commit_sequence != journal->j_commit_request) { jbd_debug(1, "OK, requests differ\n"); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); del_timer_sync(&journal->j_commit_timer); jbd2_journal_commit_transaction(journal); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); goto loop; } @@ -168,9 +168,9 @@ loop: * be already stopped. */ jbd_debug(1, "Now suspending kjournald2\n"); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); refrigerator(); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); } else { /* * We assume on resume that commits are already there, @@ -190,9 +190,9 @@ loop: if (journal->j_flags & JBD2_UNMOUNT) should_sleep = 0; if (should_sleep) { - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); schedule(); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); } finish_wait(&journal->j_wait_commit, &wait); } @@ -210,7 +210,7 @@ loop: goto loop; end_loop: - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); del_timer_sync(&journal->j_commit_timer); journal->j_task = NULL; wake_up(&journal->j_wait_done_commit); @@ -233,16 +233,16 @@ static int jbd2_journal_start_thread(journal_t *journal) static void journal_kill_thread(journal_t *journal) { - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); journal->j_flags |= JBD2_UNMOUNT; while (journal->j_task) { wake_up(&journal->j_wait_commit); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); wait_event(journal->j_wait_done_commit, journal->j_task == NULL); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); } - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); } /* @@ -452,7 +452,7 @@ int __jbd2_log_space_left(journal_t *journal) { int left = journal->j_free; - assert_spin_locked(&journal->j_state_lock); + /* assert_spin_locked(&journal->j_state_lock); */ /* * Be pessimistic here about the number of those free blocks which @@ -497,9 +497,9 @@ int jbd2_log_start_commit(journal_t *journal, tid_t tid) { int ret; - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); ret = __jbd2_log_start_commit(journal, tid); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); return ret; } @@ -518,7 +518,7 @@ int jbd2_journal_force_commit_nested(journal_t *journal) transaction_t *transaction = NULL; tid_t tid; - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); if (journal->j_running_transaction && !current->journal_info) { transaction = journal->j_running_transaction; __jbd2_log_start_commit(journal, transaction->t_tid); @@ -526,12 +526,12 @@ int jbd2_journal_force_commit_nested(journal_t *journal) transaction = journal->j_committing_transaction; if (!transaction) { - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); return 0; /* Nothing to retry */ } tid = transaction->t_tid; - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); jbd2_log_wait_commit(journal, tid); return 1; } @@ -545,7 +545,7 @@ int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid) { int ret = 0; - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); if (journal->j_running_transaction) { tid_t tid = journal->j_running_transaction->t_tid; @@ -564,7 +564,7 @@ int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid) *ptid = journal->j_committing_transaction->t_tid; ret = 1; } - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); return ret; } @@ -576,26 +576,24 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid) { int err = 0; + read_lock(&journal->j_state_lock); #ifdef CONFIG_JBD2_DEBUG - spin_lock(&journal->j_state_lock); if (!tid_geq(journal->j_commit_request, tid)) { printk(KERN_EMERG "%s: error: j_commit_request=%d, tid=%d\n", __func__, journal->j_commit_request, tid); } - spin_unlock(&journal->j_state_lock); #endif - spin_lock(&journal->j_state_lock); while (tid_gt(tid, journal->j_commit_sequence)) { jbd_debug(1, "JBD: want %d, j_commit_sequence=%d\n", tid, journal->j_commit_sequence); wake_up(&journal->j_wait_commit); - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); wait_event(journal->j_wait_done_commit, !tid_gt(tid, journal->j_commit_sequence)); - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); } - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); if (unlikely(is_journal_aborted(journal))) { printk(KERN_EMERG "journal commit I/O error\n"); @@ -612,7 +610,7 @@ int jbd2_journal_next_log_block(journal_t *journal, unsigned long long *retp) { unsigned long blocknr; - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); J_ASSERT(journal->j_free > 1); blocknr = journal->j_head; @@ -620,7 +618,7 @@ int jbd2_journal_next_log_block(journal_t *journal, unsigned long long *retp) journal->j_free--; if (journal->j_head == journal->j_last) journal->j_head = journal->j_first; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); return jbd2_journal_bmap(journal, blocknr, retp); } @@ -840,7 +838,7 @@ static journal_t * journal_init_common (void) mutex_init(&journal->j_checkpoint_mutex); spin_lock_init(&journal->j_revoke_lock); spin_lock_init(&journal->j_list_lock); - spin_lock_init(&journal->j_state_lock); + rwlock_init(&journal->j_state_lock); journal->j_commit_interval = (HZ * JBD2_DEFAULT_MAX_COMMIT_AGE); journal->j_min_batch_time = 0; @@ -1106,14 +1104,14 @@ void jbd2_journal_update_superblock(journal_t *journal, int wait) set_buffer_uptodate(bh); } - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); jbd_debug(1,"JBD: updating superblock (start %ld, seq %d, errno %d)\n", journal->j_tail, journal->j_tail_sequence, journal->j_errno); sb->s_sequence = cpu_to_be32(journal->j_tail_sequence); sb->s_start = cpu_to_be32(journal->j_tail); sb->s_errno = cpu_to_be32(journal->j_errno); - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); BUFFER_TRACE(bh, "marking dirty"); mark_buffer_dirty(bh); @@ -1134,12 +1132,12 @@ out: * any future commit will have to be careful to update the * superblock again to re-record the true start of the log. */ - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); if (sb->s_start) journal->j_flags &= ~JBD2_FLUSHED; else journal->j_flags |= JBD2_FLUSHED; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); } /* @@ -1551,7 +1549,7 @@ int jbd2_journal_flush(journal_t *journal) transaction_t *transaction = NULL; unsigned long old_tail; - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); /* Force everything buffered to the log... */ if (journal->j_running_transaction) { @@ -1564,10 +1562,10 @@ int jbd2_journal_flush(journal_t *journal) if (transaction) { tid_t tid = transaction->t_tid; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); jbd2_log_wait_commit(journal, tid); } else { - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); } /* ...and flush everything in the log out to disk. */ @@ -1591,12 +1589,12 @@ int jbd2_journal_flush(journal_t *journal) * the magic code for a fully-recovered superblock. Any future * commits of data to the journal will restore the current * s_start value. */ - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); old_tail = journal->j_tail; journal->j_tail = 0; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); jbd2_journal_update_superblock(journal, 1); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); journal->j_tail = old_tail; J_ASSERT(!journal->j_running_transaction); @@ -1604,7 +1602,7 @@ int jbd2_journal_flush(journal_t *journal) J_ASSERT(!journal->j_checkpoint_transactions); J_ASSERT(journal->j_head == journal->j_tail); J_ASSERT(journal->j_tail_sequence == journal->j_transaction_sequence); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); return 0; } @@ -1668,12 +1666,12 @@ void __jbd2_journal_abort_hard(journal_t *journal) printk(KERN_ERR "Aborting journal on device %s.\n", journal->j_devname); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); journal->j_flags |= JBD2_ABORT; transaction = journal->j_running_transaction; if (transaction) __jbd2_log_start_commit(journal, transaction->t_tid); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); } /* Soft abort: record the abort error status in the journal superblock, @@ -1758,12 +1756,12 @@ int jbd2_journal_errno(journal_t *journal) { int err; - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); if (journal->j_flags & JBD2_ABORT) err = -EROFS; else err = journal->j_errno; - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); return err; } @@ -1778,12 +1776,12 @@ int jbd2_journal_clear_err(journal_t *journal) { int err = 0; - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); if (journal->j_flags & JBD2_ABORT) err = -EROFS; else journal->j_errno = 0; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); return err; } @@ -1796,10 +1794,10 @@ int jbd2_journal_clear_err(journal_t *journal) */ void jbd2_journal_ack_err(journal_t *journal) { - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); if (journal->j_errno) journal->j_flags |= JBD2_ACK_ERR; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); } int jbd2_journal_blocks_per_page(struct inode *inode) diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index 9c64c7e..d66d00b 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -130,18 +130,18 @@ repeat: * We need to hold j_state_lock until t_updates has been incremented, * for proper journal barrier handling */ - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); repeat_locked: if (is_journal_aborted(journal) || (journal->j_errno != 0 && !(journal->j_flags & JBD2_ACK_ERR))) { - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); kfree(new_transaction); return -EROFS; } /* Wait on the journal's transaction barrier if necessary */ if (journal->j_barrier_count) { - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); wait_event(journal->j_wait_transaction_locked, journal->j_barrier_count == 0); goto repeat; @@ -149,7 +149,7 @@ repeat_locked: if (!journal->j_running_transaction) { if (!new_transaction) { - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); goto alloc_transaction; } jbd2_get_transaction(journal, new_transaction); @@ -167,7 +167,7 @@ repeat_locked: prepare_to_wait(&journal->j_wait_transaction_locked, &wait, TASK_UNINTERRUPTIBLE); - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); schedule(); finish_wait(&journal->j_wait_transaction_locked, &wait); goto repeat; @@ -194,7 +194,7 @@ repeat_locked: prepare_to_wait(&journal->j_wait_transaction_locked, &wait, TASK_UNINTERRUPTIBLE); __jbd2_log_start_commit(journal, transaction->t_tid); - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); schedule(); finish_wait(&journal->j_wait_transaction_locked, &wait); goto repeat; @@ -250,7 +250,7 @@ repeat_locked: atomic_read(&transaction->t_outstanding_credits), __jbd2_log_space_left(journal)); spin_unlock(&transaction->t_handle_lock); - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); lock_map_acquire(&handle->h_lockdep_map); kfree(new_transaction); @@ -362,7 +362,7 @@ int jbd2_journal_extend(handle_t *handle, int nblocks) result = 1; - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); /* Don't extend a locked-down transaction! */ if (handle->h_transaction->t_state != T_RUNNING) { @@ -394,7 +394,7 @@ int jbd2_journal_extend(handle_t *handle, int nblocks) unlock: spin_unlock(&transaction->t_handle_lock); error_out: - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); out: return result; } @@ -432,7 +432,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask) J_ASSERT(atomic_read(&transaction->t_updates) > 0); J_ASSERT(journal_current_handle() == handle); - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); spin_lock(&transaction->t_handle_lock); atomic_sub(handle->h_buffer_credits, &transaction->t_outstanding_credits); @@ -442,7 +442,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask) jbd_debug(2, "restarting handle %p\n", handle); __jbd2_log_start_commit(journal, transaction->t_tid); - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); lock_map_release(&handle->h_lockdep_map); handle->h_buffer_credits = nblocks; @@ -472,7 +472,7 @@ void jbd2_journal_lock_updates(journal_t *journal) { DEFINE_WAIT(wait); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); ++journal->j_barrier_count; /* Wait until there are no running updates */ @@ -490,12 +490,12 @@ void jbd2_journal_lock_updates(journal_t *journal) prepare_to_wait(&journal->j_wait_updates, &wait, TASK_UNINTERRUPTIBLE); spin_unlock(&transaction->t_handle_lock); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); schedule(); finish_wait(&journal->j_wait_updates, &wait); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); } - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); /* * We have now established a barrier against other normal updates, but @@ -519,9 +519,9 @@ void jbd2_journal_unlock_updates (journal_t *journal) J_ASSERT(journal->j_barrier_count != 0); mutex_unlock(&journal->j_barrier); - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); --journal->j_barrier_count; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); wake_up(&journal->j_wait_transaction_locked); } @@ -1314,9 +1314,9 @@ int jbd2_journal_stop(handle_t *handle) journal->j_last_sync_writer = pid; - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); commit_time = journal->j_average_commit_time; - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); trans_time = ktime_to_ns(ktime_sub(ktime_get(), transaction->t_start_time)); @@ -1748,7 +1748,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh) goto zap_buffer_unlocked; /* OK, we have data buffer in journaled mode */ - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); jbd_lock_bh_state(bh); spin_lock(&journal->j_list_lock); @@ -1801,7 +1801,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh) jbd2_journal_put_journal_head(jh); spin_unlock(&journal->j_list_lock); jbd_unlock_bh_state(bh); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); return ret; } else { /* There is no currently-running transaction. So the @@ -1815,7 +1815,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh) jbd2_journal_put_journal_head(jh); spin_unlock(&journal->j_list_lock); jbd_unlock_bh_state(bh); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); return ret; } else { /* The orphan record's transaction has @@ -1839,7 +1839,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh) jbd2_journal_put_journal_head(jh); spin_unlock(&journal->j_list_lock); jbd_unlock_bh_state(bh); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); return 0; } else { /* Good, the buffer belongs to the running transaction. @@ -1858,7 +1858,7 @@ zap_buffer: zap_buffer_no_jh: spin_unlock(&journal->j_list_lock); jbd_unlock_bh_state(bh); - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); zap_buffer_unlocked: clear_buffer_dirty(bh); J_ASSERT_BH(bh, !buffer_jbddirty(bh)); @@ -2165,9 +2165,9 @@ int jbd2_journal_begin_ordered_truncate(journal_t *journal, /* Locks are here just to force reading of recent values, it is * enough that the transaction was not committing before we started * a transaction adding the inode to orphan list */ - spin_lock(&journal->j_state_lock); + read_lock(&journal->j_state_lock); commit_trans = journal->j_committing_transaction; - spin_unlock(&journal->j_state_lock); + read_unlock(&journal->j_state_lock); spin_lock(&journal->j_list_lock); inode_trans = jinode->i_transaction; spin_unlock(&journal->j_list_lock); diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c index 47878cf..9c1b92e 100644 --- a/fs/ocfs2/journal.c +++ b/fs/ocfs2/journal.c @@ -760,13 +760,13 @@ void ocfs2_set_journal_params(struct ocfs2_super *osb) if (osb->osb_commit_interval) commit_interval = osb->osb_commit_interval; - spin_lock(&journal->j_state_lock); + write_lock(&journal->j_state_lock); journal->j_commit_interval = commit_interval; if (osb->s_mount_opt & OCFS2_MOUNT_BARRIER) journal->j_flags |= JBD2_BARRIER; else journal->j_flags &= ~JBD2_BARRIER; - spin_unlock(&journal->j_state_lock); + write_unlock(&journal->j_state_lock); } int ocfs2_journal_init(struct ocfs2_journal *journal, int *dirty) diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index a72ce21..15d5743 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -764,7 +764,7 @@ struct journal_s /* * Protect the various scalars in the journal */ - spinlock_t j_state_lock; + rwlock_t j_state_lock; /* * Number of processes waiting to create a barrier lock [j_state_lock] -- 1.7.0.4 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC 2/3] jbd2: Change j_state_lock to be a rwlock_t 2010-08-03 16:01 ` [Ocfs2-devel] " Theodore Ts'o @ 2010-08-04 0:08 ` Ted Ts'o -1 siblings, 0 replies; 22+ messages in thread From: Ted Ts'o @ 2010-08-04 0:08 UTC (permalink / raw) To: Ext4 Developers List, ocfs2-devel Cc: John Stultz, Keith Maanthey, Eric Whitney On Tue, Aug 03, 2010 at 12:01:54PM -0400, Theodore Ts'o wrote: > Lockstat reports have shown that j_state_lock is a major source of > lock contention, especially on systems with more than 4 CPU cores. So > change it to be a read/write spinlock. > > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Oops, this patch can result in a BUG_ON in fs/jbd2/transaction.c. I'm currently testing a fix/replacement patch, but I figured I send a quick warning to folks not to waste time trying this out just yet. I should hopefully have a new, better-tested patch in by tomorrow morning. - Ted ^ permalink raw reply [flat|nested] 22+ messages in thread
* [Ocfs2-devel] [PATCH, RFC 2/3] jbd2: Change j_state_lock to be a rwlock_t @ 2010-08-04 0:08 ` Ted Ts'o 0 siblings, 0 replies; 22+ messages in thread From: Ted Ts'o @ 2010-08-04 0:08 UTC (permalink / raw) To: Ext4 Developers List, ocfs2-devel Cc: John Stultz, Keith Maanthey, Eric Whitney On Tue, Aug 03, 2010 at 12:01:54PM -0400, Theodore Ts'o wrote: > Lockstat reports have shown that j_state_lock is a major source of > lock contention, especially on systems with more than 4 CPU cores. So > change it to be a read/write spinlock. > > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Oops, this patch can result in a BUG_ON in fs/jbd2/transaction.c. I'm currently testing a fix/replacement patch, but I figured I send a quick warning to folks not to waste time trying this out just yet. I should hopefully have a new, better-tested patch in by tomorrow morning. - Ted ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH, RFC 3/3] jbd2: Remove t_handle_lock from start_this_handle() 2010-08-03 16:01 ` [Ocfs2-devel] " Theodore Ts'o @ 2010-08-03 16:01 ` Theodore Ts'o -1 siblings, 0 replies; 22+ messages in thread From: Theodore Ts'o @ 2010-08-03 16:01 UTC (permalink / raw) To: Ext4 Developers List, ocfs2-devel Cc: John Stultz, Keith Maanthey, Eric Whitney, Theodore Ts'o This should remove the last exclusive lock from start_this_handle(), so that we should now be able to start multiple transactions at the same time on large SMP systems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> --- fs/jbd2/commit.c | 3 ++- fs/jbd2/transaction.c | 32 ++++++++++++++++++++++---------- include/linux/jbd2.h | 2 +- 3 files changed, 25 insertions(+), 12 deletions(-) diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index 67bb0a2..f52e5e8 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -1004,7 +1004,8 @@ restart_loop: * File the transaction statistics */ stats.ts_tid = commit_transaction->t_tid; - stats.run.rs_handle_count = commit_transaction->t_handle_count; + stats.run.rs_handle_count = + atomic_read(&commit_transaction->t_handle_count); trace_jbd2_run_stats(journal->j_fs_dev->bd_dev, commit_transaction->t_tid, &stats.run); diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index d66d00b..5c83a91 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -57,6 +57,7 @@ jbd2_get_transaction(journal_t *journal, transaction_t *transaction) spin_lock_init(&transaction->t_handle_lock); atomic_set(&transaction->t_updates, 0); atomic_set(&transaction->t_outstanding_credits, 0); + atomic_set(&transaction->t_handle_count, 0); INIT_LIST_HEAD(&transaction->t_inode_list); INIT_LIST_HEAD(&transaction->t_private_list); @@ -178,8 +179,8 @@ repeat_locked: * buffers requested by this operation, we need to stall pending a log * checkpoint to free some more log space. */ - spin_lock(&transaction->t_handle_lock); - needed = atomic_read(&transaction->t_outstanding_credits) + nblocks; + needed = atomic_add_return(nblocks, + &transaction->t_outstanding_credits); if (needed > journal->j_max_transaction_buffers) { /* @@ -190,7 +191,7 @@ repeat_locked: DEFINE_WAIT(wait); jbd_debug(2, "Handle %p starting new commit...\n", handle); - spin_unlock(&transaction->t_handle_lock); + atomic_sub(nblocks, &transaction->t_outstanding_credits); prepare_to_wait(&journal->j_wait_transaction_locked, &wait, TASK_UNINTERRUPTIBLE); __jbd2_log_start_commit(journal, transaction->t_tid); @@ -227,29 +228,40 @@ repeat_locked: */ if (__jbd2_log_space_left(journal) < jbd_space_needed(journal)) { jbd_debug(2, "Handle %p waiting for checkpoint...\n", handle); - spin_unlock(&transaction->t_handle_lock); + atomic_sub(nblocks, &transaction->t_outstanding_credits); __jbd2_log_wait_for_space(journal); goto repeat_locked; } /* OK, account for the buffers that this operation expects to - * use and add the handle to the running transaction. */ - - if (time_after(transaction->t_start, ts)) { + * use and add the handle to the running transaction. + * + * In order for t_max_wait to be reliable, it must be + * protected by a lock. But doing so will mean that + * start_this_handle() can not be run in parallel on SMP + * systems, which limits our scalability. So we only enable + * it when debugging is enabled. We may want to use a + * separate flag, eventually, so we can enable this + * independently of debugging. + */ +#ifdef CONFIG_JBD2_DEBUG + if (jbd2_journal_enable_debug && + time_after(transaction->t_start, ts)) { ts = jbd2_time_diff(ts, transaction->t_start); + spin_lock(&transaction->t_handle_lock); if (ts > transaction->t_max_wait) transaction->t_max_wait = ts; + spin_unlock(&transaction->t_handle_lock); } +#endif handle->h_transaction = transaction; - atomic_add(nblocks, &transaction->t_outstanding_credits); atomic_inc(&transaction->t_updates); - transaction->t_handle_count++; + atomic_inc(&transaction->t_handle_count); jbd_debug(4, "Handle %p given %d credits (total %d, free %d)\n", handle, nblocks, atomic_read(&transaction->t_outstanding_credits), __jbd2_log_space_left(journal)); - spin_unlock(&transaction->t_handle_lock); read_unlock(&journal->j_state_lock); lock_map_acquire(&handle->h_lockdep_map); diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index 15d5743..01743b5 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -629,7 +629,7 @@ struct transaction_s /* * How many handles used this transaction? [t_handle_lock] */ - int t_handle_count; + atomic_t t_handle_count; /* * This transaction is being forced and some process is -- 1.7.0.4 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* [Ocfs2-devel] [PATCH, RFC 3/3] jbd2: Remove t_handle_lock from start_this_handle() @ 2010-08-03 16:01 ` Theodore Ts'o 0 siblings, 0 replies; 22+ messages in thread From: Theodore Ts'o @ 2010-08-03 16:01 UTC (permalink / raw) To: Ext4 Developers List, ocfs2-devel Cc: John Stultz, Keith Maanthey, Eric Whitney, Theodore Ts'o This should remove the last exclusive lock from start_this_handle(), so that we should now be able to start multiple transactions at the same time on large SMP systems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> --- fs/jbd2/commit.c | 3 ++- fs/jbd2/transaction.c | 32 ++++++++++++++++++++++---------- include/linux/jbd2.h | 2 +- 3 files changed, 25 insertions(+), 12 deletions(-) diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index 67bb0a2..f52e5e8 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -1004,7 +1004,8 @@ restart_loop: * File the transaction statistics */ stats.ts_tid = commit_transaction->t_tid; - stats.run.rs_handle_count = commit_transaction->t_handle_count; + stats.run.rs_handle_count = + atomic_read(&commit_transaction->t_handle_count); trace_jbd2_run_stats(journal->j_fs_dev->bd_dev, commit_transaction->t_tid, &stats.run); diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index d66d00b..5c83a91 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -57,6 +57,7 @@ jbd2_get_transaction(journal_t *journal, transaction_t *transaction) spin_lock_init(&transaction->t_handle_lock); atomic_set(&transaction->t_updates, 0); atomic_set(&transaction->t_outstanding_credits, 0); + atomic_set(&transaction->t_handle_count, 0); INIT_LIST_HEAD(&transaction->t_inode_list); INIT_LIST_HEAD(&transaction->t_private_list); @@ -178,8 +179,8 @@ repeat_locked: * buffers requested by this operation, we need to stall pending a log * checkpoint to free some more log space. */ - spin_lock(&transaction->t_handle_lock); - needed = atomic_read(&transaction->t_outstanding_credits) + nblocks; + needed = atomic_add_return(nblocks, + &transaction->t_outstanding_credits); if (needed > journal->j_max_transaction_buffers) { /* @@ -190,7 +191,7 @@ repeat_locked: DEFINE_WAIT(wait); jbd_debug(2, "Handle %p starting new commit...\n", handle); - spin_unlock(&transaction->t_handle_lock); + atomic_sub(nblocks, &transaction->t_outstanding_credits); prepare_to_wait(&journal->j_wait_transaction_locked, &wait, TASK_UNINTERRUPTIBLE); __jbd2_log_start_commit(journal, transaction->t_tid); @@ -227,29 +228,40 @@ repeat_locked: */ if (__jbd2_log_space_left(journal) < jbd_space_needed(journal)) { jbd_debug(2, "Handle %p waiting for checkpoint...\n", handle); - spin_unlock(&transaction->t_handle_lock); + atomic_sub(nblocks, &transaction->t_outstanding_credits); __jbd2_log_wait_for_space(journal); goto repeat_locked; } /* OK, account for the buffers that this operation expects to - * use and add the handle to the running transaction. */ - - if (time_after(transaction->t_start, ts)) { + * use and add the handle to the running transaction. + * + * In order for t_max_wait to be reliable, it must be + * protected by a lock. But doing so will mean that + * start_this_handle() can not be run in parallel on SMP + * systems, which limits our scalability. So we only enable + * it when debugging is enabled. We may want to use a + * separate flag, eventually, so we can enable this + * independently of debugging. + */ +#ifdef CONFIG_JBD2_DEBUG + if (jbd2_journal_enable_debug && + time_after(transaction->t_start, ts)) { ts = jbd2_time_diff(ts, transaction->t_start); + spin_lock(&transaction->t_handle_lock); if (ts > transaction->t_max_wait) transaction->t_max_wait = ts; + spin_unlock(&transaction->t_handle_lock); } +#endif handle->h_transaction = transaction; - atomic_add(nblocks, &transaction->t_outstanding_credits); atomic_inc(&transaction->t_updates); - transaction->t_handle_count++; + atomic_inc(&transaction->t_handle_count); jbd_debug(4, "Handle %p given %d credits (total %d, free %d)\n", handle, nblocks, atomic_read(&transaction->t_outstanding_credits), __jbd2_log_space_left(journal)); - spin_unlock(&transaction->t_handle_lock); read_unlock(&journal->j_state_lock); lock_map_acquire(&handle->h_lockdep_map); diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index 15d5743..01743b5 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -629,7 +629,7 @@ struct transaction_s /* * How many handles used this transaction? [t_handle_lock] */ - int t_handle_count; + atomic_t t_handle_count; /* * This transaction is being forced and some process is -- 1.7.0.4 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [Ocfs2-devel] [PATCH, RFC 0/3] *** SUBJECT HERE *** 2010-08-03 16:01 ` [Ocfs2-devel] " Theodore Ts'o @ 2010-08-03 19:07 ` Joel Becker -1 siblings, 0 replies; 22+ messages in thread From: Joel Becker @ 2010-08-03 19:07 UTC (permalink / raw) To: Theodore Ts'o Cc: Ext4 Developers List, ocfs2-devel, Keith Maanthey, John Stultz, Eric Whitney On Tue, Aug 03, 2010 at 12:01:52PM -0400, Theodore Ts'o wrote: > The first patch in this patch series hasn't changed since when I had > last posted it, but I'm including it again for the benefit of the folks > on ocfs2-dev. Thanks! > Thanks to some work done by Eric Whitney, when he accidentally ran the > command "mkfs.ext4 -t xfs", and created a ext4 file system without a > journal, it appears that main scalability bottleneck for ext4 is in the > jbd2 layer. I'm certain, as you've surmised, that a lot of this affects ocfs2 as well. jbd2 improvements make both filesystems better. > This patch series removes all exclusive spinlocks when starting and > stopping jbd2 handles, which should improve things even more. Since > OCFS2 uses the jbd2 layer, and the second patch in this patch series > touches ocfs2 a wee bit, I'd appreciate it if you could take a look and > let me know what you think. Hopefully, this should also improve OCFS2's > scalability. The atomic changes make absolute sense. Ack on them. I had two reactions to the rwlock: first, a lot of your rwlock changes are on the write_lock() side. You get journal start/stop parallelized, but what about all the underlying access/dirty/commit paths? Second, rwlocks are known to behave worse than spinlocks when they ping the cache line across CPUs. That said, I have a hunch that you've tested both of the above concerns. You mention 48 core systems, and clearly if cachelines were going to be a problem, you would have noticed. So if the rwlock changes are faster on 48 core than the spinlocks, I say ack ack ack. From the ocfs2 perspective, the code is perfectly safe, and any speedup you see on ext4 ought to be mirrored on ocfs2. Joel -- A good programming language should have features that make the kind of people who use the phrase "software engineering" shake their heads disapprovingly. - Paul Graham Joel Becker Consulting Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 22+ messages in thread
* [Ocfs2-devel] [PATCH, RFC 0/3] *** SUBJECT HERE *** @ 2010-08-03 19:07 ` Joel Becker 0 siblings, 0 replies; 22+ messages in thread From: Joel Becker @ 2010-08-03 19:07 UTC (permalink / raw) To: Theodore Ts'o Cc: Ext4 Developers List, ocfs2-devel, Keith Maanthey, John Stultz, Eric Whitney On Tue, Aug 03, 2010 at 12:01:52PM -0400, Theodore Ts'o wrote: > The first patch in this patch series hasn't changed since when I had > last posted it, but I'm including it again for the benefit of the folks > on ocfs2-dev. Thanks! > Thanks to some work done by Eric Whitney, when he accidentally ran the > command "mkfs.ext4 -t xfs", and created a ext4 file system without a > journal, it appears that main scalability bottleneck for ext4 is in the > jbd2 layer. I'm certain, as you've surmised, that a lot of this affects ocfs2 as well. jbd2 improvements make both filesystems better. > This patch series removes all exclusive spinlocks when starting and > stopping jbd2 handles, which should improve things even more. Since > OCFS2 uses the jbd2 layer, and the second patch in this patch series > touches ocfs2 a wee bit, I'd appreciate it if you could take a look and > let me know what you think. Hopefully, this should also improve OCFS2's > scalability. The atomic changes make absolute sense. Ack on them. I had two reactions to the rwlock: first, a lot of your rwlock changes are on the write_lock() side. You get journal start/stop parallelized, but what about all the underlying access/dirty/commit paths? Second, rwlocks are known to behave worse than spinlocks when they ping the cache line across CPUs. That said, I have a hunch that you've tested both of the above concerns. You mention 48 core systems, and clearly if cachelines were going to be a problem, you would have noticed. So if the rwlock changes are faster on 48 core than the spinlocks, I say ack ack ack. From the ocfs2 perspective, the code is perfectly safe, and any speedup you see on ext4 ought to be mirrored on ocfs2. Joel -- A good programming language should have features that make the kind of people who use the phrase "software engineering" shake their heads disapprovingly. - Paul Graham Joel Becker Consulting Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Ocfs2-devel] [PATCH, RFC 0/3] *** SUBJECT HERE *** 2010-08-03 19:07 ` Joel Becker @ 2010-08-03 20:07 ` Ted Ts'o -1 siblings, 0 replies; 22+ messages in thread From: Ted Ts'o @ 2010-08-03 20:07 UTC (permalink / raw) To: Ext4 Developers List, ocfs2-devel, Keith Maanthey, John Stultz, Eric Whitney On Tue, Aug 03, 2010 at 12:07:03PM -0700, Joel Becker wrote: > > The atomic changes make absolute sense. Ack on them. I had two > reactions to the rwlock: first, a lot of your rwlock changes are on > the write_lock() side. You get journal start/stop parallelized, but > what about all the underlying access/dirty/commit paths? Second, > rwlocks are known to behave worse than spinlocks when they ping the > cache line across CPUs. > That said, I have a hunch that you've tested both of the above > concerns. You mention 48 core systems, and clearly if cachelines were > going to be a problem, you would have noticed. So if the rwlock changes > are faster on 48 core than the spinlocks, I say ack ack ack. We don't have the results from the 48-core machine yet. I was going to try to get measurements from the 48-core machine I have access to at $WORK, but it doesn't have enough hard drive spindles on it. :-( But yes, I am worried about the cache-line bounce issue, and I'm hoping that we'll get some input from people who can run some measurements on an 8-core and 48-core machine. I haven't worried about the commit paths yet because they haven't shown up as being significant on any of the lockstat reports. Remember that with jbd2, the commit code only runs on in kjournald, and in general only once every 5 seconds or for every fsync. In contrast, essentially every single file system syscall that modifies the filesystem is going to end up calling start_this_handle(). So if you have multiple threads all creating files, or writing to files, or even just changing the mtime or permissions, it's going to call start_this_handle(), so we're seeing nearly all of the contention on start_this_handle() and to a lesser extent, jbd2_journal_stop(), the function which retires a handle. Things would probably be different on a workload that tries to simulate a mail transfer agent or a database which is _not_ using O_DIRECT on a preallocated table space file, since there will be many more fsync() calls and thus much more pressure on the commit code. But I didn't want to do any premature optimization until we see how bad it actually gets in those cases. If you are set up to do some performance measurements on OCFS2, I'd appreciate if you could give it a try and let me know how the patches fare on OCFS2. Thanks, - Ted ^ permalink raw reply [flat|nested] 22+ messages in thread
* [Ocfs2-devel] [PATCH, RFC 0/3] *** SUBJECT HERE *** @ 2010-08-03 20:07 ` Ted Ts'o 0 siblings, 0 replies; 22+ messages in thread From: Ted Ts'o @ 2010-08-03 20:07 UTC (permalink / raw) To: Ext4 Developers List, ocfs2-devel, Keith Maanthey, John Stultz, Eric Whitney On Tue, Aug 03, 2010 at 12:07:03PM -0700, Joel Becker wrote: > > The atomic changes make absolute sense. Ack on them. I had two > reactions to the rwlock: first, a lot of your rwlock changes are on > the write_lock() side. You get journal start/stop parallelized, but > what about all the underlying access/dirty/commit paths? Second, > rwlocks are known to behave worse than spinlocks when they ping the > cache line across CPUs. > That said, I have a hunch that you've tested both of the above > concerns. You mention 48 core systems, and clearly if cachelines were > going to be a problem, you would have noticed. So if the rwlock changes > are faster on 48 core than the spinlocks, I say ack ack ack. We don't have the results from the 48-core machine yet. I was going to try to get measurements from the 48-core machine I have access to at $WORK, but it doesn't have enough hard drive spindles on it. :-( But yes, I am worried about the cache-line bounce issue, and I'm hoping that we'll get some input from people who can run some measurements on an 8-core and 48-core machine. I haven't worried about the commit paths yet because they haven't shown up as being significant on any of the lockstat reports. Remember that with jbd2, the commit code only runs on in kjournald, and in general only once every 5 seconds or for every fsync. In contrast, essentially every single file system syscall that modifies the filesystem is going to end up calling start_this_handle(). So if you have multiple threads all creating files, or writing to files, or even just changing the mtime or permissions, it's going to call start_this_handle(), so we're seeing nearly all of the contention on start_this_handle() and to a lesser extent, jbd2_journal_stop(), the function which retires a handle. Things would probably be different on a workload that tries to simulate a mail transfer agent or a database which is _not_ using O_DIRECT on a preallocated table space file, since there will be many more fsync() calls and thus much more pressure on the commit code. But I didn't want to do any premature optimization until we see how bad it actually gets in those cases. If you are set up to do some performance measurements on OCFS2, I'd appreciate if you could give it a try and let me know how the patches fare on OCFS2. Thanks, - Ted ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Ocfs2-devel] [PATCH, RFC 0/3] *** SUBJECT HERE *** 2010-08-03 20:07 ` Ted Ts'o @ 2010-08-03 21:19 ` Joel Becker -1 siblings, 0 replies; 22+ messages in thread From: Joel Becker @ 2010-08-03 21:19 UTC (permalink / raw) To: Ted Ts'o Cc: Ext4 Developers List, ocfs2-devel, Keith Maanthey, John Stultz, Eric Whitney On Tue, Aug 03, 2010 at 04:07:55PM -0400, Ted Ts'o wrote: > If you are set up to do some performance measurements on OCFS2, I'd > appreciate if you could give it a try and let me know how the patches > fare on OCFS2. We're lining up a 16-way box here. Do you have a happy dbench command line or similar that you would like to see run? Joel -- You can use a screwdriver to screw in screws or to clean your ears, however, the latter needs real skill, determination and a lack of fear of injuring yourself. It is much the same with JavaScript. - Chris Heilmann Joel Becker Consulting Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 22+ messages in thread
* [Ocfs2-devel] [PATCH, RFC 0/3] *** SUBJECT HERE *** @ 2010-08-03 21:19 ` Joel Becker 0 siblings, 0 replies; 22+ messages in thread From: Joel Becker @ 2010-08-03 21:19 UTC (permalink / raw) To: Ted Ts'o Cc: Ext4 Developers List, ocfs2-devel, Keith Maanthey, John Stultz, Eric Whitney On Tue, Aug 03, 2010 at 04:07:55PM -0400, Ted Ts'o wrote: > If you are set up to do some performance measurements on OCFS2, I'd > appreciate if you could give it a try and let me know how the patches > fare on OCFS2. We're lining up a 16-way box here. Do you have a happy dbench command line or similar that you would like to see run? Joel -- You can use a screwdriver to screw in screws or to clean your ears, however, the latter needs real skill, determination and a lack of fear of injuring yourself. It is much the same with JavaScript. - Chris Heilmann Joel Becker Consulting Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Ocfs2-devel] [PATCH, RFC 0/3] *** SUBJECT HERE *** 2010-08-03 21:19 ` Joel Becker @ 2010-08-03 22:57 ` Ted Ts'o -1 siblings, 0 replies; 22+ messages in thread From: Ted Ts'o @ 2010-08-03 22:57 UTC (permalink / raw) To: Ext4 Developers List, ocfs2-devel, Keith Maanthey, John Stultz, Eric Whitney On Tue, Aug 03, 2010 at 02:19:16PM -0700, Joel Becker wrote: > On Tue, Aug 03, 2010 at 04:07:55PM -0400, Ted Ts'o wrote: > > If you are set up to do some performance measurements on OCFS2, I'd > > appreciate if you could give it a try and let me know how the patches > > fare on OCFS2. > > We're lining up a 16-way box here. Do you have a happy dbench > command line or similar that you would like to see run? The problem with dbench is that the core VFS can end up being the bottleneck. OTOH, it's very easy to run. What would be great if you could try using the Flexible File System Benchmark (FFSB), which is what Eric and Steve used: http://free.linux.hp.com/~enw/ext4/2.6.34/ http://btrfs.boxacle.net/ It's a bit more work to set it up, though. - Ted ^ permalink raw reply [flat|nested] 22+ messages in thread
* [Ocfs2-devel] [PATCH, RFC 0/3] *** SUBJECT HERE *** @ 2010-08-03 22:57 ` Ted Ts'o 0 siblings, 0 replies; 22+ messages in thread From: Ted Ts'o @ 2010-08-03 22:57 UTC (permalink / raw) To: Ext4 Developers List, ocfs2-devel, Keith Maanthey, John Stultz, Eric Whitney On Tue, Aug 03, 2010 at 02:19:16PM -0700, Joel Becker wrote: > On Tue, Aug 03, 2010 at 04:07:55PM -0400, Ted Ts'o wrote: > > If you are set up to do some performance measurements on OCFS2, I'd > > appreciate if you could give it a try and let me know how the patches > > fare on OCFS2. > > We're lining up a 16-way box here. Do you have a happy dbench > command line or similar that you would like to see run? The problem with dbench is that the core VFS can end up being the bottleneck. OTOH, it's very easy to run. What would be great if you could try using the Flexible File System Benchmark (FFSB), which is what Eric and Steve used: http://free.linux.hp.com/~enw/ext4/2.6.34/ http://btrfs.boxacle.net/ It's a bit more work to set it up, though. - Ted ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC 0/3] *** SUBJECT HERE *** (ext4 scalability patches) 2010-08-03 16:01 ` [Ocfs2-devel] " Theodore Ts'o @ 2010-08-10 3:40 ` Eric Whitney -1 siblings, 0 replies; 22+ messages in thread From: Eric Whitney @ 2010-08-10 3:40 UTC (permalink / raw) To: Theodore Ts'o Cc: Ext4 Developers List, ocfs2-devel, John Stultz, Keith Maanthey On 08/03/2010 12:01 PM, Theodore Ts'o wrote: > The first patch in this patch series hasn't changed since when I had > last posted it, but I'm including it again for the benefit of the folks > on ocfs2-dev. > > Thanks to some work done by Eric Whitney, when he accidentally ran the > command "mkfs.ext4 -t xfs", and created a ext4 file system without a > journal, it appears that main scalability bottleneck for ext4 is in the > jbd2 layer. In fact, his testing on a 48-core system shows that on some > workloads, ext4 is roughly comparable with XFS! > > The lockstat results indicate that the main bottlenecks are in the > j_state_lock and t_handle_lock, especially in start_this_handle() in > fs/jbd2/transaction.c. A previous patch, which removed an unneeded > grabbing of j_state_lock jbd2_journal_stop() relieved pressure on that > lock and was noted to make a significant difference for dbench on a > kernel with CONFIG_PREEMPT_RT enabled, as well as on a 48-core AMD > system from HP. This patch is already in 2.6.35, and the benchmark > results can be found here: http://free.linux.hp.com/~enw/ext4/2.6.34/ > > This patch series removes all exclusive spinlocks when starting and > stopping jbd2 handles, which should improve things even more. Since > OCFS2 uses the jbd2 layer, and the second patch in this patch series > touches ocfs2 a wee bit, I'd appreciate it if you could take a look and > let me know what you think. Hopefully, this should also improve OCFS2's > scalability. > > Best regards, > > - Ted > > Theodore Ts'o (3): > jbd2: Use atomic variables to avoid taking t_handle_lock in > jbd2_journal_stop > jbd2: Change j_state_lock to be a rwlock_t > jbd2: Remove t_handle_lock from start_this_handle() > > fs/ext4/inode.c | 4 +- > fs/ext4/super.c | 4 +- > fs/jbd2/checkpoint.c | 18 +++--- > fs/jbd2/commit.c | 42 +++++++------- > fs/jbd2/journal.c | 94 +++++++++++++++---------------- > fs/jbd2/transaction.c | 149 ++++++++++++++++++++++++++++--------------------- > fs/ocfs2/journal.c | 4 +- > include/linux/jbd2.h | 12 ++-- > 8 files changed, 174 insertions(+), 153 deletions(-) > My 48 core test results for these patches as applied to 2.6.35 can be found at: http://free.linux.hp.com/~enw/ext4/2.6.35 Both the Boxacle large_file_creates and random_writes workloads improved significantly and consistently with these patches, and apparently in the single threaded case as well as at increased scale. The graphs at the URL show one instance of several runs I made to establish repeatability. I've also taken unmodified 2.6.35 ext4, ext4 without a journal, and xfs data for comparison. In addition, I've collected lock stats and more detailed performance data for reference. Thanks, Ted! Eric ^ permalink raw reply [flat|nested] 22+ messages in thread
* [Ocfs2-devel] [PATCH, RFC 0/3] *** SUBJECT HERE *** (ext4 scalability patches) @ 2010-08-10 3:40 ` Eric Whitney 0 siblings, 0 replies; 22+ messages in thread From: Eric Whitney @ 2010-08-10 3:40 UTC (permalink / raw) To: Theodore Ts'o Cc: Ext4 Developers List, ocfs2-devel, John Stultz, Keith Maanthey On 08/03/2010 12:01 PM, Theodore Ts'o wrote: > The first patch in this patch series hasn't changed since when I had > last posted it, but I'm including it again for the benefit of the folks > on ocfs2-dev. > > Thanks to some work done by Eric Whitney, when he accidentally ran the > command "mkfs.ext4 -t xfs", and created a ext4 file system without a > journal, it appears that main scalability bottleneck for ext4 is in the > jbd2 layer. In fact, his testing on a 48-core system shows that on some > workloads, ext4 is roughly comparable with XFS! > > The lockstat results indicate that the main bottlenecks are in the > j_state_lock and t_handle_lock, especially in start_this_handle() in > fs/jbd2/transaction.c. A previous patch, which removed an unneeded > grabbing of j_state_lock jbd2_journal_stop() relieved pressure on that > lock and was noted to make a significant difference for dbench on a > kernel with CONFIG_PREEMPT_RT enabled, as well as on a 48-core AMD > system from HP. This patch is already in 2.6.35, and the benchmark > results can be found here: http://free.linux.hp.com/~enw/ext4/2.6.34/ > > This patch series removes all exclusive spinlocks when starting and > stopping jbd2 handles, which should improve things even more. Since > OCFS2 uses the jbd2 layer, and the second patch in this patch series > touches ocfs2 a wee bit, I'd appreciate it if you could take a look and > let me know what you think. Hopefully, this should also improve OCFS2's > scalability. > > Best regards, > > - Ted > > Theodore Ts'o (3): > jbd2: Use atomic variables to avoid taking t_handle_lock in > jbd2_journal_stop > jbd2: Change j_state_lock to be a rwlock_t > jbd2: Remove t_handle_lock from start_this_handle() > > fs/ext4/inode.c | 4 +- > fs/ext4/super.c | 4 +- > fs/jbd2/checkpoint.c | 18 +++--- > fs/jbd2/commit.c | 42 +++++++------- > fs/jbd2/journal.c | 94 +++++++++++++++---------------- > fs/jbd2/transaction.c | 149 ++++++++++++++++++++++++++++--------------------- > fs/ocfs2/journal.c | 4 +- > include/linux/jbd2.h | 12 ++-- > 8 files changed, 174 insertions(+), 153 deletions(-) > My 48 core test results for these patches as applied to 2.6.35 can be found at: http://free.linux.hp.com/~enw/ext4/2.6.35 Both the Boxacle large_file_creates and random_writes workloads improved significantly and consistently with these patches, and apparently in the single threaded case as well as at increased scale. The graphs at the URL show one instance of several runs I made to establish repeatability. I've also taken unmodified 2.6.35 ext4, ext4 without a journal, and xfs data for comparison. In addition, I've collected lock stats and more detailed performance data for reference. Thanks, Ted! Eric ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH, RFC 0/3] *** SUBJECT HERE *** (ext4 scalability patches) 2010-08-10 3:40 ` [Ocfs2-devel] " Eric Whitney @ 2010-08-11 21:08 ` Ted Ts'o -1 siblings, 0 replies; 22+ messages in thread From: Ted Ts'o @ 2010-08-11 21:08 UTC (permalink / raw) To: Eric Whitney Cc: Ext4 Developers List, ocfs2-devel, John Stultz, Keith Maanthey On Mon, Aug 09, 2010 at 11:40:42PM -0400, Eric Whitney wrote: > > My 48 core test results for these patches as applied to 2.6.35 can > be found at: > > http://free.linux.hp.com/~enw/ext4/2.6.35 > > Both the Boxacle large_file_creates and random_writes workloads > improved significantly and consistently with these patches, and > apparently in the single threaded case as well as at increased > scale. Thanks for doing these runs! I very much appreciate it --- it's really helped to validate these patches. Looking at your results, the two things which stand out to me is that we've now reached parity with XFS on the random write workload on the 48 and 192 core runs. On the large file creates workload, looking at the lockstats report, it looks like the next big thing we need to work on is to rework the ext4_da_writepages() function. The problem is one that's known to me for a while; we carefully spend a bunch of CPU time walking the page structures so we have a contiguous extent of dirty pages that need to written out --- and then we turn around and submit each page 4k at a time. This is causing a huge amount of pressure on the block device queue's rlock. That's almost certainly responsible for the increased CPU utilization that we see in both the large file create workload and random writes workload as compared to XFS. So that's clearly the next thing we need to tackle, and which should further increase ext4's scalability. - Ted ^ permalink raw reply [flat|nested] 22+ messages in thread
* [Ocfs2-devel] [PATCH, RFC 0/3] *** SUBJECT HERE *** (ext4 scalability patches) @ 2010-08-11 21:08 ` Ted Ts'o 0 siblings, 0 replies; 22+ messages in thread From: Ted Ts'o @ 2010-08-11 21:08 UTC (permalink / raw) To: Eric Whitney Cc: Ext4 Developers List, ocfs2-devel, John Stultz, Keith Maanthey On Mon, Aug 09, 2010 at 11:40:42PM -0400, Eric Whitney wrote: > > My 48 core test results for these patches as applied to 2.6.35 can > be found at: > > http://free.linux.hp.com/~enw/ext4/2.6.35 > > Both the Boxacle large_file_creates and random_writes workloads > improved significantly and consistently with these patches, and > apparently in the single threaded case as well as at increased > scale. Thanks for doing these runs! I very much appreciate it --- it's really helped to validate these patches. Looking at your results, the two things which stand out to me is that we've now reached parity with XFS on the random write workload on the 48 and 192 core runs. On the large file creates workload, looking at the lockstats report, it looks like the next big thing we need to work on is to rework the ext4_da_writepages() function. The problem is one that's known to me for a while; we carefully spend a bunch of CPU time walking the page structures so we have a contiguous extent of dirty pages that need to written out --- and then we turn around and submit each page 4k at a time. This is causing a huge amount of pressure on the block device queue's rlock. That's almost certainly responsible for the increased CPU utilization that we see in both the large file create workload and random writes workload as compared to XFS. So that's clearly the next thing we need to tackle, and which should further increase ext4's scalability. - Ted ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2010-08-11 21:08 UTC | newest] Thread overview: 22+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-08-03 16:01 [PATCH, RFC 0/3] *** SUBJECT HERE *** Theodore Ts'o 2010-08-03 16:01 ` [Ocfs2-devel] " Theodore Ts'o 2010-08-03 16:01 ` [PATCH, RFC 1/3] jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop Theodore Ts'o 2010-08-03 16:01 ` [Ocfs2-devel] " Theodore Ts'o 2010-08-03 16:01 ` [PATCH, RFC 2/3] jbd2: Change j_state_lock to be a rwlock_t Theodore Ts'o 2010-08-03 16:01 ` [Ocfs2-devel] " Theodore Ts'o 2010-08-04 0:08 ` Ted Ts'o 2010-08-04 0:08 ` [Ocfs2-devel] " Ted Ts'o 2010-08-03 16:01 ` [PATCH, RFC 3/3] jbd2: Remove t_handle_lock from start_this_handle() Theodore Ts'o 2010-08-03 16:01 ` [Ocfs2-devel] " Theodore Ts'o 2010-08-03 19:07 ` [Ocfs2-devel] [PATCH, RFC 0/3] *** SUBJECT HERE *** Joel Becker 2010-08-03 19:07 ` Joel Becker 2010-08-03 20:07 ` Ted Ts'o 2010-08-03 20:07 ` Ted Ts'o 2010-08-03 21:19 ` Joel Becker 2010-08-03 21:19 ` Joel Becker 2010-08-03 22:57 ` Ted Ts'o 2010-08-03 22:57 ` Ted Ts'o 2010-08-10 3:40 ` [PATCH, RFC 0/3] *** SUBJECT HERE *** (ext4 scalability patches) Eric Whitney 2010-08-10 3:40 ` [Ocfs2-devel] " Eric Whitney 2010-08-11 21:08 ` Ted Ts'o 2010-08-11 21:08 ` [Ocfs2-devel] " Ted Ts'o
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.