From: Jan Kara <jack@suse.cz>
To: Ted Tso <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org, Jan Kara <jack@suse.cz>
Subject: [PATCH 04/26] jbd2: Refine waiting for shadow buffers
Date: Fri, 31 May 2013 11:42:37 +0200
Message-ID: <1369993379-13017-5-git-send-email-jack@suse.cz>
In-Reply-To: <1369993379-13017-1-git-send-email-jack@suse.cz>

Currently, when we add a buffer to a transaction, we wait until the
buffer is removed from the BJ_Shadow list (so that we prevent any changes
to the buffer while it is being written to the journal). This can take
unnecessarily long because a lot happens between the time the buffer is
submitted to the journal and the time we remove it from the BJ_Shadow
list (e.g. we wait for all data buffers in the transaction, we issue a
cache flush, etc.). This also makes do_get_write_access() depend on
transaction commit (namely on waiting for data IO to complete), which we
want to avoid when implementing transaction reservation.

So we modify the commit code to set a new BH_Shadow flag when the
temporary shadowing buffer is created, and we clear that flag once IO on
that buffer is complete. This allows do_get_write_access() to wait only
for the BH_Shadow bit and thus removes the dependency on data IO
completion.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
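For readers who want to see the handshake from the changelog in
isolation, here is a minimal userspace sketch. It is not kernel code and
not part of the patch: struct fake_bh, FAKE_BH_SHADOW and all function
names are hypothetical, and a pthread mutex plus condition variable
stand in for the kernel's bit waitqueue (wait_on_bit() / wake_up_bit()).
It only illustrates the ordering: the commit side sets the flag when the
temporary buffer is created, IO completion on the temporary buffer
clears it and wakes waiters, and a writer waits only on that flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the BH_Shadow bit in bh->b_state. */
#define FAKE_BH_SHADOW (1UL << 0)

/* Hypothetical, heavily reduced model of struct buffer_head. */
struct fake_bh {
	pthread_mutex_t lock;       /* models the bit waitqueue lock */
	pthread_cond_t wq;          /* models the bit waitqueue */
	unsigned long b_state;      /* models bh->b_state */
	struct fake_bh *b_private;  /* temporary buffer -> original buffer */
};

static void fake_bh_init(struct fake_bh *bh)
{
	pthread_mutex_init(&bh->lock, NULL);
	pthread_cond_init(&bh->wq, NULL);
	bh->b_state = 0;
	bh->b_private = NULL;
}

/*
 * Commit side: remember the original buffer in the temporary one and
 * mark the original as shadowed before IO is submitted.
 */
static void start_shadow_io(struct fake_bh *orig, struct fake_bh *tmp)
{
	tmp->b_private = orig;
	pthread_mutex_lock(&orig->lock);
	assert(!(orig->b_state & FAKE_BH_SHADOW));
	orig->b_state |= FAKE_BH_SHADOW;
	pthread_mutex_unlock(&orig->lock);
}

/*
 * IO completion on the temporary buffer: clear the flag on the original
 * buffer and wake everybody waiting for it.
 */
static void end_shadow_io(struct fake_bh *tmp)
{
	struct fake_bh *orig = tmp->b_private;

	if (!orig)
		return;
	pthread_mutex_lock(&orig->lock);
	orig->b_state &= ~FAKE_BH_SHADOW;
	pthread_cond_broadcast(&orig->wq);
	pthread_mutex_unlock(&orig->lock);
}

/*
 * Writer side: sleep only until the shadow flag is cleared, independent
 * of anything else the commit still has to do.
 */
static void wait_on_shadow(struct fake_bh *bh)
{
	pthread_mutex_lock(&bh->lock);
	while (bh->b_state & FAKE_BH_SHADOW)
		pthread_cond_wait(&bh->wq, &bh->lock);
	pthread_mutex_unlock(&bh->lock);
}
```

In a real use, start_shadow_io() and end_shadow_io() would run in the
committing thread and the IO completion path respectively, while
wait_on_shadow() would be called by a thread asking for write access.
The mutex here plays the role that the memory barrier before
wake_up_bit() plays in the patch: the waiter cannot miss the wakeup.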
fs/jbd2/commit.c | 18 +++++++++---------
fs/jbd2/journal.c | 2 ++
fs/jbd2/transaction.c | 44 +++++++++++++++++++-------------------------
include/linux/jbd.h | 25 +++++++++++++++++++++++++
include/linux/jbd2.h | 28 ++++++++++++++++++++++++++++
include/linux/jbd_common.h | 26 --------------------------
6 files changed, 83 insertions(+), 60 deletions(-)
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index dd92fc7..b992e16 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -30,15 +30,22 @@
#include <trace/events/jbd2.h>
/*
- * Default IO end handler for temporary BJ_IO buffer_heads.
+ * IO end handler for temporary buffer_heads handling writes to the journal.
*/
static void journal_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
{
+ struct buffer_head *orig_bh = bh->b_private;
+
BUFFER_TRACE(bh, "");
if (uptodate)
set_buffer_uptodate(bh);
else
clear_buffer_uptodate(bh);
+ if (orig_bh) {
+ clear_bit_unlock(BH_Shadow, &orig_bh->b_state);
+ smp_mb__after_clear_bit();
+ wake_up_bit(&orig_bh->b_state, BH_Shadow);
+ }
unlock_buffer(bh);
}
@@ -831,6 +838,7 @@ start_journal_io:
bh = jh2bh(jh);
clear_buffer_jwrite(bh);
J_ASSERT_BH(bh, buffer_jbddirty(bh));
+ J_ASSERT_BH(bh, !buffer_shadow(bh));
/* The metadata is now released for reuse, but we need
to remember it against this transaction so that when
@@ -838,14 +846,6 @@ start_journal_io:
required. */
JBUFFER_TRACE(jh, "file as BJ_Forget");
jbd2_journal_file_buffer(jh, commit_transaction, BJ_Forget);
- /*
- * Wake up any transactions which were waiting for this IO to
- * complete. The barrier must be here so that changes by
- * jbd2_journal_file_buffer() take effect before wake_up_bit()
- * does the waitqueue check.
- */
- smp_mb();
- wake_up_bit(&bh->b_state, BH_Unshadow);
JBUFFER_TRACE(jh, "brelse shadowed buffer");
__brelse(bh);
}
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 96e0594..e812030 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -451,6 +451,7 @@ repeat:
new_bh->b_size = bh_in->b_size;
new_bh->b_bdev = journal->j_dev;
new_bh->b_blocknr = blocknr;
+ new_bh->b_private = bh_in;
set_buffer_mapped(new_bh);
set_buffer_dirty(new_bh);
@@ -465,6 +466,7 @@ repeat:
spin_lock(&journal->j_list_lock);
__jbd2_journal_file_buffer(jh_in, transaction, BJ_Shadow);
spin_unlock(&journal->j_list_lock);
+ set_buffer_shadow(bh_in);
jbd_unlock_bh_state(bh_in);
return do_escape | (done_copy_out << 1);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 213a43b..4d5ef4b 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -619,6 +619,12 @@ static void warn_dirty_buffer(struct buffer_head *bh)
bdevname(bh->b_bdev, b), (unsigned long long)bh->b_blocknr);
}
+static int sleep_on_shadow_bh(void *word)
+{
+ io_schedule();
+ return 0;
+}
+
/*
* If the buffer is already part of the current transaction, then there
* is nothing we need to do. If it is already part of a prior
@@ -754,41 +760,29 @@ repeat:
* journaled. If the primary copy is already going to
* disk then we cannot do copy-out here. */
- if (jh->b_jlist == BJ_Shadow) {
- DEFINE_WAIT_BIT(wait, &bh->b_state, BH_Unshadow);
- wait_queue_head_t *wqh;
-
- wqh = bit_waitqueue(&bh->b_state, BH_Unshadow);
-
+ if (buffer_shadow(bh)) {
JBUFFER_TRACE(jh, "on shadow: sleep");
jbd_unlock_bh_state(bh);
- /* commit wakes up all shadow buffers after IO */
- for ( ; ; ) {
- prepare_to_wait(wqh, &wait.wait,
- TASK_UNINTERRUPTIBLE);
- if (jh->b_jlist != BJ_Shadow)
- break;
- schedule();
- }
- finish_wait(wqh, &wait.wait);
+ wait_on_bit(&bh->b_state, BH_Shadow,
+ sleep_on_shadow_bh, TASK_UNINTERRUPTIBLE);
goto repeat;
}
- /* Only do the copy if the currently-owning transaction
- * still needs it. If it is on the Forget list, the
- * committing transaction is past that stage. The
- * buffer had better remain locked during the kmalloc,
- * but that should be true --- we hold the journal lock
- * still and the buffer is already on the BUF_JOURNAL
- * list so won't be flushed.
+ /*
+ * Only do the copy if the currently-owning transaction still
+ * needs it. If buffer isn't on BJ_Metadata list, the
+ * committing transaction is past that stage (here we use the
+ * fact that BH_Shadow is set under bh_state lock together with
+ * refiling to BJ_Shadow list and at this point we know the
+ * buffer doesn't have BH_Shadow set).
*
* Subtle point, though: if this is a get_undo_access,
* then we will be relying on the frozen_data to contain
* the new value of the committed_data record after the
* transaction, so we HAVE to force the frozen_data copy
- * in that case. */
-
- if (jh->b_jlist != BJ_Forget || force_copy) {
+ * in that case.
+ */
+ if (jh->b_jlist == BJ_Metadata || force_copy) {
JBUFFER_TRACE(jh, "generate frozen data");
if (!frozen_buffer) {
JBUFFER_TRACE(jh, "allocate memory for buffer");
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index 7e0b622..92062ee 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -244,6 +244,31 @@ typedef struct journal_superblock_s
#include <linux/fs.h>
#include <linux/sched.h>
+
+enum jbd_state_bits {
+ BH_JBD /* Has an attached ext3 journal_head */
+ = BH_PrivateStart,
+ BH_JWrite, /* Being written to log (@@@ DEBUGGING) */
+ BH_Freed, /* Has been freed (truncated) */
+ BH_Revoked, /* Has been revoked from the log */
+ BH_RevokeValid, /* Revoked flag is valid */
+ BH_JBDDirty, /* Is dirty but journaled */
+ BH_State, /* Pins most journal_head state */
+ BH_JournalHead, /* Pins bh->b_private and jh->b_bh */
+ BH_Unshadow, /* Dummy bit, for BJ_Shadow wakeup filtering */
+ BH_JBDPrivateStart, /* First bit available for private use by FS */
+};
+
+BUFFER_FNS(JBD, jbd)
+BUFFER_FNS(JWrite, jwrite)
+BUFFER_FNS(JBDDirty, jbddirty)
+TAS_BUFFER_FNS(JBDDirty, jbddirty)
+BUFFER_FNS(Revoked, revoked)
+TAS_BUFFER_FNS(Revoked, revoked)
+BUFFER_FNS(RevokeValid, revokevalid)
+TAS_BUFFER_FNS(RevokeValid, revokevalid)
+BUFFER_FNS(Freed, freed)
+
#include <linux/jbd_common.h>
#define J_ASSERT(assert) BUG_ON(!(assert))
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index bdb9ae4..a687c8d 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -302,6 +302,34 @@ typedef struct journal_superblock_s
#include <linux/fs.h>
#include <linux/sched.h>
+
+enum jbd_state_bits {
+ BH_JBD /* Has an attached ext3 journal_head */
+ = BH_PrivateStart,
+ BH_JWrite, /* Being written to log (@@@ DEBUGGING) */
+ BH_Freed, /* Has been freed (truncated) */
+ BH_Revoked, /* Has been revoked from the log */
+ BH_RevokeValid, /* Revoked flag is valid */
+ BH_JBDDirty, /* Is dirty but journaled */
+ BH_State, /* Pins most journal_head state */
+ BH_JournalHead, /* Pins bh->b_private and jh->b_bh */
+ BH_Shadow, /* IO on shadow buffer is running */
+ BH_Verified, /* Metadata block has been verified ok */
+ BH_JBDPrivateStart, /* First bit available for private use by FS */
+};
+
+BUFFER_FNS(JBD, jbd)
+BUFFER_FNS(JWrite, jwrite)
+BUFFER_FNS(JBDDirty, jbddirty)
+TAS_BUFFER_FNS(JBDDirty, jbddirty)
+BUFFER_FNS(Revoked, revoked)
+TAS_BUFFER_FNS(Revoked, revoked)
+BUFFER_FNS(RevokeValid, revokevalid)
+TAS_BUFFER_FNS(RevokeValid, revokevalid)
+BUFFER_FNS(Freed, freed)
+BUFFER_FNS(Shadow, shadow)
+BUFFER_FNS(Verified, verified)
+
#include <linux/jbd_common.h>
#define J_ASSERT(assert) BUG_ON(!(assert))
diff --git a/include/linux/jbd_common.h b/include/linux/jbd_common.h
index 6133679..b1f7089 100644
--- a/include/linux/jbd_common.h
+++ b/include/linux/jbd_common.h
@@ -1,32 +1,6 @@
#ifndef _LINUX_JBD_STATE_H
#define _LINUX_JBD_STATE_H
-enum jbd_state_bits {
- BH_JBD /* Has an attached ext3 journal_head */
- = BH_PrivateStart,
- BH_JWrite, /* Being written to log (@@@ DEBUGGING) */
- BH_Freed, /* Has been freed (truncated) */
- BH_Revoked, /* Has been revoked from the log */
- BH_RevokeValid, /* Revoked flag is valid */
- BH_JBDDirty, /* Is dirty but journaled */
- BH_State, /* Pins most journal_head state */
- BH_JournalHead, /* Pins bh->b_private and jh->b_bh */
- BH_Unshadow, /* Dummy bit, for BJ_Shadow wakeup filtering */
- BH_Verified, /* Metadata block has been verified ok */
- BH_JBDPrivateStart, /* First bit available for private use by FS */
-};
-
-BUFFER_FNS(JBD, jbd)
-BUFFER_FNS(JWrite, jwrite)
-BUFFER_FNS(JBDDirty, jbddirty)
-TAS_BUFFER_FNS(JBDDirty, jbddirty)
-BUFFER_FNS(Revoked, revoked)
-TAS_BUFFER_FNS(Revoked, revoked)
-BUFFER_FNS(RevokeValid, revokevalid)
-TAS_BUFFER_FNS(RevokeValid, revokevalid)
-BUFFER_FNS(Freed, freed)
-BUFFER_FNS(Verified, verified)