linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/8 v3] Checkpointing fixes and cleanups
@ 2012-02-15 18:34 Jan Kara
  2012-02-15 18:34 ` [PATCH 1/8] jbd2: Split updating of journal superblock and marking journal empty Jan Kara
                   ` (8 more replies)
  0 siblings, 9 replies; 15+ messages in thread
From: Jan Kara @ 2012-02-15 18:34 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4


  Hello,

  this is a third round of my checkpointing fixes. The patch set is primarily
aimed at fixing filesystem corruption happening occasionally (rather rarely)
after power failure. First three patches should fix the issue. Patches 4-7 are
assorted checkpointing cleanups I've gathered when inspecting checkpointing
code. Finally patch 8 is a possible speedup of checkpoining - we can use cache
flushes happening during transaction commits for pushing the journal tail
safely. The observable speedup is disputable since jbd2_cleanup_journal_tail()
is called rather rarely (for metadata heavy load I saw about one
jbd2_cleanup_journal_tail() for about 200 commits) so the cost of additional
cache flush will be likely in the noise. But the patch is simple enough so I
send it for others to judge whether it makes sense or not.

Changes since v2:
Rewrote the code pushing log tail to fix race described by Ted.
Improved some comments

									Honza

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/8] jbd2: Split updating of journal superblock and marking journal empty
  2012-02-15 18:34 [PATCH 0/8 v3] Checkpointing fixes and cleanups Jan Kara
@ 2012-02-15 18:34 ` Jan Kara
  2012-02-15 18:34 ` [PATCH 2/8] jbd2: Protect all log tail updates with j_checkpoint_mutex Jan Kara
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2012-02-15 18:34 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

There are three case of updating journal superblock. In the first case, we want
to mark journal as empty (setting s_sequence to 0), in the second case we want
to update log tail, in the third case we want to update s_errno. Split these
cases into separate functions. It makes the code slightly more straightforward
and later patches will make the distinction even more important.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c |    2 +-
 fs/jbd2/commit.c     |    2 +-
 fs/jbd2/journal.c    |  162 +++++++++++++++++++++++++++++---------------------
 include/linux/jbd2.h |    2 +-
 4 files changed, 98 insertions(+), 70 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 16a698b..aa09868 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -550,7 +550,7 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
 	    (journal->j_flags & JBD2_BARRIER))
 		blkdev_issue_flush(journal->j_fs_dev, GFP_KERNEL, NULL);
 	if (!(journal->j_flags & JBD2_ABORT))
-		jbd2_journal_update_superblock(journal, 1);
+		jbd2_journal_update_sb_log_tail(journal);
 	return 0;
 }
 
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 68d704d..ffdf512 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -340,7 +340,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	/* Do we need to erase the effects of a prior jbd2_journal_flush? */
 	if (journal->j_flags & JBD2_FLUSHED) {
 		jbd_debug(3, "super block updated\n");
-		jbd2_journal_update_superblock(journal, 1);
+		jbd2_journal_update_sb_log_tail(journal);
 	} else {
 		jbd_debug(3, "superblock not updated\n");
 	}
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 0fa0123..ec05900 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1112,39 +1112,28 @@ static int journal_reset(journal_t *journal)
 
 	journal->j_max_transaction_buffers = journal->j_maxlen / 4;
 
-	/* Add the dynamic fields and write it to disk. */
-	jbd2_journal_update_superblock(journal, 1);
-	return jbd2_journal_start_thread(journal);
-}
-
-/**
- * void jbd2_journal_update_superblock() - Update journal sb on disk.
- * @journal: The journal to update.
- * @wait: Set to '0' if you don't want to wait for IO completion.
- *
- * Update a journal's dynamic superblock fields and write it to disk,
- * optionally waiting for the IO to complete.
- */
-void jbd2_journal_update_superblock(journal_t *journal, int wait)
-{
-	journal_superblock_t *sb = journal->j_superblock;
-	struct buffer_head *bh = journal->j_sb_buffer;
-
 	/*
 	 * As a special case, if the on-disk copy is already marked as needing
-	 * no recovery (s_start == 0) and there are no outstanding transactions
-	 * in the filesystem, then we can safely defer the superblock update
-	 * until the next commit by setting JBD2_FLUSHED.  This avoids
+	 * no recovery (s_start == 0), then we can safely defer the superblock
+	 * update until the next commit by setting JBD2_FLUSHED.  This avoids
 	 * attempting a write to a potential-readonly device.
 	 */
-	if (sb->s_start == 0 && journal->j_tail_sequence ==
-				journal->j_transaction_sequence) {
+	if (sb->s_start == 0) {
 		jbd_debug(1, "JBD2: Skipping superblock update on recovered sb "
 			"(start %ld, seq %d, errno %d)\n",
 			journal->j_tail, journal->j_tail_sequence,
 			journal->j_errno);
-		goto out;
+		journal->j_flags |= JBD2_FLUSHED;
+	} else {
+		/* Add the dynamic fields and write it to disk. */
+		jbd2_journal_update_sb_log_tail(journal);
 	}
+	return jbd2_journal_start_thread(journal);
+}
+
+static void jbd2_write_superblock(journal_t *journal)
+{
+	struct buffer_head *bh = journal->j_sb_buffer;
 
 	if (buffer_write_io_error(bh)) {
 		/*
@@ -1162,47 +1151,98 @@ void jbd2_journal_update_superblock(journal_t *journal, int wait)
 		set_buffer_uptodate(bh);
 	}
 
+	BUFFER_TRACE(bh, "marking dirty");
+	mark_buffer_dirty(bh);
+	sync_dirty_buffer(bh);
+	if (buffer_write_io_error(bh)) {
+		printk(KERN_ERR "JBD2: I/O error detected "
+		       "when updating journal superblock for %s.\n",
+		       journal->j_devname);
+		clear_buffer_write_io_error(bh);
+		set_buffer_uptodate(bh);
+	}
+}
+
+/**
+ * jbd2_journal_update_sb_log_tail() - Update log tail in journal sb on disk.
+ * @journal: The journal to update.
+ *
+ * Update a journal's superblock information about log tail and write it to
+ * disk, waiting for the IO to complete.
+ */
+void jbd2_journal_update_sb_log_tail(journal_t *journal)
+{
+	journal_superblock_t *sb = journal->j_superblock;
+
 	read_lock(&journal->j_state_lock);
-	jbd_debug(1, "JBD2: updating superblock (start %ld, seq %d, errno %d)\n",
-		  journal->j_tail, journal->j_tail_sequence, journal->j_errno);
+	jbd_debug(1, "JBD2: updating superblock (start %ld, seq %d)\n",
+		  journal->j_tail, journal->j_tail_sequence);
 
 	sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
 	sb->s_start    = cpu_to_be32(journal->j_tail);
-	sb->s_errno    = cpu_to_be32(journal->j_errno);
 	read_unlock(&journal->j_state_lock);
 
-	BUFFER_TRACE(bh, "marking dirty");
-	mark_buffer_dirty(bh);
-	if (wait) {
-		sync_dirty_buffer(bh);
-		if (buffer_write_io_error(bh)) {
-			printk(KERN_ERR "JBD2: I/O error detected "
-			       "when updating journal superblock for %s.\n",
-			       journal->j_devname);
-			clear_buffer_write_io_error(bh);
-			set_buffer_uptodate(bh);
-		}
-	} else
-		write_dirty_buffer(bh, WRITE);
+	jbd2_write_superblock(journal);
 
-out:
-	/* If we have just flushed the log (by marking s_start==0), then
-	 * any future commit will have to be careful to update the
-	 * superblock again to re-record the true start of the log. */
+	/* Log is no longer empty */
+	write_lock(&journal->j_state_lock);
+	WARN_ON(!sb->s_sequence);
+	journal->j_flags &= ~JBD2_FLUSHED;
+	write_unlock(&journal->j_state_lock);
+}
+
+/**
+ * jbd2_mark_journal_empty() - Mark on disk journal as empty.
+ * @journal: The journal to update.
+ *
+ * Update a journal's dynamic superblock fields to show that journal is empty.
+ * Write updated superblock to disk waiting for IO to complete.
+ */
+static void jbd2_mark_journal_empty(journal_t *journal)
+{
+	journal_superblock_t *sb = journal->j_superblock;
+
+	read_lock(&journal->j_state_lock);
+	jbd_debug(1, "JBD2: Marking journal as empty (seq %d)\n",
+		  journal->j_tail_sequence);
+
+	sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
+	sb->s_start    = cpu_to_be32(0);
+	read_unlock(&journal->j_state_lock);
+
+	jbd2_write_superblock(journal);
 
+	/* Log is no longer empty */
 	write_lock(&journal->j_state_lock);
-	if (sb->s_start)
-		journal->j_flags &= ~JBD2_FLUSHED;
-	else
-		journal->j_flags |= JBD2_FLUSHED;
+	journal->j_flags |= JBD2_FLUSHED;
 	write_unlock(&journal->j_state_lock);
 }
 
+
+/**
+ * jbd2_journal_update_sb_errno() - Update error in the journal.
+ * @journal: The journal to update.
+ *
+ * Update a journal's errno.  Write updated superblock to disk waiting for IO
+ * to complete.
+ */
+static void jbd2_journal_update_sb_errno(journal_t *journal)
+{
+	journal_superblock_t *sb = journal->j_superblock;
+
+	read_lock(&journal->j_state_lock);
+	jbd_debug(1, "JBD2: updating superblock error (errno %d)\n",
+		  journal->j_errno);
+	sb->s_errno    = cpu_to_be32(journal->j_errno);
+	read_unlock(&journal->j_state_lock);
+
+	jbd2_write_superblock(journal);
+}
+
 /*
  * Read the superblock for a given journal, performing initial
  * validation of the format.
  */
-
 static int journal_get_superblock(journal_t *journal)
 {
 	struct buffer_head *bh;
@@ -1395,15 +1435,10 @@ int jbd2_journal_destroy(journal_t *journal)
 	spin_unlock(&journal->j_list_lock);
 
 	if (journal->j_sb_buffer) {
-		if (!is_journal_aborted(journal)) {
-			/* We can now mark the journal as empty. */
-			journal->j_tail = 0;
-			journal->j_tail_sequence =
-				++journal->j_transaction_sequence;
-			jbd2_journal_update_superblock(journal, 1);
-		} else {
+		if (!is_journal_aborted(journal))
+			jbd2_mark_journal_empty(journal);
+		else
 			err = -EIO;
-		}
 		brelse(journal->j_sb_buffer);
 	}
 
@@ -1617,7 +1652,6 @@ int jbd2_journal_flush(journal_t *journal)
 {
 	int err = 0;
 	transaction_t *transaction = NULL;
-	unsigned long old_tail;
 
 	write_lock(&journal->j_state_lock);
 
@@ -1659,14 +1693,8 @@ int jbd2_journal_flush(journal_t *journal)
 	 * the magic code for a fully-recovered superblock.  Any future
 	 * commits of data to the journal will restore the current
 	 * s_start value. */
+	jbd2_mark_journal_empty(journal);
 	write_lock(&journal->j_state_lock);
-	old_tail = journal->j_tail;
-	journal->j_tail = 0;
-	write_unlock(&journal->j_state_lock);
-	jbd2_journal_update_superblock(journal, 1);
-	write_lock(&journal->j_state_lock);
-	journal->j_tail = old_tail;
-
 	J_ASSERT(!journal->j_running_transaction);
 	J_ASSERT(!journal->j_committing_transaction);
 	J_ASSERT(!journal->j_checkpoint_transactions);
@@ -1707,7 +1735,7 @@ int jbd2_journal_wipe(journal_t *journal, int write)
 
 	err = jbd2_journal_skip_recovery(journal);
 	if (write)
-		jbd2_journal_update_superblock(journal, 1);
+		jbd2_mark_journal_empty(journal);
 
  no_recovery:
 	return err;
@@ -1757,7 +1785,7 @@ static void __journal_abort_soft (journal_t *journal, int errno)
 	__jbd2_journal_abort_hard(journal);
 
 	if (errno)
-		jbd2_journal_update_superblock(journal, 1);
+		jbd2_journal_update_sb_errno(journal);
 }
 
 /**
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 2092ea2..78da912 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1082,7 +1082,7 @@ extern int	   jbd2_journal_destroy    (journal_t *);
 extern int	   jbd2_journal_recover    (journal_t *journal);
 extern int	   jbd2_journal_wipe       (journal_t *, int);
 extern int	   jbd2_journal_skip_recovery	(journal_t *);
-extern void	   jbd2_journal_update_superblock	(journal_t *, int);
+extern void	   jbd2_journal_update_sb_log_tail	(journal_t *);
 extern void	   __jbd2_journal_abort_hard	(journal_t *);
 extern void	   jbd2_journal_abort      (journal_t *, int);
 extern int	   jbd2_journal_errno      (journal_t *);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/8] jbd2: Protect all log tail updates with j_checkpoint_mutex
  2012-02-15 18:34 [PATCH 0/8 v3] Checkpointing fixes and cleanups Jan Kara
  2012-02-15 18:34 ` [PATCH 1/8] jbd2: Split updating of journal superblock and marking journal empty Jan Kara
@ 2012-02-15 18:34 ` Jan Kara
  2012-02-15 18:34 ` [PATCH 3/8] jbd2: Issue cache flush after checkpointing even with internal journal Jan Kara
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2012-02-15 18:34 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

There are some log tail updates that are not protected by j_checkpoint_mutex.
Some of these are harmless because they happen during startup or shutdown but
updates in jbd2_journal_commit_transaction() and jbd2_journal_flush() can
really race with other log tail updates (e.g. someone doing
jbd2_journal_flush() with someone running jbd2_cleanup_journal_tail()). So
protect all log tail updates with j_checkpoint_mutex.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/commit.c  |    2 ++
 fs/jbd2/journal.c |   19 ++++++++++++++++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index ffdf512..39978c1 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -340,7 +340,9 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	/* Do we need to erase the effects of a prior jbd2_journal_flush? */
 	if (journal->j_flags & JBD2_FLUSHED) {
 		jbd_debug(3, "super block updated\n");
+		mutex_lock(&journal->j_checkpoint_mutex);
 		jbd2_journal_update_sb_log_tail(journal);
+		mutex_unlock(&journal->j_checkpoint_mutex);
 	} else {
 		jbd_debug(3, "superblock not updated\n");
 	}
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index ec05900..183a099 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1125,8 +1125,11 @@ static int journal_reset(journal_t *journal)
 			journal->j_errno);
 		journal->j_flags |= JBD2_FLUSHED;
 	} else {
+		/* Lock here to make assertions happy... */
+		mutex_lock(&journal->j_checkpoint_mutex);
 		/* Add the dynamic fields and write it to disk. */
 		jbd2_journal_update_sb_log_tail(journal);
+		mutex_unlock(&journal->j_checkpoint_mutex);
 	}
 	return jbd2_journal_start_thread(journal);
 }
@@ -1174,6 +1177,7 @@ void jbd2_journal_update_sb_log_tail(journal_t *journal)
 {
 	journal_superblock_t *sb = journal->j_superblock;
 
+	BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
 	read_lock(&journal->j_state_lock);
 	jbd_debug(1, "JBD2: updating superblock (start %ld, seq %d)\n",
 		  journal->j_tail, journal->j_tail_sequence);
@@ -1202,6 +1206,7 @@ static void jbd2_mark_journal_empty(journal_t *journal)
 {
 	journal_superblock_t *sb = journal->j_superblock;
 
+	BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
 	read_lock(&journal->j_state_lock);
 	jbd_debug(1, "JBD2: Marking journal as empty (seq %d)\n",
 		  journal->j_tail_sequence);
@@ -1435,9 +1440,11 @@ int jbd2_journal_destroy(journal_t *journal)
 	spin_unlock(&journal->j_list_lock);
 
 	if (journal->j_sb_buffer) {
-		if (!is_journal_aborted(journal))
+		if (!is_journal_aborted(journal)) {
+			mutex_lock(&journal->j_checkpoint_mutex);
 			jbd2_mark_journal_empty(journal);
-		else
+			mutex_unlock(&journal->j_checkpoint_mutex);
+		} else
 			err = -EIO;
 		brelse(journal->j_sb_buffer);
 	}
@@ -1686,6 +1693,7 @@ int jbd2_journal_flush(journal_t *journal)
 	if (is_journal_aborted(journal))
 		return -EIO;
 
+	mutex_lock(&journal->j_checkpoint_mutex);
 	jbd2_cleanup_journal_tail(journal);
 
 	/* Finally, mark the journal as really needing no recovery.
@@ -1694,6 +1702,7 @@ int jbd2_journal_flush(journal_t *journal)
 	 * commits of data to the journal will restore the current
 	 * s_start value. */
 	jbd2_mark_journal_empty(journal);
+	mutex_unlock(&journal->j_checkpoint_mutex);
 	write_lock(&journal->j_state_lock);
 	J_ASSERT(!journal->j_running_transaction);
 	J_ASSERT(!journal->j_committing_transaction);
@@ -1734,8 +1743,12 @@ int jbd2_journal_wipe(journal_t *journal, int write)
 		write ? "Clearing" : "Ignoring");
 
 	err = jbd2_journal_skip_recovery(journal);
-	if (write)
+	if (write) {
+		/* Lock to make assertions happy... */
+		mutex_lock(&journal->j_checkpoint_mutex);
 		jbd2_mark_journal_empty(journal);
+		mutex_unlock(&journal->j_checkpoint_mutex);
+	}
 
  no_recovery:
 	return err;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/8] jbd2: Issue cache flush after checkpointing even with internal journal
  2012-02-15 18:34 [PATCH 0/8 v3] Checkpointing fixes and cleanups Jan Kara
  2012-02-15 18:34 ` [PATCH 1/8] jbd2: Split updating of journal superblock and marking journal empty Jan Kara
  2012-02-15 18:34 ` [PATCH 2/8] jbd2: Protect all log tail updates with j_checkpoint_mutex Jan Kara
@ 2012-02-15 18:34 ` Jan Kara
  2012-03-14  2:17   ` Ted Ts'o
  2012-02-15 18:34 ` [PATCH 4/8] jbd2: Fix BH_JWrite setting in checkpointing code Jan Kara
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2012-02-15 18:34 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

When we reach jbd2_cleanup_journal_tail(), there is no guarantee that
checkpointed buffers are on a stable storage - especially if buffers were
written out by jbd2_log_do_checkpoint(), they are likely to be only in disk's
caches. Thus when we update journal superblock effectively removing old
transaction from journal, this write of superblock can get to stable storage
before those checkpointed buffers which can result in filesystem corruption
after a crash. Thus we must unconditionally issue a cache flush before we
update journal superblock in these cases.

A similar problem can also occur if journal superblock is written only in
disk's caches, other transaction starts reusing space of the transaction
cleaned from the log and power failure happens. Subsequent journal replay would
still try to replay the old transaction but some of it's blocks may be already
overwritten by the new transaction. For this reason we must use WRITE_FUA when
updating log tail and we must first write new log tail to disk and update
in-memory information only after that.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c        |   75 ++++--------------------
 fs/jbd2/commit.c            |   11 +++-
 fs/jbd2/journal.c           |  136 ++++++++++++++++++++++++++++++++++++------
 fs/jbd2/recovery.c          |    5 +-
 include/linux/jbd2.h        |    6 ++-
 include/trace/events/jbd2.h |    2 +-
 6 files changed, 148 insertions(+), 87 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index aa09868..de76f1f 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -478,79 +478,28 @@ out:
 
 int jbd2_cleanup_journal_tail(journal_t *journal)
 {
-	transaction_t * transaction;
 	tid_t		first_tid;
-	unsigned long	blocknr, freed;
+	unsigned long	blocknr;
 
 	if (is_journal_aborted(journal))
 		return 1;
 
-	/* OK, work out the oldest transaction remaining in the log, and
-	 * the log block it starts at.
-	 *
-	 * If the log is now empty, we need to work out which is the
-	 * next transaction ID we will write, and where it will
-	 * start. */
-
-	write_lock(&journal->j_state_lock);
-	spin_lock(&journal->j_list_lock);
-	transaction = journal->j_checkpoint_transactions;
-	if (transaction) {
-		first_tid = transaction->t_tid;
-		blocknr = transaction->t_log_start;
-	} else if ((transaction = journal->j_committing_transaction) != NULL) {
-		first_tid = transaction->t_tid;
-		blocknr = transaction->t_log_start;
-	} else if ((transaction = journal->j_running_transaction) != NULL) {
-		first_tid = transaction->t_tid;
-		blocknr = journal->j_head;
-	} else {
-		first_tid = journal->j_transaction_sequence;
-		blocknr = journal->j_head;
-	}
-	spin_unlock(&journal->j_list_lock);
-	J_ASSERT(blocknr != 0);
-
-	/* If the oldest pinned transaction is at the tail of the log
-           already then there's not much we can do right now. */
-	if (journal->j_tail_sequence == first_tid) {
-		write_unlock(&journal->j_state_lock);
+	if (!jbd2_journal_get_log_tail(journal, &first_tid, &blocknr))
 		return 1;
-	}
-
-	/* OK, update the superblock to recover the freed space.
-	 * Physical blocks come first: have we wrapped beyond the end of
-	 * the log?  */
-	freed = blocknr - journal->j_tail;
-	if (blocknr < journal->j_tail)
-		freed = freed + journal->j_last - journal->j_first;
-
-	trace_jbd2_cleanup_journal_tail(journal, first_tid, blocknr, freed);
-	jbd_debug(1,
-		  "Cleaning journal tail from %d to %d (offset %lu), "
-		  "freeing %lu\n",
-		  journal->j_tail_sequence, first_tid, blocknr, freed);
-
-	journal->j_free += freed;
-	journal->j_tail_sequence = first_tid;
-	journal->j_tail = blocknr;
-	write_unlock(&journal->j_state_lock);
+	J_ASSERT(blocknr != 0);
 
 	/*
-	 * If there is an external journal, we need to make sure that
-	 * any data blocks that were recently written out --- perhaps
-	 * by jbd2_log_do_checkpoint() --- are flushed out before we
-	 * drop the transactions from the external journal.  It's
-	 * unlikely this will be necessary, especially with a
-	 * appropriately sized journal, but we need this to guarantee
-	 * correctness.  Fortunately jbd2_cleanup_journal_tail()
-	 * doesn't get called all that often.
+	 * We need to make sure that any blocks that were recently written out
+	 * --- perhaps by jbd2_log_do_checkpoint() --- are flushed out before
+	 * we drop the transactions from the journal. It's unlikely this will
+	 * be necessary, especially with an appropriately sized journal, but we
+	 * need this to guarantee correctness.  Fortunately
+	 * jbd2_cleanup_journal_tail() doesn't get called all that often.
 	 */
-	if ((journal->j_fs_dev != journal->j_dev) &&
-	    (journal->j_flags & JBD2_BARRIER))
+	if (journal->j_flags & JBD2_BARRIER)
 		blkdev_issue_flush(journal->j_fs_dev, GFP_KERNEL, NULL);
-	if (!(journal->j_flags & JBD2_ABORT))
-		jbd2_journal_update_sb_log_tail(journal);
+
+	__jbd2_update_log_tail(journal, first_tid, blocknr);
 	return 0;
 }
 
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 39978c1..f37b783 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -341,7 +341,16 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	if (journal->j_flags & JBD2_FLUSHED) {
 		jbd_debug(3, "super block updated\n");
 		mutex_lock(&journal->j_checkpoint_mutex);
-		jbd2_journal_update_sb_log_tail(journal);
+		/*
+		 * We hold j_checkpoint_mutex so tail cannot change under us.
+		 * We don't need any special data guarantees for writing sb
+		 * since journal is empty and it is ok for write to be
+		 * flushed only with transaction commit.
+		 */
+		jbd2_journal_update_sb_log_tail(journal,
+						journal->j_tail_sequence,
+						journal->j_tail,
+						WRITE_SYNC);
 		mutex_unlock(&journal->j_checkpoint_mutex);
 	} else {
 		jbd_debug(3, "superblock not updated\n");
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 183a099..192558f 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -744,6 +744,85 @@ struct journal_head *jbd2_journal_get_descriptor_buffer(journal_t *journal)
 	return jbd2_journal_add_journal_head(bh);
 }
 
+/*
+ * Return tid of the oldest transaction in the journal and block in the journal
+ * where the transaction starts.
+ *
+ * If the journal is now empty, return which will be the next transaction ID
+ * we will write and where will that transaction start.
+ *
+ * The return value is 0 if journal tail cannot be pushed any further, 1 if
+ * it can.
+ */
+int jbd2_journal_get_log_tail(journal_t *journal, tid_t *tid,
+			      unsigned long *block)
+{
+	transaction_t *transaction;
+	int ret;
+
+	read_lock(&journal->j_state_lock);
+	spin_lock(&journal->j_list_lock);
+	transaction = journal->j_checkpoint_transactions;
+	if (transaction) {
+		*tid = transaction->t_tid;
+		*block = transaction->t_log_start;
+	} else if ((transaction = journal->j_committing_transaction) != NULL) {
+		*tid = transaction->t_tid;
+		*block = transaction->t_log_start;
+	} else if ((transaction = journal->j_running_transaction) != NULL) {
+		*tid = transaction->t_tid;
+		*block = journal->j_head;
+	} else {
+		*tid = journal->j_transaction_sequence;
+		*block = journal->j_head;
+	}
+	ret = tid_gt(*tid, journal->j_tail_sequence);
+	spin_unlock(&journal->j_list_lock);
+	read_unlock(&journal->j_state_lock);
+
+	return ret;
+}
+
+/*
+ * Update information in journal structure and in on disk journal superblock
+ * about log tail. This function does not check whether information passed in
+ * really pushes log tail further. It's responsibility of the caller to make
+ * sure provided log tail information is valid (e.g. by holding
+ * j_checkpoint_mutex all the time between computing log tail and calling this
+ * function as is the case with jbd2_cleanup_journal_tail()).
+ *
+ * Requires j_checkpoint_mutex
+ */
+void __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block)
+{
+	unsigned long freed;
+
+	BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
+
+	/*
+	 * We cannot afford for write to remain in drive's caches since as
+	 * soon as we update j_tail, next transaction can start reusing journal
+	 * space and if we lose sb update during power failure we'd replay
+	 * old transaction with possibly newly overwritten data.
+	 */
+	jbd2_journal_update_sb_log_tail(journal, tid, block, WRITE_FUA);
+	write_lock(&journal->j_state_lock);
+	freed = block - journal->j_tail;
+	if (block < journal->j_tail)
+		freed += journal->j_last - journal->j_first;
+
+	trace_jbd2_update_log_tail(journal, tid, block, freed);
+	jbd_debug(1,
+		  "Cleaning journal tail from %d to %d (offset %lu), "
+		  "freeing %lu\n",
+		  journal->j_tail_sequence, tid, block, freed);
+
+	journal->j_free += freed;
+	journal->j_tail_sequence = tid;
+	journal->j_tail = block;
+	write_unlock(&journal->j_state_lock);
+}
+
 struct jbd2_stats_proc_session {
 	journal_t *journal;
 	struct transaction_stats_s *stats;
@@ -1127,17 +1206,29 @@ static int journal_reset(journal_t *journal)
 	} else {
 		/* Lock here to make assertions happy... */
 		mutex_lock(&journal->j_checkpoint_mutex);
-		/* Add the dynamic fields and write it to disk. */
-		jbd2_journal_update_sb_log_tail(journal);
+		/*
+		 * Update log tail information. We use WRITE_FUA since new
+		 * transaction will start reusing journal space and so we
+		 * must make sure information about current log tail is on
+		 * disk before that.
+		 */
+		jbd2_journal_update_sb_log_tail(journal,
+						journal->j_tail_sequence,
+						journal->j_tail,
+						WRITE_FUA);
 		mutex_unlock(&journal->j_checkpoint_mutex);
 	}
 	return jbd2_journal_start_thread(journal);
 }
 
-static void jbd2_write_superblock(journal_t *journal)
+static void jbd2_write_superblock(journal_t *journal, int write_op)
 {
 	struct buffer_head *bh = journal->j_sb_buffer;
+	int ret;
 
+	if (!(journal->j_flags & JBD2_BARRIER))
+		write_op &= ~(REQ_FUA | REQ_FLUSH);
+	lock_buffer(bh);
 	if (buffer_write_io_error(bh)) {
 		/*
 		 * Oh, dear.  A previous attempt to write the journal
@@ -1153,40 +1244,45 @@ static void jbd2_write_superblock(journal_t *journal)
 		clear_buffer_write_io_error(bh);
 		set_buffer_uptodate(bh);
 	}
-
-	BUFFER_TRACE(bh, "marking dirty");
-	mark_buffer_dirty(bh);
-	sync_dirty_buffer(bh);
+	get_bh(bh);
+	bh->b_end_io = end_buffer_write_sync;
+	ret = submit_bh(write_op, bh);
+	wait_on_buffer(bh);
 	if (buffer_write_io_error(bh)) {
-		printk(KERN_ERR "JBD2: I/O error detected "
-		       "when updating journal superblock for %s.\n",
-		       journal->j_devname);
 		clear_buffer_write_io_error(bh);
 		set_buffer_uptodate(bh);
+		ret = -EIO;
+	}
+	if (ret) {
+		printk(KERN_ERR "JBD2: Error %d detected when updating "
+		       "journal superblock for %s.\n", ret,
+		       journal->j_devname);
 	}
 }
 
 /**
  * jbd2_journal_update_sb_log_tail() - Update log tail in journal sb on disk.
  * @journal: The journal to update.
+ * @tail_tid: TID of the new transaction at the tail of the log
+ * @tail_block: The first block of the transaction at the tail of the log
+ * @write_op: With which operation should we write the journal sb
  *
  * Update a journal's superblock information about log tail and write it to
  * disk, waiting for the IO to complete.
  */
-void jbd2_journal_update_sb_log_tail(journal_t *journal)
+void jbd2_journal_update_sb_log_tail(journal_t *journal, tid_t tail_tid,
+				     unsigned long tail_block, int write_op)
 {
 	journal_superblock_t *sb = journal->j_superblock;
 
 	BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
-	read_lock(&journal->j_state_lock);
-	jbd_debug(1, "JBD2: updating superblock (start %ld, seq %d)\n",
-		  journal->j_tail, journal->j_tail_sequence);
+	jbd_debug(1, "JBD2: updating superblock (start %lu, seq %u)\n",
+		  tail_block, tail_tid);
 
-	sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
-	sb->s_start    = cpu_to_be32(journal->j_tail);
-	read_unlock(&journal->j_state_lock);
+	sb->s_sequence = cpu_to_be32(tail_tid);
+	sb->s_start    = cpu_to_be32(tail_block);
 
-	jbd2_write_superblock(journal);
+	jbd2_write_superblock(journal, write_op);
 
 	/* Log is no longer empty */
 	write_lock(&journal->j_state_lock);
@@ -1215,7 +1311,7 @@ static void jbd2_mark_journal_empty(journal_t *journal)
 	sb->s_start    = cpu_to_be32(0);
 	read_unlock(&journal->j_state_lock);
 
-	jbd2_write_superblock(journal);
+	jbd2_write_superblock(journal, WRITE_FUA);
 
 	/* Log is no longer empty */
 	write_lock(&journal->j_state_lock);
@@ -1241,7 +1337,7 @@ static void jbd2_journal_update_sb_errno(journal_t *journal)
 	sb->s_errno    = cpu_to_be32(journal->j_errno);
 	read_unlock(&journal->j_state_lock);
 
-	jbd2_write_superblock(journal);
+	jbd2_write_superblock(journal, WRITE_SYNC);
 }
 
 /*
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index da6d7ba..c1a0335 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -21,6 +21,7 @@
 #include <linux/jbd2.h>
 #include <linux/errno.h>
 #include <linux/crc32.h>
+#include <linux/blkdev.h>
 #endif
 
 /*
@@ -265,7 +266,9 @@ int jbd2_journal_recover(journal_t *journal)
 	err2 = sync_blockdev(journal->j_fs_dev);
 	if (!err)
 		err = err2;
-
+	/* Make sure all replayed data is on permanent storage */
+	if (journal->j_flags & JBD2_BARRIER)
+		blkdev_issue_flush(journal->j_fs_dev, GFP_KERNEL, NULL);
 	return err;
 }
 
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 78da912..1b10c2a 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -971,6 +971,9 @@ extern void __journal_clean_data_list(transaction_t *transaction);
 /* Log buffer allocation */
 extern struct journal_head * jbd2_journal_get_descriptor_buffer(journal_t *);
 int jbd2_journal_next_log_block(journal_t *, unsigned long long *);
+int jbd2_journal_get_log_tail(journal_t *journal, tid_t *tid,
+			      unsigned long *block);
+void __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
 
 /* Commit management */
 extern void jbd2_journal_commit_transaction(journal_t *);
@@ -1082,7 +1085,8 @@ extern int	   jbd2_journal_destroy    (journal_t *);
 extern int	   jbd2_journal_recover    (journal_t *journal);
 extern int	   jbd2_journal_wipe       (journal_t *, int);
 extern int	   jbd2_journal_skip_recovery	(journal_t *);
-extern void	   jbd2_journal_update_sb_log_tail	(journal_t *);
+extern void	   jbd2_journal_update_sb_log_tail	(journal_t *, tid_t,
+				unsigned long, int);
 extern void	   __jbd2_journal_abort_hard	(journal_t *);
 extern void	   jbd2_journal_abort      (journal_t *, int);
 extern int	   jbd2_journal_errno      (journal_t *);
diff --git a/include/trace/events/jbd2.h b/include/trace/events/jbd2.h
index 7596441..5c74007 100644
--- a/include/trace/events/jbd2.h
+++ b/include/trace/events/jbd2.h
@@ -200,7 +200,7 @@ TRACE_EVENT(jbd2_checkpoint_stats,
 		  __entry->forced_to_close, __entry->written, __entry->dropped)
 );
 
-TRACE_EVENT(jbd2_cleanup_journal_tail,
+TRACE_EVENT(jbd2_update_log_tail,
 
 	TP_PROTO(journal_t *journal, tid_t first_tid,
 		 unsigned long block_nr, unsigned long freed),
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 4/8] jbd2: Fix BH_JWrite setting in checkpointing code
  2012-02-15 18:34 [PATCH 0/8 v3] Checkpointing fixes and cleanups Jan Kara
                   ` (2 preceding siblings ...)
  2012-02-15 18:34 ` [PATCH 3/8] jbd2: Issue cache flush after checkpointing even with internal journal Jan Kara
@ 2012-02-15 18:34 ` Jan Kara
  2012-02-15 18:34 ` [PATCH 5/8] jbd2: __jbd2_journal_temp_unlink_buffer() is static Jan Kara
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2012-02-15 18:34 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

BH_JWrite bit should be set when buffer is written to the journal. So
checkpointing shouldn't set this bit when writing out buffer. This didn't
cause any observable bug since BH_JWrite bit is used only for debugging
purposes but it's good to have this consistent.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index de76f1f..e68a820 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -266,7 +266,6 @@ __flush_batch(journal_t *journal, int *batch_count)
 
 	for (i = 0; i < *batch_count; i++) {
 		struct buffer_head *bh = journal->j_chkpt_bhs[i];
-		clear_buffer_jwrite(bh);
 		BUFFER_TRACE(bh, "brelse");
 		__brelse(bh);
 	}
@@ -340,7 +339,6 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh,
 		BUFFER_TRACE(bh, "queue");
 		get_bh(bh);
 		J_ASSERT_BH(bh, !buffer_jwrite(bh));
-		set_buffer_jwrite(bh);
 		journal->j_chkpt_bhs[*batch_count] = bh;
 		__buffer_relink_io(jh);
 		jbd_unlock_bh_state(bh);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 5/8] jbd2: __jbd2_journal_temp_unlink_buffer() is static
  2012-02-15 18:34 [PATCH 0/8 v3] Checkpointing fixes and cleanups Jan Kara
                   ` (3 preceding siblings ...)
  2012-02-15 18:34 ` [PATCH 4/8] jbd2: Fix BH_JWrite setting in checkpointing code Jan Kara
@ 2012-02-15 18:34 ` Jan Kara
  2012-02-15 18:34 ` [PATCH 6/8] jbd2: Remove always true condition in __journal_try_to_free_buffer() Jan Kara
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2012-02-15 18:34 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index a0e41a4..acd08be 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1548,9 +1548,9 @@ __blist_del_buffer(struct journal_head **list, struct journal_head *jh)
  * of these pointers, it could go bad.  Generally the caller needs to re-read
  * the pointer from the transaction_t.
  *
- * Called under j_list_lock.  The journal may not be locked.
+ * Called under j_list_lock.
  */
-void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh)
+static void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh)
 {
 	struct journal_head **list = NULL;
 	transaction_t *transaction;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 6/8] jbd2: Remove always true condition in __journal_try_to_free_buffer()
  2012-02-15 18:34 [PATCH 0/8 v3] Checkpointing fixes and cleanups Jan Kara
                   ` (4 preceding siblings ...)
  2012-02-15 18:34 ` [PATCH 5/8] jbd2: __jbd2_journal_temp_unlink_buffer() is static Jan Kara
@ 2012-02-15 18:34 ` Jan Kara
  2012-02-15 18:34 ` [PATCH 7/8] jbd2: Remove bh_state lock from checkpointing code Jan Kara
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2012-02-15 18:34 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

The check b_jlist == BJ_None in __journal_try_to_free_buffer() is
always true (__jbd2_journal_temp_unlink_buffer() also checks this in
an assertion) so just remove it.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c |    6 ++----
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index acd08be..52154c9 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1645,10 +1645,8 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
 	spin_lock(&journal->j_list_lock);
 	if (jh->b_cp_transaction != NULL && jh->b_transaction == NULL) {
 		/* written-back checkpointed metadata buffer */
-		if (jh->b_jlist == BJ_None) {
-			JBUFFER_TRACE(jh, "remove from checkpoint list");
-			__jbd2_journal_remove_checkpoint(jh);
-		}
+		JBUFFER_TRACE(jh, "remove from checkpoint list");
+		__jbd2_journal_remove_checkpoint(jh);
 	}
 	spin_unlock(&journal->j_list_lock);
 out:
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 7/8] jbd2: Remove bh_state lock from checkpointing code
  2012-02-15 18:34 [PATCH 0/8 v3] Checkpointing fixes and cleanups Jan Kara
                   ` (5 preceding siblings ...)
  2012-02-15 18:34 ` [PATCH 6/8] jbd2: Remove always true condition in __journal_try_to_free_buffer() Jan Kara
@ 2012-02-15 18:34 ` Jan Kara
  2012-02-15 18:34 ` [PATCH 8/8] jbd2: Cleanup journal tail after transaction commit Jan Kara
  2012-02-29 11:03 ` [PATCH 0/8 v3] Checkpointing fixes and cleanups Jan Kara
  8 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2012-02-15 18:34 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

All accesses to checkpointing entries in journal_head are protected
by j_list_lock. Thus __jbd2_journal_remove_checkpoint() doesn't really
need bh_state lock.

Also the only part of journal head that the rest of checkpointing code
needs to check is jh->b_transaction which is safe to read under
j_list_lock.

So we can safely remove bh_state lock from all of checkpointing code which
makes it considerably prettier.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c         |   59 +++++-------------------------------------
 include/linux/journal-head.h |    2 +
 2 files changed, 9 insertions(+), 52 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index e68a820..b884adf 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -88,14 +88,13 @@ static inline void __buffer_relink_io(struct journal_head *jh)
  * whole transaction.
  *
  * Requires j_list_lock
- * Called under jbd_lock_bh_state(jh2bh(jh)), and drops it
  */
 static int __try_to_free_cp_buf(struct journal_head *jh)
 {
 	int ret = 0;
 	struct buffer_head *bh = jh2bh(jh);
 
-	if (jh->b_jlist == BJ_None && !buffer_locked(bh) &&
+	if (jh->b_transaction == NULL && !buffer_locked(bh) &&
 	    !buffer_dirty(bh) && !buffer_write_io_error(bh)) {
 		/*
 		 * Get our reference so that bh cannot be freed before
@@ -104,11 +103,8 @@ static int __try_to_free_cp_buf(struct journal_head *jh)
 		get_bh(bh);
 		JBUFFER_TRACE(jh, "remove from checkpoint list");
 		ret = __jbd2_journal_remove_checkpoint(jh) + 1;
-		jbd_unlock_bh_state(bh);
 		BUFFER_TRACE(bh, "release");
 		__brelse(bh);
-	} else {
-		jbd_unlock_bh_state(bh);
 	}
 	return ret;
 }
@@ -180,21 +176,6 @@ void __jbd2_log_wait_for_space(journal_t *journal)
 }
 
 /*
- * We were unable to perform jbd_trylock_bh_state() inside j_list_lock.
- * The caller must restart a list walk.  Wait for someone else to run
- * jbd_unlock_bh_state().
- */
-static void jbd_sync_bh(journal_t *journal, struct buffer_head *bh)
-	__releases(journal->j_list_lock)
-{
-	get_bh(bh);
-	spin_unlock(&journal->j_list_lock);
-	jbd_lock_bh_state(bh);
-	jbd_unlock_bh_state(bh);
-	put_bh(bh);
-}
-
-/*
  * Clean up transaction's list of buffers submitted for io.
  * We wait for any pending IO to complete and remove any clean
  * buffers. Note that we take the buffers in the opposite ordering
@@ -222,15 +203,9 @@ restart:
 	while (!released && transaction->t_checkpoint_io_list) {
 		jh = transaction->t_checkpoint_io_list;
 		bh = jh2bh(jh);
-		if (!jbd_trylock_bh_state(bh)) {
-			jbd_sync_bh(journal, bh);
-			spin_lock(&journal->j_list_lock);
-			goto restart;
-		}
 		get_bh(bh);
 		if (buffer_locked(bh)) {
 			spin_unlock(&journal->j_list_lock);
-			jbd_unlock_bh_state(bh);
 			wait_on_buffer(bh);
 			/* the journal_head may have gone by now */
 			BUFFER_TRACE(bh, "brelse");
@@ -246,7 +221,6 @@ restart:
 		 * it has been written out and so we can drop it from the list
 		 */
 		released = __jbd2_journal_remove_checkpoint(jh);
-		jbd_unlock_bh_state(bh);
 		__brelse(bh);
 	}
 
@@ -280,7 +254,6 @@ __flush_batch(journal_t *journal, int *batch_count)
  * be written out.
  *
  * Called with j_list_lock held and drops it if 1 is returned
- * Called under jbd_lock_bh_state(jh2bh(jh)), and drops it
  */
 static int __process_buffer(journal_t *journal, struct journal_head *jh,
 			    int *batch_count, transaction_t *transaction)
@@ -291,7 +264,6 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh,
 	if (buffer_locked(bh)) {
 		get_bh(bh);
 		spin_unlock(&journal->j_list_lock);
-		jbd_unlock_bh_state(bh);
 		wait_on_buffer(bh);
 		/* the journal_head may have gone by now */
 		BUFFER_TRACE(bh, "brelse");
@@ -303,7 +275,6 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh,
 
 		transaction->t_chp_stats.cs_forced_to_close++;
 		spin_unlock(&journal->j_list_lock);
-		jbd_unlock_bh_state(bh);
 		if (unlikely(journal->j_flags & JBD2_UNMOUNT))
 			/*
 			 * The journal thread is dead; so starting and
@@ -322,11 +293,9 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh,
 		if (unlikely(buffer_write_io_error(bh)))
 			ret = -EIO;
 		get_bh(bh);
-		J_ASSERT_JH(jh, !buffer_jbddirty(bh));
 		BUFFER_TRACE(bh, "remove from checkpoint");
 		__jbd2_journal_remove_checkpoint(jh);
 		spin_unlock(&journal->j_list_lock);
-		jbd_unlock_bh_state(bh);
 		__brelse(bh);
 	} else {
 		/*
@@ -341,7 +310,6 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh,
 		J_ASSERT_BH(bh, !buffer_jwrite(bh));
 		journal->j_chkpt_bhs[*batch_count] = bh;
 		__buffer_relink_io(jh);
-		jbd_unlock_bh_state(bh);
 		transaction->t_chp_stats.cs_written++;
 		(*batch_count)++;
 		if (*batch_count == JBD2_NR_BATCH) {
@@ -405,15 +373,7 @@ restart:
 		int retry = 0, err;
 
 		while (!retry && transaction->t_checkpoint_list) {
-			struct buffer_head *bh;
-
 			jh = transaction->t_checkpoint_list;
-			bh = jh2bh(jh);
-			if (!jbd_trylock_bh_state(bh)) {
-				jbd_sync_bh(journal, bh);
-				retry = 1;
-				break;
-			}
 			retry = __process_buffer(journal, jh, &batch_count,
 						 transaction);
 			if (retry < 0 && !result)
@@ -529,15 +489,12 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
 	do {
 		jh = next_jh;
 		next_jh = jh->b_cpnext;
-		/* Use trylock because of the ranking */
-		if (jbd_trylock_bh_state(jh2bh(jh))) {
-			ret = __try_to_free_cp_buf(jh);
-			if (ret) {
-				freed++;
-				if (ret == 2) {
-					*released = 1;
-					return freed;
-				}
+		ret = __try_to_free_cp_buf(jh);
+		if (ret) {
+			freed++;
+			if (ret == 2) {
+				*released = 1;
+				return freed;
 			}
 		}
 		/*
@@ -620,9 +577,7 @@ out:
  * The function can free jh and bh.
  *
  * This function is called with j_list_lock held.
- * This function is called with jbd_lock_bh_state(jh2bh(jh))
  */

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 8/8] jbd2: Cleanup journal tail after transaction commit
  2012-02-15 18:34 [PATCH 0/8 v3] Checkpointing fixes and cleanups Jan Kara
                   ` (6 preceding siblings ...)
  2012-02-15 18:34 ` [PATCH 7/8] jbd2: Remove bh_state lock from checkpointing code Jan Kara
@ 2012-02-15 18:34 ` Jan Kara
  2012-02-15 22:03   ` Andreas Dilger
  2012-02-29 11:03 ` [PATCH 0/8 v3] Checkpointing fixes and cleanups Jan Kara
  8 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2012-02-15 18:34 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Normally, we have to issue a cache flush before we can update journal tail in
journal superblock, effectively wiping out old transactions from the journal.
So use the fact that during transaction commit we issue cache flush anyway and
opportunistically push journal tail as far as we can. Since update of journal
superblock is still costly (we have to use WRITE_FUA), we update log tail only
if we can free significant amount of space.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/commit.c     |   32 ++++++++++++++++++++++++++++++++
 fs/jbd2/journal.c    |   13 +++++++++++++
 include/linux/jbd2.h |    1 +
 3 files changed, 46 insertions(+), 0 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index f37b783..245201c 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -331,6 +331,10 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	struct buffer_head *cbh = NULL; /* For transactional checksums */
 	__u32 crc32_sum = ~0;
 	struct blk_plug plug;
+	/* Tail of the journal */
+	unsigned long first_block;
+	tid_t first_tid;
+	int update_tail;
 
 	/*
 	 * First job: lock down the current transaction and wait for
@@ -682,10 +686,30 @@ start_journal_io:
 		err = 0;
 	}
 
+	/*
+	 * Get current oldest transaction in the log before we issue flush
+	 * to the filesystem device. After the flush we can be sure that
+	 * blocks of all older transactions are checkpointed to persistent
+	 * storage and we will be safe to update journal start in the
+	 * superblock with the numbers we get here.
+	 */
+	update_tail =
+		jbd2_journal_get_log_tail(journal, &first_tid, &first_block);
+
 	write_lock(&journal->j_state_lock);
+	if (update_tail) {
+		long freed = first_block - journal->j_tail;
+
+		if (first_block < journal->j_tail)
+			freed += journal->j_last - journal->j_first;
+		/* Update tail only if we free significant amount of space */
+		if (freed < journal->j_maxlen / 4)
+			update_tail = 0;
+	}
 	J_ASSERT(commit_transaction->t_state == T_COMMIT);
 	commit_transaction->t_state = T_COMMIT_DFLUSH;
 	write_unlock(&journal->j_state_lock);
+
 	/* 
 	 * If the journal is not located on the file system device,
 	 * then we must flush the file system device before we issue
@@ -836,6 +860,14 @@ wait_for_iobuf:
 	if (err)
 		jbd2_journal_abort(journal, err);
 
+	/*
+	 * Now disk caches for filesystem device are flushed so we are safe to
+	 * erase checkpointed transactions from the log by updating journal
+	 * superblock.
+	 */
+	if (update_tail)
+		jbd2_update_log_tail(journal, first_tid, first_block);
+
 	/* End of a transaction!  Finally, we can do checkpoint
            processing: any buffers committed as a result of this
            transaction can be removed from any checkpoint list it was on
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 192558f..be304b9 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -823,6 +823,19 @@ void __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block)
 	write_unlock(&journal->j_state_lock);
 }
 
+/*
+ * This is a variaon of __jbd2_update_log_tail which checks for validity of
+ * provided log tail and locks j_checkpoint_mutex. So it is safe against races
+ * with other threads updating log tail.
+ */
+void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block)
+{
+	mutex_lock(&journal->j_checkpoint_mutex);
+	if (tid_gt(tid, journal->j_tail_sequence))
+		__jbd2_update_log_tail(journal, tid, block);
+	mutex_unlock(&journal->j_checkpoint_mutex);
+}
+
 struct jbd2_stats_proc_session {
 	journal_t *journal;
 	struct transaction_stats_s *stats;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 1b10c2a..8dccaaa 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -974,6 +974,7 @@ int jbd2_journal_next_log_block(journal_t *, unsigned long long *);
 int jbd2_journal_get_log_tail(journal_t *journal, tid_t *tid,
 			      unsigned long *block);
 void __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
+void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
 
 /* Commit management */
 extern void jbd2_journal_commit_transaction(journal_t *);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 8/8] jbd2: Cleanup journal tail after transaction commit
  2012-02-15 18:34 ` [PATCH 8/8] jbd2: Cleanup journal tail after transaction commit Jan Kara
@ 2012-02-15 22:03   ` Andreas Dilger
  2012-02-16 12:59     ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Andreas Dilger @ 2012-02-15 22:03 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On 2012-02-15, at 10:34 AM, Jan Kara wrote:
> Normally, we have to issue a cache flush before we can update journal tail in
> journal superblock, effectively wiping out old transactions from the journal.
> So use the fact that during transaction commit we issue cache flush anyway and
> opportunistically push journal tail as far as we can. Since update of journal
> superblock is still costly (we have to use WRITE_FUA), we update log tail only
> if we can free significant amount of space.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> fs/jbd2/commit.c     |   32 ++++++++++++++++++++++++++++++++
> fs/jbd2/journal.c    |   13 +++++++++++++
> include/linux/jbd2.h |    1 +
> 3 files changed, 46 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index f37b783..245201c 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -331,6 +331,10 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> 	struct buffer_head *cbh = NULL; /* For transactional checksums */
> 	__u32 crc32_sum = ~0;
> 	struct blk_plug plug;
> +	/* Tail of the journal */
> +	unsigned long first_block;
> +	tid_t first_tid;
> +	int update_tail;
> 
> 	/*
> 	 * First job: lock down the current transaction and wait for
> @@ -682,10 +686,30 @@ start_journal_io:
> 		err = 0;
> 	}
> 
> +	/*
> +	 * Get current oldest transaction in the log before we issue flush
> +	 * to the filesystem device. After the flush we can be sure that
> +	 * blocks of all older transactions are checkpointed to persistent
> +	 * storage and we will be safe to update journal start in the
> +	 * superblock with the numbers we get here.
> +	 */
> +	update_tail =
> +		jbd2_journal_get_log_tail(journal, &first_tid, &first_block);
> +
> 	write_lock(&journal->j_state_lock);
> +	if (update_tail) {
> +		long freed = first_block - journal->j_tail;
> +
> +		if (first_block < journal->j_tail)
> +			freed += journal->j_last - journal->j_first;
> +		/* Update tail only if we free significant amount of space */
> +		if (freed < journal->j_maxlen / 4)
> +			update_tail = 0;
> +	}

Have you done any performance testing on this?  I expect that it may give
a nice boost in performance when there are lots of small transactions in
the journal.  However, it might also increase latency if the journal is
nearly full and no new transactions can be started until 1/4 of the journal
is checkpointed.

This should probably be conditional on a decent amount of free blocks left
in the journal, for example:

		if (j_free >= j_maxlen / 8 && freed < journal->j_maxlen / 4)
			update_tail = 0;

or
		if (freed >= j_free)
			update_tail = 0;

Cheers, Andreas






^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 8/8] jbd2: Cleanup journal tail after transaction commit
  2012-02-15 22:03   ` Andreas Dilger
@ 2012-02-16 12:59     ` Jan Kara
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2012-02-16 12:59 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Jan Kara, Ted Tso, linux-ext4

On Wed 15-02-12 14:03:30, Andreas Dilger wrote:
> On 2012-02-15, at 10:34 AM, Jan Kara wrote:
> > Normally, we have to issue a cache flush before we can update journal tail in
> > journal superblock, effectively wiping out old transactions from the journal.
> > So use the fact that during transaction commit we issue cache flush anyway and
> > opportunistically push journal tail as far as we can. Since update of journal
> > superblock is still costly (we have to use WRITE_FUA), we update log tail only
> > if we can free significant amount of space.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> > fs/jbd2/commit.c     |   32 ++++++++++++++++++++++++++++++++
> > fs/jbd2/journal.c    |   13 +++++++++++++
> > include/linux/jbd2.h |    1 +
> > 3 files changed, 46 insertions(+), 0 deletions(-)
> > 
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index f37b783..245201c 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -331,6 +331,10 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> > 	struct buffer_head *cbh = NULL; /* For transactional checksums */
> > 	__u32 crc32_sum = ~0;
> > 	struct blk_plug plug;
> > +	/* Tail of the journal */
> > +	unsigned long first_block;
> > +	tid_t first_tid;
> > +	int update_tail;
> > 
> > 	/*
> > 	 * First job: lock down the current transaction and wait for
> > @@ -682,10 +686,30 @@ start_journal_io:
> > 		err = 0;
> > 	}
> > 
> > +	/*
> > +	 * Get current oldest transaction in the log before we issue flush
> > +	 * to the filesystem device. After the flush we can be sure that
> > +	 * blocks of all older transactions are checkpointed to persistent
> > +	 * storage and we will be safe to update journal start in the
> > +	 * superblock with the numbers we get here.
> > +	 */
> > +	update_tail =
> > +		jbd2_journal_get_log_tail(journal, &first_tid, &first_block);
> > +
> > 	write_lock(&journal->j_state_lock);
> > +	if (update_tail) {
> > +		long freed = first_block - journal->j_tail;
> > +
> > +		if (first_block < journal->j_tail)
> > +			freed += journal->j_last - journal->j_first;
> > +		/* Update tail only if we free significant amount of space */
> > +		if (freed < journal->j_maxlen / 4)
> > +			update_tail = 0;
> > +	}
> 
> Have you done any performance testing on this?  I expect that it may give
> a nice boost in performance when there are lots of small transactions in
> the journal.  However, it might also increase latency if the journal is
> nearly full and no new transactions can be started until 1/4 of the journal
> is checkpointed.
  Well, I didn't do serious performance testing because I failed to find a
workload where I would think it could make a difference. For example for
workload creating and deleting directories and 0-length files, I saw about
one jbd2_cleanup_journal_tail() invocation per 200 transaction commits. So
the possible gain would be very well in the noise.

> This should probably be conditional on a decent amount of free blocks left
> in the journal, for example:
> 
> 		if (j_free >= j_maxlen / 8 && freed < journal->j_maxlen / 4)
> 			update_tail = 0;
> 
> or
> 		if (freed >= j_free)
> 			update_tail = 0;
  Yeah. Currently, I'm doubtful we should apply this patch at all. But if
we do, tweaking the condition is certainly possible. I mostly wanted to
gather people's thoughts regarding this particular patch in the first
round.

  Thanks for your comments.
								Honza

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/8 v3] Checkpointing fixes and cleanups
  2012-02-15 18:34 [PATCH 0/8 v3] Checkpointing fixes and cleanups Jan Kara
                   ` (7 preceding siblings ...)
  2012-02-15 18:34 ` [PATCH 8/8] jbd2: Cleanup journal tail after transaction commit Jan Kara
@ 2012-02-29 11:03 ` Jan Kara
  8 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2012-02-29 11:03 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4

On Wed 15-02-12 19:34:06, Jan Kara wrote:
>   Hello,
> 
>   this is a third round of my checkpointing fixes. The patch set is primarily
> aimed at fixing filesystem corruption happening occasionally (rather rarely)
> after power failure. First three patches should fix the issue. Patches 4-7 are
> assorted checkpointing cleanups I've gathered when inspecting checkpointing
> code. Finally patch 8 is a possible speedup of checkpoining - we can use cache
> flushes happening during transaction commits for pushing the journal tail
> safely. The observable speedup is disputable since jbd2_cleanup_journal_tail()
> is called rather rarely (for metadata heavy load I saw about one
> jbd2_cleanup_journal_tail() for about 200 commits) so the cost of additional
> cache flush will be likely in the noise. But the patch is simple enough so I
> send it for others to judge whether it makes sense or not.
> 
> Changes since v2:
> Rewrote the code pushing log tail to fix race described by Ted.
> Improved some comments
  Ted, do you plan to have a look at this? It would be nice to get these
fixes in for the next merge window...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/8] jbd2: Issue cache flush after checkpointing even with internal journal
  2012-02-15 18:34 ` [PATCH 3/8] jbd2: Issue cache flush after checkpointing even with internal journal Jan Kara
@ 2012-03-14  2:17   ` Ted Ts'o
  2012-03-15  8:59     ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Ted Ts'o @ 2012-03-14  2:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

I'm not sure why you pulled out the code for
jbd2_journal_get_log_tail(), since there's no other usage of the
function, but it did make this commit harder to review.  It's good for
code movement patches to be segregated to their own commit, just to
make it easier to review....

					- Ted

On Wed, Feb 15, 2012 at 07:34:09PM +0100, Jan Kara wrote:
> When we reach jbd2_cleanup_journal_tail(), there is no guarantee that
> checkpointed buffers are on a stable storage - especially if buffers were
> written out by jbd2_log_do_checkpoint(), they are likely to be only in disk's
> caches. Thus when we update journal superblock effectively removing old
> transaction from journal, this write of superblock can get to stable storage
> before those checkpointed buffers which can result in filesystem corruption
> after a crash. Thus we must unconditionally issue a cache flush before we
> update journal superblock in these cases.
> 
> A similar problem can also occur if journal superblock is written only in
> disk's caches, other transaction starts reusing space of the transaction
> cleaned from the log and power failure happens. Subsequent journal replay would
> still try to replay the old transaction but some of it's blocks may be already
> overwritten by the new transaction. For this reason we must use WRITE_FUA when
> updating log tail and we must first write new log tail to disk and update
> in-memory information only after that.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/jbd2/checkpoint.c        |   75 ++++--------------------
>  fs/jbd2/commit.c            |   11 +++-
>  fs/jbd2/journal.c           |  136 ++++++++++++++++++++++++++++++++++++------
>  fs/jbd2/recovery.c          |    5 +-
>  include/linux/jbd2.h        |    6 ++-
>  include/trace/events/jbd2.h |    2 +-
>  6 files changed, 148 insertions(+), 87 deletions(-)
> 
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index aa09868..de76f1f 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -478,79 +478,28 @@ out:
>  
>  int jbd2_cleanup_journal_tail(journal_t *journal)
>  {
> -	transaction_t * transaction;
>  	tid_t		first_tid;
> -	unsigned long	blocknr, freed;
> +	unsigned long	blocknr;
>  
>  	if (is_journal_aborted(journal))
>  		return 1;
>  
> -	/* OK, work out the oldest transaction remaining in the log, and
> -	 * the log block it starts at.
> -	 *
> -	 * If the log is now empty, we need to work out which is the
> -	 * next transaction ID we will write, and where it will
> -	 * start. */
> -
> -	write_lock(&journal->j_state_lock);
> -	spin_lock(&journal->j_list_lock);
> -	transaction = journal->j_checkpoint_transactions;
> -	if (transaction) {
> -		first_tid = transaction->t_tid;
> -		blocknr = transaction->t_log_start;
> -	} else if ((transaction = journal->j_committing_transaction) != NULL) {
> -		first_tid = transaction->t_tid;
> -		blocknr = transaction->t_log_start;
> -	} else if ((transaction = journal->j_running_transaction) != NULL) {
> -		first_tid = transaction->t_tid;
> -		blocknr = journal->j_head;
> -	} else {
> -		first_tid = journal->j_transaction_sequence;
> -		blocknr = journal->j_head;
> -	}
> -	spin_unlock(&journal->j_list_lock);
> -	J_ASSERT(blocknr != 0);
> -
> -	/* If the oldest pinned transaction is at the tail of the log
> -           already then there's not much we can do right now. */
> -	if (journal->j_tail_sequence == first_tid) {
> -		write_unlock(&journal->j_state_lock);
> +	if (!jbd2_journal_get_log_tail(journal, &first_tid, &blocknr))
>  		return 1;
> -	}
> -
> -	/* OK, update the superblock to recover the freed space.
> -	 * Physical blocks come first: have we wrapped beyond the end of
> -	 * the log?  */
> -	freed = blocknr - journal->j_tail;
> -	if (blocknr < journal->j_tail)
> -		freed = freed + journal->j_last - journal->j_first;
> -
> -	trace_jbd2_cleanup_journal_tail(journal, first_tid, blocknr, freed);
> -	jbd_debug(1,
> -		  "Cleaning journal tail from %d to %d (offset %lu), "
> -		  "freeing %lu\n",
> -		  journal->j_tail_sequence, first_tid, blocknr, freed);
> -
> -	journal->j_free += freed;
> -	journal->j_tail_sequence = first_tid;
> -	journal->j_tail = blocknr;
> -	write_unlock(&journal->j_state_lock);
> +	J_ASSERT(blocknr != 0);
>  
>  	/*
> -	 * If there is an external journal, we need to make sure that
> -	 * any data blocks that were recently written out --- perhaps
> -	 * by jbd2_log_do_checkpoint() --- are flushed out before we
> -	 * drop the transactions from the external journal.  It's
> -	 * unlikely this will be necessary, especially with a
> -	 * appropriately sized journal, but we need this to guarantee
> -	 * correctness.  Fortunately jbd2_cleanup_journal_tail()
> -	 * doesn't get called all that often.
> +	 * We need to make sure that any blocks that were recently written out
> +	 * --- perhaps by jbd2_log_do_checkpoint() --- are flushed out before
> +	 * we drop the transactions from the journal. It's unlikely this will
> +	 * be necessary, especially with an appropriately sized journal, but we
> +	 * need this to guarantee correctness.  Fortunately
> +	 * jbd2_cleanup_journal_tail() doesn't get called all that often.
>  	 */
> -	if ((journal->j_fs_dev != journal->j_dev) &&
> -	    (journal->j_flags & JBD2_BARRIER))
> +	if (journal->j_flags & JBD2_BARRIER)
>  		blkdev_issue_flush(journal->j_fs_dev, GFP_KERNEL, NULL);
> -	if (!(journal->j_flags & JBD2_ABORT))
> -		jbd2_journal_update_sb_log_tail(journal);
> +
> +	__jbd2_update_log_tail(journal, first_tid, blocknr);
>  	return 0;
>  }
>  
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 39978c1..f37b783 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -341,7 +341,16 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  	if (journal->j_flags & JBD2_FLUSHED) {
>  		jbd_debug(3, "super block updated\n");
>  		mutex_lock(&journal->j_checkpoint_mutex);
> -		jbd2_journal_update_sb_log_tail(journal);
> +		/*
> +		 * We hold j_checkpoint_mutex so tail cannot change under us.
> +		 * We don't need any special data guarantees for writing sb
> +		 * since journal is empty and it is ok for write to be
> +		 * flushed only with transaction commit.
> +		 */
> +		jbd2_journal_update_sb_log_tail(journal,
> +						journal->j_tail_sequence,
> +						journal->j_tail,
> +						WRITE_SYNC);
>  		mutex_unlock(&journal->j_checkpoint_mutex);
>  	} else {
>  		jbd_debug(3, "superblock not updated\n");
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 183a099..192558f 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -744,6 +744,85 @@ struct journal_head *jbd2_journal_get_descriptor_buffer(journal_t *journal)
>  	return jbd2_journal_add_journal_head(bh);
>  }
>  
> +/*
> + * Return tid of the oldest transaction in the journal and block in the journal
> + * where the transaction starts.
> + *
> + * If the journal is now empty, return which will be the next transaction ID
> + * we will write and where will that transaction start.
> + *
> + * The return value is 0 if journal tail cannot be pushed any further, 1 if
> + * it can.
> + */
> +int jbd2_journal_get_log_tail(journal_t *journal, tid_t *tid,
> +			      unsigned long *block)
> +{
> +	transaction_t *transaction;
> +	int ret;
> +
> +	read_lock(&journal->j_state_lock);
> +	spin_lock(&journal->j_list_lock);
> +	transaction = journal->j_checkpoint_transactions;
> +	if (transaction) {
> +		*tid = transaction->t_tid;
> +		*block = transaction->t_log_start;
> +	} else if ((transaction = journal->j_committing_transaction) != NULL) {
> +		*tid = transaction->t_tid;
> +		*block = transaction->t_log_start;
> +	} else if ((transaction = journal->j_running_transaction) != NULL) {
> +		*tid = transaction->t_tid;
> +		*block = journal->j_head;
> +	} else {
> +		*tid = journal->j_transaction_sequence;
> +		*block = journal->j_head;
> +	}
> +	ret = tid_gt(*tid, journal->j_tail_sequence);
> +	spin_unlock(&journal->j_list_lock);
> +	read_unlock(&journal->j_state_lock);
> +
> +	return ret;
> +}
> +
> +/*
> + * Update information in journal structure and in on disk journal superblock
> + * about log tail. This function does not check whether information passed in
> + * really pushes log tail further. It's responsibility of the caller to make
> + * sure provided log tail information is valid (e.g. by holding
> + * j_checkpoint_mutex all the time between computing log tail and calling this
> + * function as is the case with jbd2_cleanup_journal_tail()).
> + *
> + * Requires j_checkpoint_mutex
> + */
> +void __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block)
> +{
> +	unsigned long freed;
> +
> +	BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
> +
> +	/*
> +	 * We cannot afford for write to remain in drive's caches since as
> +	 * soon as we update j_tail, next transaction can start reusing journal
> +	 * space and if we lose sb update during power failure we'd replay
> +	 * old transaction with possibly newly overwritten data.
> +	 */
> +	jbd2_journal_update_sb_log_tail(journal, tid, block, WRITE_FUA);
> +	write_lock(&journal->j_state_lock);
> +	freed = block - journal->j_tail;
> +	if (block < journal->j_tail)
> +		freed += journal->j_last - journal->j_first;
> +
> +	trace_jbd2_update_log_tail(journal, tid, block, freed);
> +	jbd_debug(1,
> +		  "Cleaning journal tail from %d to %d (offset %lu), "
> +		  "freeing %lu\n",
> +		  journal->j_tail_sequence, tid, block, freed);
> +
> +	journal->j_free += freed;
> +	journal->j_tail_sequence = tid;
> +	journal->j_tail = block;
> +	write_unlock(&journal->j_state_lock);
> +}
> +
>  struct jbd2_stats_proc_session {
>  	journal_t *journal;
>  	struct transaction_stats_s *stats;
> @@ -1127,17 +1206,29 @@ static int journal_reset(journal_t *journal)
>  	} else {
>  		/* Lock here to make assertions happy... */
>  		mutex_lock(&journal->j_checkpoint_mutex);
> -		/* Add the dynamic fields and write it to disk. */
> -		jbd2_journal_update_sb_log_tail(journal);
> +		/*
> +		 * Update log tail information. We use WRITE_FUA since new
> +		 * transaction will start reusing journal space and so we
> +		 * must make sure information about current log tail is on
> +		 * disk before that.
> +		 */
> +		jbd2_journal_update_sb_log_tail(journal,
> +						journal->j_tail_sequence,
> +						journal->j_tail,
> +						WRITE_FUA);
>  		mutex_unlock(&journal->j_checkpoint_mutex);
>  	}
>  	return jbd2_journal_start_thread(journal);
>  }
>  
> -static void jbd2_write_superblock(journal_t *journal)
> +static void jbd2_write_superblock(journal_t *journal, int write_op)
>  {
>  	struct buffer_head *bh = journal->j_sb_buffer;
> +	int ret;
>  
> +	if (!(journal->j_flags & JBD2_BARRIER))
> +		write_op &= ~(REQ_FUA | REQ_FLUSH);
> +	lock_buffer(bh);
>  	if (buffer_write_io_error(bh)) {
>  		/*
>  		 * Oh, dear.  A previous attempt to write the journal
> @@ -1153,40 +1244,45 @@ static void jbd2_write_superblock(journal_t *journal)
>  		clear_buffer_write_io_error(bh);
>  		set_buffer_uptodate(bh);
>  	}
> -
> -	BUFFER_TRACE(bh, "marking dirty");
> -	mark_buffer_dirty(bh);
> -	sync_dirty_buffer(bh);
> +	get_bh(bh);
> +	bh->b_end_io = end_buffer_write_sync;
> +	ret = submit_bh(write_op, bh);
> +	wait_on_buffer(bh);
>  	if (buffer_write_io_error(bh)) {
> -		printk(KERN_ERR "JBD2: I/O error detected "
> -		       "when updating journal superblock for %s.\n",
> -		       journal->j_devname);
>  		clear_buffer_write_io_error(bh);
>  		set_buffer_uptodate(bh);
> +		ret = -EIO;
> +	}
> +	if (ret) {
> +		printk(KERN_ERR "JBD2: Error %d detected when updating "
> +		       "journal superblock for %s.\n", ret,
> +		       journal->j_devname);
>  	}
>  }
>  
>  /**
>   * jbd2_journal_update_sb_log_tail() - Update log tail in journal sb on disk.
>   * @journal: The journal to update.
> + * @tail_tid: TID of the new transaction at the tail of the log
> + * @tail_block: The first block of the transaction at the tail of the log
> + * @write_op: With which operation should we write the journal sb
>   *
>   * Update a journal's superblock information about log tail and write it to
>   * disk, waiting for the IO to complete.
>   */
> -void jbd2_journal_update_sb_log_tail(journal_t *journal)
> +void jbd2_journal_update_sb_log_tail(journal_t *journal, tid_t tail_tid,
> +				     unsigned long tail_block, int write_op)
>  {
>  	journal_superblock_t *sb = journal->j_superblock;
>  
>  	BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
> -	read_lock(&journal->j_state_lock);
> -	jbd_debug(1, "JBD2: updating superblock (start %ld, seq %d)\n",
> -		  journal->j_tail, journal->j_tail_sequence);
> +	jbd_debug(1, "JBD2: updating superblock (start %lu, seq %u)\n",
> +		  tail_block, tail_tid);
>  
> -	sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
> -	sb->s_start    = cpu_to_be32(journal->j_tail);
> -	read_unlock(&journal->j_state_lock);
> +	sb->s_sequence = cpu_to_be32(tail_tid);
> +	sb->s_start    = cpu_to_be32(tail_block);
>  
> -	jbd2_write_superblock(journal);
> +	jbd2_write_superblock(journal, write_op);
>  
>  	/* Log is no longer empty */
>  	write_lock(&journal->j_state_lock);
> @@ -1215,7 +1311,7 @@ static void jbd2_mark_journal_empty(journal_t *journal)
>  	sb->s_start    = cpu_to_be32(0);
>  	read_unlock(&journal->j_state_lock);
>  
> -	jbd2_write_superblock(journal);
> +	jbd2_write_superblock(journal, WRITE_FUA);
>  
>  	/* Log is no longer empty */
>  	write_lock(&journal->j_state_lock);
> @@ -1241,7 +1337,7 @@ static void jbd2_journal_update_sb_errno(journal_t *journal)
>  	sb->s_errno    = cpu_to_be32(journal->j_errno);
>  	read_unlock(&journal->j_state_lock);
>  
> -	jbd2_write_superblock(journal);
> +	jbd2_write_superblock(journal, WRITE_SYNC);
>  }
>  
>  /*
> diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
> index da6d7ba..c1a0335 100644
> --- a/fs/jbd2/recovery.c
> +++ b/fs/jbd2/recovery.c
> @@ -21,6 +21,7 @@
>  #include <linux/jbd2.h>
>  #include <linux/errno.h>
>  #include <linux/crc32.h>
> +#include <linux/blkdev.h>
>  #endif
>  
>  /*
> @@ -265,7 +266,9 @@ int jbd2_journal_recover(journal_t *journal)
>  	err2 = sync_blockdev(journal->j_fs_dev);
>  	if (!err)
>  		err = err2;
> -
> +	/* Make sure all replayed data is on permanent storage */
> +	if (journal->j_flags & JBD2_BARRIER)
> +		blkdev_issue_flush(journal->j_fs_dev, GFP_KERNEL, NULL);
>  	return err;
>  }
>  
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 78da912..1b10c2a 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -971,6 +971,9 @@ extern void __journal_clean_data_list(transaction_t *transaction);
>  /* Log buffer allocation */
>  extern struct journal_head * jbd2_journal_get_descriptor_buffer(journal_t *);
>  int jbd2_journal_next_log_block(journal_t *, unsigned long long *);
> +int jbd2_journal_get_log_tail(journal_t *journal, tid_t *tid,
> +			      unsigned long *block);
> +void __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
>  
>  /* Commit management */
>  extern void jbd2_journal_commit_transaction(journal_t *);
> @@ -1082,7 +1085,8 @@ extern int	   jbd2_journal_destroy    (journal_t *);
>  extern int	   jbd2_journal_recover    (journal_t *journal);
>  extern int	   jbd2_journal_wipe       (journal_t *, int);
>  extern int	   jbd2_journal_skip_recovery	(journal_t *);
> -extern void	   jbd2_journal_update_sb_log_tail	(journal_t *);
> +extern void	   jbd2_journal_update_sb_log_tail	(journal_t *, tid_t,
> +				unsigned long, int);
>  extern void	   __jbd2_journal_abort_hard	(journal_t *);
>  extern void	   jbd2_journal_abort      (journal_t *, int);
>  extern int	   jbd2_journal_errno      (journal_t *);
> diff --git a/include/trace/events/jbd2.h b/include/trace/events/jbd2.h
> index 7596441..5c74007 100644
> --- a/include/trace/events/jbd2.h
> +++ b/include/trace/events/jbd2.h
> @@ -200,7 +200,7 @@ TRACE_EVENT(jbd2_checkpoint_stats,
>  		  __entry->forced_to_close, __entry->written, __entry->dropped)
>  );
>  
> -TRACE_EVENT(jbd2_cleanup_journal_tail,
> +TRACE_EVENT(jbd2_update_log_tail,
>  
>  	TP_PROTO(journal_t *journal, tid_t first_tid,
>  		 unsigned long block_nr, unsigned long freed),
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/8] jbd2: Issue cache flush after checkpointing even with internal journal
  2012-03-14  2:17   ` Ted Ts'o
@ 2012-03-15  8:59     ` Jan Kara
  2012-03-19  3:45       ` Ted Ts'o
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2012-03-15  8:59 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: Jan Kara, linux-ext4

On Tue 13-03-12 22:17:45, Ted Tso wrote:
> I'm not sure why you pulled out the code for
> jbd2_journal_get_log_tail(), since there's no other usage of the
> function, but it did make this commit harder to review.  It's good for
> code movement patches to be segregated to their own commit, just to
> make it easier to review....
  I pulled the code out because the last patch in the series uses it from
commit code. So it was not a deliberate code movement. But I agree it might
have been easier to review if the code movement was a separate patch.
Should I do that or have you already coped with the patch as is?

								Honza

> On Wed, Feb 15, 2012 at 07:34:09PM +0100, Jan Kara wrote:
> > When we reach jbd2_cleanup_journal_tail(), there is no guarantee that
> > checkpointed buffers are on a stable storage - especially if buffers were
> > written out by jbd2_log_do_checkpoint(), they are likely to be only in disk's
> > caches. Thus when we update journal superblock effectively removing old
> > transaction from journal, this write of superblock can get to stable storage
> > before those checkpointed buffers which can result in filesystem corruption
> > after a crash. Thus we must unconditionally issue a cache flush before we
> > update journal superblock in these cases.
> > 
> > A similar problem can also occur if journal superblock is written only in
> > disk's caches, other transaction starts reusing space of the transaction
> > cleaned from the log and power failure happens. Subsequent journal replay would
> > still try to replay the old transaction but some of it's blocks may be already
> > overwritten by the new transaction. For this reason we must use WRITE_FUA when
> > updating log tail and we must first write new log tail to disk and update
> > in-memory information only after that.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/jbd2/checkpoint.c        |   75 ++++--------------------
> >  fs/jbd2/commit.c            |   11 +++-
> >  fs/jbd2/journal.c           |  136 ++++++++++++++++++++++++++++++++++++------
> >  fs/jbd2/recovery.c          |    5 +-
> >  include/linux/jbd2.h        |    6 ++-
> >  include/trace/events/jbd2.h |    2 +-
> >  6 files changed, 148 insertions(+), 87 deletions(-)
> > 
> > diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> > index aa09868..de76f1f 100644
> > --- a/fs/jbd2/checkpoint.c
> > +++ b/fs/jbd2/checkpoint.c
> > @@ -478,79 +478,28 @@ out:
> >  
> >  int jbd2_cleanup_journal_tail(journal_t *journal)
> >  {
> > -	transaction_t * transaction;
> >  	tid_t		first_tid;
> > -	unsigned long	blocknr, freed;
> > +	unsigned long	blocknr;
> >  
> >  	if (is_journal_aborted(journal))
> >  		return 1;
> >  
> > -	/* OK, work out the oldest transaction remaining in the log, and
> > -	 * the log block it starts at.
> > -	 *
> > -	 * If the log is now empty, we need to work out which is the
> > -	 * next transaction ID we will write, and where it will
> > -	 * start. */
> > -
> > -	write_lock(&journal->j_state_lock);
> > -	spin_lock(&journal->j_list_lock);
> > -	transaction = journal->j_checkpoint_transactions;
> > -	if (transaction) {
> > -		first_tid = transaction->t_tid;
> > -		blocknr = transaction->t_log_start;
> > -	} else if ((transaction = journal->j_committing_transaction) != NULL) {
> > -		first_tid = transaction->t_tid;
> > -		blocknr = transaction->t_log_start;
> > -	} else if ((transaction = journal->j_running_transaction) != NULL) {
> > -		first_tid = transaction->t_tid;
> > -		blocknr = journal->j_head;
> > -	} else {
> > -		first_tid = journal->j_transaction_sequence;
> > -		blocknr = journal->j_head;
> > -	}
> > -	spin_unlock(&journal->j_list_lock);
> > -	J_ASSERT(blocknr != 0);
> > -
> > -	/* If the oldest pinned transaction is at the tail of the log
> > -           already then there's not much we can do right now. */
> > -	if (journal->j_tail_sequence == first_tid) {
> > -		write_unlock(&journal->j_state_lock);
> > +	if (!jbd2_journal_get_log_tail(journal, &first_tid, &blocknr))
> >  		return 1;
> > -	}
> > -
> > -	/* OK, update the superblock to recover the freed space.
> > -	 * Physical blocks come first: have we wrapped beyond the end of
> > -	 * the log?  */
> > -	freed = blocknr - journal->j_tail;
> > -	if (blocknr < journal->j_tail)
> > -		freed = freed + journal->j_last - journal->j_first;
> > -
> > -	trace_jbd2_cleanup_journal_tail(journal, first_tid, blocknr, freed);
> > -	jbd_debug(1,
> > -		  "Cleaning journal tail from %d to %d (offset %lu), "
> > -		  "freeing %lu\n",
> > -		  journal->j_tail_sequence, first_tid, blocknr, freed);
> > -
> > -	journal->j_free += freed;
> > -	journal->j_tail_sequence = first_tid;
> > -	journal->j_tail = blocknr;
> > -	write_unlock(&journal->j_state_lock);
> > +	J_ASSERT(blocknr != 0);
> >  
> >  	/*
> > -	 * If there is an external journal, we need to make sure that
> > -	 * any data blocks that were recently written out --- perhaps
> > -	 * by jbd2_log_do_checkpoint() --- are flushed out before we
> > -	 * drop the transactions from the external journal.  It's
> > -	 * unlikely this will be necessary, especially with a
> > -	 * appropriately sized journal, but we need this to guarantee
> > -	 * correctness.  Fortunately jbd2_cleanup_journal_tail()
> > -	 * doesn't get called all that often.
> > +	 * We need to make sure that any blocks that were recently written out
> > +	 * --- perhaps by jbd2_log_do_checkpoint() --- are flushed out before
> > +	 * we drop the transactions from the journal. It's unlikely this will
> > +	 * be necessary, especially with an appropriately sized journal, but we
> > +	 * need this to guarantee correctness.  Fortunately
> > +	 * jbd2_cleanup_journal_tail() doesn't get called all that often.
> >  	 */
> > -	if ((journal->j_fs_dev != journal->j_dev) &&
> > -	    (journal->j_flags & JBD2_BARRIER))
> > +	if (journal->j_flags & JBD2_BARRIER)
> >  		blkdev_issue_flush(journal->j_fs_dev, GFP_KERNEL, NULL);
> > -	if (!(journal->j_flags & JBD2_ABORT))
> > -		jbd2_journal_update_sb_log_tail(journal);
> > +
> > +	__jbd2_update_log_tail(journal, first_tid, blocknr);
> >  	return 0;
> >  }
> >  
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index 39978c1..f37b783 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -341,7 +341,16 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> >  	if (journal->j_flags & JBD2_FLUSHED) {
> >  		jbd_debug(3, "super block updated\n");
> >  		mutex_lock(&journal->j_checkpoint_mutex);
> > -		jbd2_journal_update_sb_log_tail(journal);
> > +		/*
> > +		 * We hold j_checkpoint_mutex so tail cannot change under us.
> > +		 * We don't need any special data guarantees for writing sb
> > +		 * since journal is empty and it is ok for write to be
> > +		 * flushed only with transaction commit.
> > +		 */
> > +		jbd2_journal_update_sb_log_tail(journal,
> > +						journal->j_tail_sequence,
> > +						journal->j_tail,
> > +						WRITE_SYNC);
> >  		mutex_unlock(&journal->j_checkpoint_mutex);
> >  	} else {
> >  		jbd_debug(3, "superblock not updated\n");
> > diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> > index 183a099..192558f 100644
> > --- a/fs/jbd2/journal.c
> > +++ b/fs/jbd2/journal.c
> > @@ -744,6 +744,85 @@ struct journal_head *jbd2_journal_get_descriptor_buffer(journal_t *journal)
> >  	return jbd2_journal_add_journal_head(bh);
> >  }
> >  
> > +/*
> > + * Return tid of the oldest transaction in the journal and block in the journal
> > + * where the transaction starts.
> > + *
> > + * If the journal is now empty, return which will be the next transaction ID
> > + * we will write and where will that transaction start.
> > + *
> > + * The return value is 0 if journal tail cannot be pushed any further, 1 if
> > + * it can.
> > + */
> > +int jbd2_journal_get_log_tail(journal_t *journal, tid_t *tid,
> > +			      unsigned long *block)
> > +{
> > +	transaction_t *transaction;
> > +	int ret;
> > +
> > +	read_lock(&journal->j_state_lock);
> > +	spin_lock(&journal->j_list_lock);
> > +	transaction = journal->j_checkpoint_transactions;
> > +	if (transaction) {
> > +		*tid = transaction->t_tid;
> > +		*block = transaction->t_log_start;
> > +	} else if ((transaction = journal->j_committing_transaction) != NULL) {
> > +		*tid = transaction->t_tid;
> > +		*block = transaction->t_log_start;
> > +	} else if ((transaction = journal->j_running_transaction) != NULL) {
> > +		*tid = transaction->t_tid;
> > +		*block = journal->j_head;
> > +	} else {
> > +		*tid = journal->j_transaction_sequence;
> > +		*block = journal->j_head;
> > +	}
> > +	ret = tid_gt(*tid, journal->j_tail_sequence);
> > +	spin_unlock(&journal->j_list_lock);
> > +	read_unlock(&journal->j_state_lock);
> > +
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Update information in journal structure and in on disk journal superblock
> > + * about log tail. This function does not check whether information passed in
> > + * really pushes log tail further. It's responsibility of the caller to make
> > + * sure provided log tail information is valid (e.g. by holding
> > + * j_checkpoint_mutex all the time between computing log tail and calling this
> > + * function as is the case with jbd2_cleanup_journal_tail()).
> > + *
> > + * Requires j_checkpoint_mutex
> > + */
> > +void __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block)
> > +{
> > +	unsigned long freed;
> > +
> > +	BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
> > +
> > +	/*
> > +	 * We cannot afford for write to remain in drive's caches since as
> > +	 * soon as we update j_tail, next transaction can start reusing journal
> > +	 * space and if we lose sb update during power failure we'd replay
> > +	 * old transaction with possibly newly overwritten data.
> > +	 */
> > +	jbd2_journal_update_sb_log_tail(journal, tid, block, WRITE_FUA);
> > +	write_lock(&journal->j_state_lock);
> > +	freed = block - journal->j_tail;
> > +	if (block < journal->j_tail)
> > +		freed += journal->j_last - journal->j_first;
> > +
> > +	trace_jbd2_update_log_tail(journal, tid, block, freed);
> > +	jbd_debug(1,
> > +		  "Cleaning journal tail from %d to %d (offset %lu), "
> > +		  "freeing %lu\n",
> > +		  journal->j_tail_sequence, tid, block, freed);
> > +
> > +	journal->j_free += freed;
> > +	journal->j_tail_sequence = tid;
> > +	journal->j_tail = block;
> > +	write_unlock(&journal->j_state_lock);
> > +}
> > +
> >  struct jbd2_stats_proc_session {
> >  	journal_t *journal;
> >  	struct transaction_stats_s *stats;
> > @@ -1127,17 +1206,29 @@ static int journal_reset(journal_t *journal)
> >  	} else {
> >  		/* Lock here to make assertions happy... */
> >  		mutex_lock(&journal->j_checkpoint_mutex);
> > -		/* Add the dynamic fields and write it to disk. */
> > -		jbd2_journal_update_sb_log_tail(journal);
> > +		/*
> > +		 * Update log tail information. We use WRITE_FUA since new
> > +		 * transaction will start reusing journal space and so we
> > +		 * must make sure information about current log tail is on
> > +		 * disk before that.
> > +		 */
> > +		jbd2_journal_update_sb_log_tail(journal,
> > +						journal->j_tail_sequence,
> > +						journal->j_tail,
> > +						WRITE_FUA);
> >  		mutex_unlock(&journal->j_checkpoint_mutex);
> >  	}
> >  	return jbd2_journal_start_thread(journal);
> >  }
> >  
> > -static void jbd2_write_superblock(journal_t *journal)
> > +static void jbd2_write_superblock(journal_t *journal, int write_op)
> >  {
> >  	struct buffer_head *bh = journal->j_sb_buffer;
> > +	int ret;
> >  
> > +	if (!(journal->j_flags & JBD2_BARRIER))
> > +		write_op &= ~(REQ_FUA | REQ_FLUSH);
> > +	lock_buffer(bh);
> >  	if (buffer_write_io_error(bh)) {
> >  		/*
> >  		 * Oh, dear.  A previous attempt to write the journal
> > @@ -1153,40 +1244,45 @@ static void jbd2_write_superblock(journal_t *journal)
> >  		clear_buffer_write_io_error(bh);
> >  		set_buffer_uptodate(bh);
> >  	}
> > -
> > -	BUFFER_TRACE(bh, "marking dirty");
> > -	mark_buffer_dirty(bh);
> > -	sync_dirty_buffer(bh);
> > +	get_bh(bh);
> > +	bh->b_end_io = end_buffer_write_sync;
> > +	ret = submit_bh(write_op, bh);
> > +	wait_on_buffer(bh);
> >  	if (buffer_write_io_error(bh)) {
> > -		printk(KERN_ERR "JBD2: I/O error detected "
> > -		       "when updating journal superblock for %s.\n",
> > -		       journal->j_devname);
> >  		clear_buffer_write_io_error(bh);
> >  		set_buffer_uptodate(bh);
> > +		ret = -EIO;
> > +	}
> > +	if (ret) {
> > +		printk(KERN_ERR "JBD2: Error %d detected when updating "
> > +		       "journal superblock for %s.\n", ret,
> > +		       journal->j_devname);
> >  	}
> >  }
> >  
> >  /**
> >   * jbd2_journal_update_sb_log_tail() - Update log tail in journal sb on disk.
> >   * @journal: The journal to update.
> > + * @tail_tid: TID of the new transaction at the tail of the log
> > + * @tail_block: The first block of the transaction at the tail of the log
> > + * @write_op: With which operation should we write the journal sb
> >   *
> >   * Update a journal's superblock information about log tail and write it to
> >   * disk, waiting for the IO to complete.
> >   */
> > -void jbd2_journal_update_sb_log_tail(journal_t *journal)
> > +void jbd2_journal_update_sb_log_tail(journal_t *journal, tid_t tail_tid,
> > +				     unsigned long tail_block, int write_op)
> >  {
> >  	journal_superblock_t *sb = journal->j_superblock;
> >  
> >  	BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
> > -	read_lock(&journal->j_state_lock);
> > -	jbd_debug(1, "JBD2: updating superblock (start %ld, seq %d)\n",
> > -		  journal->j_tail, journal->j_tail_sequence);
> > +	jbd_debug(1, "JBD2: updating superblock (start %lu, seq %u)\n",
> > +		  tail_block, tail_tid);
> >  
> > -	sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
> > -	sb->s_start    = cpu_to_be32(journal->j_tail);
> > -	read_unlock(&journal->j_state_lock);
> > +	sb->s_sequence = cpu_to_be32(tail_tid);
> > +	sb->s_start    = cpu_to_be32(tail_block);
> >  
> > -	jbd2_write_superblock(journal);
> > +	jbd2_write_superblock(journal, write_op);
> >  
> >  	/* Log is no longer empty */
> >  	write_lock(&journal->j_state_lock);
> > @@ -1215,7 +1311,7 @@ static void jbd2_mark_journal_empty(journal_t *journal)
> >  	sb->s_start    = cpu_to_be32(0);
> >  	read_unlock(&journal->j_state_lock);
> >  
> > -	jbd2_write_superblock(journal);
> > +	jbd2_write_superblock(journal, WRITE_FUA);
> >  
> >  	/* Log is no longer empty */
> >  	write_lock(&journal->j_state_lock);
> > @@ -1241,7 +1337,7 @@ static void jbd2_journal_update_sb_errno(journal_t *journal)
> >  	sb->s_errno    = cpu_to_be32(journal->j_errno);
> >  	read_unlock(&journal->j_state_lock);
> >  
> > -	jbd2_write_superblock(journal);
> > +	jbd2_write_superblock(journal, WRITE_SYNC);
> >  }
> >  
> >  /*
> > diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
> > index da6d7ba..c1a0335 100644
> > --- a/fs/jbd2/recovery.c
> > +++ b/fs/jbd2/recovery.c
> > @@ -21,6 +21,7 @@
> >  #include <linux/jbd2.h>
> >  #include <linux/errno.h>
> >  #include <linux/crc32.h>
> > +#include <linux/blkdev.h>
> >  #endif
> >  
> >  /*
> > @@ -265,7 +266,9 @@ int jbd2_journal_recover(journal_t *journal)
> >  	err2 = sync_blockdev(journal->j_fs_dev);
> >  	if (!err)
> >  		err = err2;
> > -
> > +	/* Make sure all replayed data is on permanent storage */
> > +	if (journal->j_flags & JBD2_BARRIER)
> > +		blkdev_issue_flush(journal->j_fs_dev, GFP_KERNEL, NULL);
> >  	return err;
> >  }
> >  
> > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> > index 78da912..1b10c2a 100644
> > --- a/include/linux/jbd2.h
> > +++ b/include/linux/jbd2.h
> > @@ -971,6 +971,9 @@ extern void __journal_clean_data_list(transaction_t *transaction);
> >  /* Log buffer allocation */
> >  extern struct journal_head * jbd2_journal_get_descriptor_buffer(journal_t *);
> >  int jbd2_journal_next_log_block(journal_t *, unsigned long long *);
> > +int jbd2_journal_get_log_tail(journal_t *journal, tid_t *tid,
> > +			      unsigned long *block);
> > +void __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
> >  
> >  /* Commit management */
> >  extern void jbd2_journal_commit_transaction(journal_t *);
> > @@ -1082,7 +1085,8 @@ extern int	   jbd2_journal_destroy    (journal_t *);
> >  extern int	   jbd2_journal_recover    (journal_t *journal);
> >  extern int	   jbd2_journal_wipe       (journal_t *, int);
> >  extern int	   jbd2_journal_skip_recovery	(journal_t *);
> > -extern void	   jbd2_journal_update_sb_log_tail	(journal_t *);
> > +extern void	   jbd2_journal_update_sb_log_tail	(journal_t *, tid_t,
> > +				unsigned long, int);
> >  extern void	   __jbd2_journal_abort_hard	(journal_t *);
> >  extern void	   jbd2_journal_abort      (journal_t *, int);
> >  extern int	   jbd2_journal_errno      (journal_t *);
> > diff --git a/include/trace/events/jbd2.h b/include/trace/events/jbd2.h
> > index 7596441..5c74007 100644
> > --- a/include/trace/events/jbd2.h
> > +++ b/include/trace/events/jbd2.h
> > @@ -200,7 +200,7 @@ TRACE_EVENT(jbd2_checkpoint_stats,
> >  		  __entry->forced_to_close, __entry->written, __entry->dropped)
> >  );
> >  
> > -TRACE_EVENT(jbd2_cleanup_journal_tail,
> > +TRACE_EVENT(jbd2_update_log_tail,
> >  
> >  	TP_PROTO(journal_t *journal, tid_t first_tid,
> >  		 unsigned long block_nr, unsigned long freed),
> > -- 
> > 1.7.1
> > 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/8] jbd2: Issue cache flush after checkpointing even with internal journal
  2012-03-15  8:59     ` Jan Kara
@ 2012-03-19  3:45       ` Ted Ts'o
  0 siblings, 0 replies; 15+ messages in thread
From: Ted Ts'o @ 2012-03-19  3:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Thu, Mar 15, 2012 at 09:59:35AM +0100, Jan Kara wrote:
> On Tue 13-03-12 22:17:45, Ted Tso wrote:
> > I'm not sure why you pulled out the code for
> > jbd2_journal_get_log_tail(), since there's no other usage of the
> > function, but it did make this commit harder to review.  It's good for
> > code movement patches to be segregated to their own commit, just to
> > make it easier to review....
>   I pulled the code out because the last patch in the series uses it from
> commit code. So it was not a deliberate code movement. But I agree it might
> have been easier to review if the code movement was a separate patch.
> Should I do that or have you already coped with the patch as is?

No, it's fine as it is.  I've merged your patches into the ext4 tree
for the upcoming merge window.  Thanks!!

							 Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2012-03-19  3:45 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-02-15 18:34 [PATCH 0/8 v3] Checkpointing fixes and cleanups Jan Kara
2012-02-15 18:34 ` [PATCH 1/8] jbd2: Split updating of journal superblock and marking journal empty Jan Kara
2012-02-15 18:34 ` [PATCH 2/8] jbd2: Protect all log tail updates with j_checkpoint_mutex Jan Kara
2012-02-15 18:34 ` [PATCH 3/8] jbd2: Issue cache flush after checkpointing even with internal journal Jan Kara
2012-03-14  2:17   ` Ted Ts'o
2012-03-15  8:59     ` Jan Kara
2012-03-19  3:45       ` Ted Ts'o
2012-02-15 18:34 ` [PATCH 4/8] jbd2: Fix BH_JWrite setting in checkpointing code Jan Kara
2012-02-15 18:34 ` [PATCH 5/8] jbd2: __jbd2_journal_temp_unlink_buffer() is static Jan Kara
2012-02-15 18:34 ` [PATCH 6/8] jbd2: Remove always true condition in __journal_try_to_free_buffer() Jan Kara
2012-02-15 18:34 ` [PATCH 7/8] jbd2: Remove bh_state lock from checkpointing code Jan Kara
2012-02-15 18:34 ` [PATCH 8/8] jbd2: Cleanup journal tail after transaction commit Jan Kara
2012-02-15 22:03   ` Andreas Dilger
2012-02-16 12:59     ` Jan Kara
2012-02-29 11:03 ` [PATCH 0/8 v3] Checkpointing fixes and cleanups Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).