[PATCH] jbd: Fix a race between checkpointing code and journal_get_write

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] jbd: Fix a race between checkpointing code and journal_get_write_access()
@ 2009-06-24 16:02 Jan Kara
  2009-06-24 16:02 ` [PATCH] jbd2: " Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2009-06-24 16:02 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Jan Kara

The following race can happen:

  CPU1                          CPU2
                                checkpointing code checks the buffer, adds
                                  it to an array for writeback
do_get_write_access()
  ...
  lock_buffer()
  unlock_buffer()
                                  flush_batch() submits the buffer for IO
  __jbd_journal_file_buffer()

  So a buffer under writeout is returned from do_get_write_access(). Since
the filesystem code relies on the fact that journaled buffers cannot be
written out, it does not take the buffer lock and so it can modify buffer
while it is under writeout. That can lead to a filesystem corruption
if we crash at the right moment.
  We fix the problem by clearing the buffer dirty bit under buffer_lock
even if the buffer is on BJ_None list. Actually, we clear the dirty bit
regardless the list the buffer is in and warn about the fact if
the buffer is already journalled.

Thanks for spotting the problem goes to dingdinghua <dingdinghua85@gmail.com>.

Reported-by: dingdinghua <dingdinghua85@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd/transaction.c |   52 +++++++++++++++++++------------------------------
 1 files changed, 20 insertions(+), 32 deletions(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 73242ba..e9e141d 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -489,34 +489,15 @@ void journal_unlock_updates (journal_t *journal)
 	wake_up(&journal->j_wait_transaction_locked);
 }
 
-/*
- * Report any unexpected dirty buffers which turn up.  Normally those
- * indicate an error, but they can occur if the user is running (say)
- * tune2fs to modify the live filesystem, so we need the option of
- * continuing as gracefully as possible.  #
- *
- * The caller should already hold the journal lock and
- * j_list_lock spinlock: most callers will need those anyway
- * in order to probe the buffer's journaling state safely.
- */
-static void jbd_unexpected_dirty_buffer(struct journal_head *jh)
+static void warn_dirty_buffer(struct buffer_head *bh)
 {
-	int jlist;
-
-	/* If this buffer is one which might reasonably be dirty
-	 * --- ie. data, or not part of this journal --- then
-	 * we're OK to leave it alone, but otherwise we need to
-	 * move the dirty bit to the journal's own internal
-	 * JBDDirty bit. */
-	jlist = jh->b_jlist;
-
-	if (jlist == BJ_Metadata || jlist == BJ_Reserved ||
-	    jlist == BJ_Shadow || jlist == BJ_Forget) {
-		struct buffer_head *bh = jh2bh(jh);
+	char b[BDEVNAME_SIZE];
 
-		if (test_clear_buffer_dirty(bh))
-			set_buffer_jbddirty(bh);
-	}
+	printk(KERN_WARNING
+	       "JBD: Spotted dirty metadata buffer (dev = %s, blocknr = %llu). "
+	       "There's a risk of filesystem corruption in case of system "
+	       "crash.\n",
+	       bdevname(bh->b_bdev, b), (unsigned long long)bh->b_blocknr);
 }
 
 /*
@@ -583,14 +564,16 @@ repeat:
 			if (jh->b_next_transaction)
 				J_ASSERT_JH(jh, jh->b_next_transaction ==
 							transaction);
+			warn_dirty_buffer(bh);
 		}
 		/*
 		 * In any case we need to clean the dirty flag and we must
 		 * do it under the buffer lock to be sure we don't race
 		 * with running write-out.
 		 */
-		JBUFFER_TRACE(jh, "Unexpected dirty buffer");
-		jbd_unexpected_dirty_buffer(jh);
+		JBUFFER_TRACE(jh, "Journalling dirty buffer");
+		clear_buffer_dirty(bh);
+		set_buffer_jbddirty(bh);
 	}
 
 	unlock_buffer(bh);
@@ -2041,12 +2024,17 @@ void __journal_file_buffer(struct journal_head *jh,
 	if (jh->b_transaction && jh->b_jlist == jlist)
 		return;
 
-	/* The following list of buffer states needs to be consistent
-	 * with __jbd_unexpected_dirty_buffer()'s handling of dirty
-	 * state. */

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH] jbd2: Fix a race between checkpointing code and journal_get_write_access()
  2009-06-24 16:02 [PATCH] jbd: Fix a race between checkpointing code and journal_get_write_access() Jan Kara
@ 2009-06-24 16:02 ` Jan Kara
  2009-07-06  2:53   ` Theodore Tso
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2009-06-24 16:02 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Jan Kara

The following race can happen:

  CPU1                          CPU2
                                checkpointing code checks the buffer, adds
                                  it to an array for writeback
do_get_write_access()
  ...
  lock_buffer()
  unlock_buffer()
                                  flush_batch() submits the buffer for IO
  __jbd2_journal_file_buffer()

  So a buffer under writeout is returned from do_get_write_access(). Since
the filesystem code relies on the fact that journaled buffers cannot be
written out, it does not take the buffer lock and so it can modify buffer
while it is under writeout. That can lead to a filesystem corruption
if we crash at the right moment.
  We fix the problem by clearing the buffer dirty bit under buffer_lock
even if the buffer is on BJ_None list. Actually, we clear the dirty bit
regardless the list the buffer is in and warn about the fact if
the buffer is already journalled.

Thanks for spotting the problem goes to dingdinghua <dingdinghua85@gmail.com>.

Reported-by: dingdinghua <dingdinghua85@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c |   52 ++++++++++++++++++------------------------------
 1 files changed, 20 insertions(+), 32 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 494501e..86144aa 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -499,34 +499,15 @@ void jbd2_journal_unlock_updates (journal_t *journal)
 	wake_up(&journal->j_wait_transaction_locked);
 }
 
-/*
- * Report any unexpected dirty buffers which turn up.  Normally those
- * indicate an error, but they can occur if the user is running (say)
- * tune2fs to modify the live filesystem, so we need the option of
- * continuing as gracefully as possible.  #
- *
- * The caller should already hold the journal lock and
- * j_list_lock spinlock: most callers will need those anyway
- * in order to probe the buffer's journaling state safely.
- */
-static void jbd_unexpected_dirty_buffer(struct journal_head *jh)
+static void warn_dirty_buffer(struct buffer_head *bh)
 {
-	int jlist;
-
-	/* If this buffer is one which might reasonably be dirty
-	 * --- ie. data, or not part of this journal --- then
-	 * we're OK to leave it alone, but otherwise we need to
-	 * move the dirty bit to the journal's own internal
-	 * JBDDirty bit. */
-	jlist = jh->b_jlist;
-
-	if (jlist == BJ_Metadata || jlist == BJ_Reserved ||
-	    jlist == BJ_Shadow || jlist == BJ_Forget) {
-		struct buffer_head *bh = jh2bh(jh);
+	char b[BDEVNAME_SIZE];
 
-		if (test_clear_buffer_dirty(bh))
-			set_buffer_jbddirty(bh);
-	}
+	printk(KERN_WARNING
+	       "JBD: Spotted dirty metadata buffer (dev = %s, blocknr = %llu). "
+	       "There's a risk of filesystem corruption in case of system "
+	       "crash.\n",
+	       bdevname(bh->b_bdev, b), (unsigned long long)bh->b_blocknr);
 }
 
 /*
@@ -593,14 +574,16 @@ repeat:
 			if (jh->b_next_transaction)
 				J_ASSERT_JH(jh, jh->b_next_transaction ==
 							transaction);
+			warn_dirty_buffer(bh);
 		}
 		/*
 		 * In any case we need to clean the dirty flag and we must
 		 * do it under the buffer lock to be sure we don't race
 		 * with running write-out.
 		 */
-		JBUFFER_TRACE(jh, "Unexpected dirty buffer");
-		jbd_unexpected_dirty_buffer(jh);
+		JBUFFER_TRACE(jh, "Journalling dirty buffer");
+		clear_buffer_dirty(bh);
+		set_buffer_jbddirty(bh);
 	}
 
 	unlock_buffer(bh);
@@ -1896,12 +1879,17 @@ void __jbd2_journal_file_buffer(struct journal_head *jh,
 	if (jh->b_transaction && jh->b_jlist == jlist)
 		return;
 
-	/* The following list of buffer states needs to be consistent
-	 * with __jbd_unexpected_dirty_buffer()'s handling of dirty
-	 * state. */

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] jbd2: Fix a race between checkpointing code and journal_get_write_access()
  2009-06-24 16:02 ` [PATCH] jbd2: " Jan Kara
@ 2009-07-06  2:53   ` Theodore Tso
  2009-07-08 22:31     ` Theodore Tso
  0 siblings, 1 reply; 9+ messages in thread
From: Theodore Tso @ 2009-07-06  2:53 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Wed, Jun 24, 2009 at 06:02:40PM +0200, Jan Kara wrote:
> The following race can happen:
> 
>   CPU1                          CPU2
>                                 checkpointing code checks the buffer, adds
>                                   it to an array for writeback
> do_get_write_access()
>   ...
>   lock_buffer()
>   unlock_buffer()
>                                   flush_batch() submits the buffer for IO
>   __jbd2_journal_file_buffer()
> 
>   So a buffer under writeout is returned from do_get_write_access(). Since
> the filesystem code relies on the fact that journaled buffers cannot be
> written out, it does not take the buffer lock and so it can modify buffer
> while it is under writeout. That can lead to a filesystem corruption
> if we crash at the right moment.
>   We fix the problem by clearing the buffer dirty bit under buffer_lock
> even if the buffer is on BJ_None list. Actually, we clear the dirty bit
> regardless the list the buffer is in and warn about the fact if
> the buffer is already journalled.
> 
> Thanks for spotting the problem goes to dingdinghua <dingdinghua85@gmail.com>.
> 
> Reported-by: dingdinghua <dingdinghua85@gmail.com>
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks, applied to the ext4 patch queue.

						- Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] jbd2: Fix a race between checkpointing code and journal_get_write_access()
  2009-07-06  2:53   ` Theodore Tso
@ 2009-07-08 22:31     ` Theodore Tso
  2009-07-10 10:07       ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Theodore Tso @ 2009-07-08 22:31 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Sun, Jul 05, 2009 at 10:53:19PM -0400, Theodore Tso wrote:
> On Wed, Jun 24, 2009 at 06:02:40PM +0200, Jan Kara wrote:
> > The following race can happen:
> > 
> >   CPU1                          CPU2
> >                                 checkpointing code checks the buffer, adds
> >                                   it to an array for writeback
> > do_get_write_access()
> >   ...
> >   lock_buffer()
> >   unlock_buffer()
> >                                   flush_batch() submits the buffer for IO
> >   __jbd2_journal_file_buffer()
> > 
> >   So a buffer under writeout is returned from do_get_write_access(). Since
> > the filesystem code relies on the fact that journaled buffers cannot be
> > written out, it does not take the buffer lock and so it can modify buffer
> > while it is under writeout. That can lead to a filesystem corruption
> > if we crash at the right moment.
> >   We fix the problem by clearing the buffer dirty bit under buffer_lock
> > even if the buffer is on BJ_None list. Actually, we clear the dirty bit
> > regardless the list the buffer is in and warn about the fact if
> > the buffer is already journalled.

When running fsstress, we get the "Spotted dirty metadata buffer;
there's a risk of filesystem corruption in csae of a system crash" at
least half a dozen times or so.  That sounds like we have a problem.
Were you expecting that this was a "this should never happen"
situation, or is there a known bug that we need to fix here?

	      	       	       	   	- Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] jbd2: Fix a race between checkpointing code and journal_get_write_access()
  2009-07-08 22:31     ` Theodore Tso
@ 2009-07-10 10:07       ` Jan Kara
  2009-07-13 13:19         ` Theodore Tso
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2009-07-10 10:07 UTC (permalink / raw)
  To: Theodore Tso; +Cc: linux-ext4

[-- Attachment #1: Type: text/plain, Size: 2015 bytes --]

On Wed 08-07-09 18:31:50, Theodore Tso wrote:
> On Sun, Jul 05, 2009 at 10:53:19PM -0400, Theodore Tso wrote:
> > On Wed, Jun 24, 2009 at 06:02:40PM +0200, Jan Kara wrote:
> > > The following race can happen:
> > > 
> > >   CPU1                          CPU2
> > >                                 checkpointing code checks the buffer, adds
> > >                                   it to an array for writeback
> > > do_get_write_access()
> > >   ...
> > >   lock_buffer()
> > >   unlock_buffer()
> > >                                   flush_batch() submits the buffer for IO
> > >   __jbd2_journal_file_buffer()
> > > 
> > >   So a buffer under writeout is returned from do_get_write_access(). Since
> > > the filesystem code relies on the fact that journaled buffers cannot be
> > > written out, it does not take the buffer lock and so it can modify buffer
> > > while it is under writeout. That can lead to a filesystem corruption
> > > if we crash at the right moment.
> > >   We fix the problem by clearing the buffer dirty bit under buffer_lock
> > > even if the buffer is on BJ_None list. Actually, we clear the dirty bit
> > > regardless the list the buffer is in and warn about the fact if
> > > the buffer is already journalled.
> 
> When running fsstress, we get the "Spotted dirty metadata buffer;
> there's a risk of filesystem corruption in csae of a system crash" at
> least half a dozen times or so.  That sounds like we have a problem.
> Were you expecting that this was a "this should never happen"
> situation, or is there a known bug that we need to fix here?
  Yes, it should be "this should never happen", unless you run something
like tune2fs while the filesystem is mounted. But looking at the code, I
have missed that buffer could be dirty also in
jbd2_journal_get_create_access() because jbd2_journal_forget() does not
clear the dirty bit in case the buffer is just being committed. Attached
patch should fix it. Thanks for report.

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-jbd2-Clear-dirty-bit-in-jbd2_journal_get_create_acc.patch --]
[-- Type: text/x-patch, Size: 1593 bytes --]

>From 0ab2ad057f96f49ba443f317df8e6368dd76def7 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Fri, 10 Jul 2009 11:56:37 +0200
Subject: [PATCH] jbd2: Clear dirty bit in jbd2_journal_get_create_access()

In case we reallocate previously freed buffer, it can happen that the buffer we
reallocate is dirty (jbd2_journal_forget() got called while the previous
transaction was still committing). At the time we reallocate the buffer, the
transaction freeing the buffer must be already committed so we can just remove
the dirty bit. This silences the warning in jbd2_journal_file_buffer() about
metadata buffer being dirty.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 86144aa..0fb7c88 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -826,6 +826,15 @@ int jbd2_journal_get_create_access(handle_t *handle, struct buffer_head *bh)
 	J_ASSERT_JH(jh, buffer_locked(jh2bh(jh)));
 
 	if (jh->b_transaction == NULL) {
+		/*
+		 * Previous jbd2_journal_forget() could have left the buffer
+		 * with jbddirty bit set because it was being committed. When
+		 * the commit finished, we've filed the buffer for
+		 * checkpointing and marked it dirty. Now we are reallocating
+		 * the buffer so the transaction freeing it must have
+		 * committed and so it's safe to clear the dirty bit.
+		 */
+		clear_buffer_dirty(jh2bh(jh));
 		jh->b_transaction = transaction;
 
 		/* first access by this transaction */
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] jbd2: Fix a race between checkpointing code and journal_get_write_access()
  2009-07-10 10:07       ` Jan Kara
@ 2009-07-13 13:19         ` Theodore Tso
  2009-07-13 13:29           ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Theodore Tso @ 2009-07-13 13:19 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Jul 10, 2009 at 12:07:13PM +0200, Jan Kara wrote:
>   Yes, it should be "this should never happen", unless you run something
> like tune2fs while the filesystem is mounted. But looking at the code, I
> have missed that buffer could be dirty also in
> jbd2_journal_get_create_access() because jbd2_journal_forget() does not
> clear the dirty bit in case the buffer is just being committed. Attached
> patch should fix it. Thanks for report.

Thanks, added to the patch queue.

						- Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] jbd2: Fix a race between checkpointing code and journal_get_write_access()
  2009-07-13 13:19         ` Theodore Tso
@ 2009-07-13 13:29           ` Jan Kara
  2009-07-13 14:44             ` Theodore Tso
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2009-07-13 13:29 UTC (permalink / raw)
  To: Theodore Tso; +Cc: linux-ext4

On Mon 13-07-09 09:19:01, Theodore Tso wrote:
> On Fri, Jul 10, 2009 at 12:07:13PM +0200, Jan Kara wrote:
> >   Yes, it should be "this should never happen", unless you run something
> > like tune2fs while the filesystem is mounted. But looking at the code, I
> > have missed that buffer could be dirty also in
> > jbd2_journal_get_create_access() because jbd2_journal_forget() does not
> > clear the dirty bit in case the buffer is just being committed. Attached
> > patch should fix it. Thanks for report.
> 
> Thanks, added to the patch queue.
  I've been doing some more tests here and there's one more path which emits
the warning when deleting symlinks (falsely this time). I'll send you the fix
when I verify that I cannot trigger the warning anymore.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] jbd2: Fix a race between checkpointing code and journal_get_write_access()
  2009-07-13 13:29           ` Jan Kara
@ 2009-07-13 14:44             ` Theodore Tso
  2009-07-13 15:29               ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Theodore Tso @ 2009-07-13 14:44 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Mon, Jul 13, 2009 at 03:29:58PM +0200, Jan Kara wrote:
>   I've been doing some more tests here and there's one more path which emits
> the warning when deleting symlinks (falsely this time). I'll send you the fix
> when I verify that I cannot trigger the warning anymore.

Ah, that would explain why I'm stil seeing the warning when fsstress
runs.  Was it doing it on long symlinks or short symlinks, out of
curiosity?

						- Ted



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] jbd2: Fix a race between checkpointing code and journal_get_write_access()
  2009-07-13 14:44             ` Theodore Tso
@ 2009-07-13 15:29               ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2009-07-13 15:29 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Jan Kara, linux-ext4

On Mon 13-07-09 10:44:47, Theodore Tso wrote:
> On Mon, Jul 13, 2009 at 03:29:58PM +0200, Jan Kara wrote:
> >   I've been doing some more tests here and there's one more path which emits
> > the warning when deleting symlinks (falsely this time). I'll send you the fix
> > when I verify that I cannot trigger the warning anymore.
> 
> Ah, that would explain why I'm stil seeing the warning when fsstress
> runs.  Was it doing it on long symlinks or short symlinks, out of
> curiosity?
  Long symlinks - the problem was that __dispose_buffer() called from
ext4_invalidatepage() did __jbd2_journal_unfile_buffer() (jbddirty bit was
moved to dirty bit) and then __jbd2_journal_file_buffer() which complained.
As we cleared the jbddirty bit immediately afterwards it was quite
straightforward to fix.  I've audited all other occurences of
jbd2_journal_unfile_buffer() and we discard the buffer head after all other
calls...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2009-07-13 15:29 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-06-24 16:02 [PATCH] jbd: Fix a race between checkpointing code and journal_get_write_access() Jan Kara
2009-06-24 16:02 ` [PATCH] jbd2: " Jan Kara
2009-07-06  2:53   ` Theodore Tso
2009-07-08 22:31     ` Theodore Tso
2009-07-10 10:07       ` Jan Kara
2009-07-13 13:19         ` Theodore Tso
2009-07-13 13:29           ` Jan Kara
2009-07-13 14:44             ` Theodore Tso
2009-07-13 15:29               ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).