[RFC] Bug in journal_commit

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC] Bug in journal_commit_transaction?
@ 2006-04-11 13:35 Jan Kara
  2006-04-12  1:27 ` Andrew Morton
  2006-04-12  3:13 ` Stephen C. Tweedie
  0 siblings, 2 replies; 9+ messages in thread
From: Jan Kara @ 2006-04-11 13:35 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: sct, akpm

  Hello,

  I think I have found a possible cause of assertion failure in
journal_commit_buffer() on line 793 (b_next_transaction == NULL) (which
was reported about two times as I remember). And I think the problem is
just a few lines below in cond_resched_lock(&journal->j_list_lock);
  The problem is following: we are processing t_forget list of a
committing transaction. On this forget list are among other buffers also
bitmaps and buffers that are freed by this transaction. Now if bitmap
buffer is processed first, we switch to new bitmap (hence the buffer
freed by this transaction is again available for allocation). If we
reschedule on that cond_resched_lock() and in the meantime someone
allocates the buffer, we later fail with that assertion failure.
  Quick and easy fix to the problem is to just remove that
cond_resched_lock(). But I guess that may harm latency. Another solution
would be to always keep bitmap buffers in the end of forget list (or
create separate list for bitmap buffers) but that sounds a bit hacky.
Any other idea?

							Bye
								Honza

-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Bug in journal_commit_transaction?
  2006-04-11 13:35 [RFC] Bug in journal_commit_transaction? Jan Kara
@ 2006-04-12  1:27 ` Andrew Morton
  2006-04-15 21:06   ` Jan Kara
  2006-04-12  3:13 ` Stephen C. Tweedie
  1 sibling, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2006-04-12  1:27 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, sct

Jan Kara <jack@suse.cz> wrote:
>
>   Hello,
> 
>   I think I have found a possible cause of assertion failure in
> journal_commit_buffer() on line 793 (b_next_transaction == NULL) (which

you mean journal_commit_transaction().

> was reported about two times as I remember). And I think the problem is
> just a few lines below in cond_resched_lock(&journal->j_list_lock);
>   The problem is following: we are processing t_forget list of a
> committing transaction. On this forget list are among other buffers also
> bitmaps and buffers that are freed by this transaction. Now if bitmap
> buffer is processed first, we switch to new bitmap (hence the buffer
> freed by this transaction is again available for allocation). If we
> reschedule on that cond_resched_lock() and in the meantime someone
> allocates the buffer, we later fail with that assertion failure.

I assume you're referring to the cond_resched_lock() at line 799.

I don't really follow your description here (it seems to be missing bits),
but that loop drops that lock in other places - why wouldn't those places
also be vulnerable?


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Bug in journal_commit_transaction?
  2006-04-11 13:35 [RFC] Bug in journal_commit_transaction? Jan Kara
  2006-04-12  1:27 ` Andrew Morton
@ 2006-04-12  3:13 ` Stephen C. Tweedie
  2006-04-15 21:08   ` Jan Kara
  1 sibling, 1 reply; 9+ messages in thread
From: Stephen C. Tweedie @ 2006-04-12  3:13 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, akpm, Stephen Tweedie

Hi,

On Tue, 2006-04-11 at 15:35 +0200, Jan Kara wrote:

>   The problem is following: we are processing t_forget list of a
> committing transaction. On this forget list are among other buffers also
> bitmaps and buffers that are freed by this transaction. 

Bitmaps can never be freed.  But the forget list isn't just for freed
data: we keep things on the forget list once they are written to the
journal, so that on final commit we know which buffers are now safe to
be written back to the main backing store.

> Now if bitmap
> buffer is processed first, we switch to new bitmap (hence the buffer
> freed by this transaction is again available for allocation).

Bitmaps are statically allocated.  The buffer can be *journaled* again,
though, sure.  But it's never deallocated, nor is the data "switch"ed:
the b_frozen_data and b_committed_data pointers are only ever copies of
b_data, b_data never changes.

>  If we
> reschedule on that cond_resched_lock() and in the meantime someone
> allocates the buffer, we later fail with that assertion failure.

I'm not sure what's supposed to be special about bitmap buffers in this
context.  The only difference in this loop should be that block bitmaps
are the only kind of buffer which can have b_committed_data set.  This
has implications for the lifetime of the b_frozen_data copy, but other
than that, not for the lifetime of the bh/jh itself.

>   Quick and easy fix to the problem is to just remove that
> cond_resched_lock().

I'm still not seeing the fault scenario clearly here.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Bug in journal_commit_transaction?
  2006-04-12  1:27 ` Andrew Morton
@ 2006-04-15 21:06   ` Jan Kara
  2006-04-20  8:47     ` Jan Kara
  2006-04-20 14:28     ` Stephen C. Tweedie
  0 siblings, 2 replies; 9+ messages in thread
From: Jan Kara @ 2006-04-15 21:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, sct

  Hello,

  sorry, I was probably to terse in my previous mail. I'll try to
describe more details now.

> >   I think I have found a possible cause of assertion failure in
> > journal_commit_buffer() on line 793 (b_next_transaction == NULL) (which
> 
> you mean journal_commit_transaction().
  Yes.

> > was reported about two times as I remember). And I think the problem is
> > just a few lines below in cond_resched_lock(&journal->j_list_lock);
> >   The problem is following: we are processing t_forget list of a
> > committing transaction. On this forget list are among other buffers also
> > bitmaps and buffers that are freed by this transaction. Now if bitmap
> > buffer is processed first, we switch to new bitmap (hence the buffer
> > freed by this transaction is again available for allocation). If we
> > reschedule on that cond_resched_lock() and in the meantime someone
> > allocates the buffer, we later fail with that assertion failure.
> 
> I assume you're referring to the cond_resched_lock() at line 799.
> 
> I don't really follow your description here (it seems to be missing bits),
> but that loop drops that lock in other places - why wouldn't those places
> also be vulnerable?
  Yes, I meant that cond_resched_lock(). But you are right that the
problem is not in cond_resched_lock() itself. It is in the pure fact
that we can process bitmap buffer in t_forget list before all the
buffers transaction has freed. Explanation is below.
  When we are freeing some blocks we do journal_get_undo_access() on the
bitmap buffer. That copies current state of the bitmap into
b_committed_data. When we allocate any new blocks we check that the
block is free in both b_committed_data (old state) and in b_data (new
state). That makes sure that we do not reallocate any freed block before
transaction that frees it is committed. But unlocking j_list_lock when
processing t_forget list in journal_commit_transaction() can actually
break that assumption in some cases:
  If we commit the transaction we write the bitmap buffer into the log
and then file it into BJ_Forget list. We also write the freed data
buffer (e.g. zeroed indirect block) and file it into BJ_Forget list.
Then we start processing BJ_Forget list. First we find the bitmap
buffer. In line 748 we free the b_committed_data stored by
journal_get_undo_access(). From now on the allocation code in
ext3_test_allocatable() will no longer find b_committed_data and hence
it will assume the block is freed. *BUT* we still have the freed buffer
in t_forget list. If it happens that we reallocate the block before
we process the buffer in t_forget list, it will have b_next_transaction
set and we fail at the assertion failure in line 793.
  I hope it is clearer now.

								Honza

-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Bug in journal_commit_transaction?
  2006-04-12  3:13 ` Stephen C. Tweedie
@ 2006-04-15 21:08   ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2006-04-15 21:08 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-fsdevel, akpm

  Hello,

  Thanks for the answer.

> >   The problem is following: we are processing t_forget list of a
> > committing transaction. On this forget list are among other buffers also
> > bitmaps and buffers that are freed by this transaction. 
> 
> Bitmaps can never be freed.  But the forget list isn't just for freed
> data: we keep things on the forget list once they are written to the
> journal, so that on final commit we know which buffers are now safe to
> be written back to the main backing store.
> 
> > Now if bitmap
> > buffer is processed first, we switch to new bitmap (hence the buffer
> > freed by this transaction is again available for allocation).
> 
> Bitmaps are statically allocated.  The buffer can be *journaled* again,
> though, sure.  But it's never deallocated, nor is the data "switch"ed:
> the b_frozen_data and b_committed_data pointers are only ever copies of
> b_data, b_data never changes.
  Yes, sorry. I was not probably very clear. I hope I was clearer in the
other mail I've sent today.


								Honza

-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Bug in journal_commit_transaction?
  2006-04-15 21:06   ` Jan Kara
@ 2006-04-20  8:47     ` Jan Kara
  2006-04-20 16:00       ` Stephen C. Tweedie
  2006-04-20 14:28     ` Stephen C. Tweedie
  1 sibling, 1 reply; 9+ messages in thread
From: Jan Kara @ 2006-04-20  8:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, sct

  Hello,

>   sorry, I was probably to terse in my previous mail. I'll try to
> describe more details now.
> 
> > >   I think I have found a possible cause of assertion failure in
> > > journal_commit_buffer() on line 793 (b_next_transaction == NULL) (which
> > 
> > you mean journal_commit_transaction().
>   Yes.
> 
> > > was reported about two times as I remember). And I think the problem is
> > > just a few lines below in cond_resched_lock(&journal->j_list_lock);
> > >   The problem is following: we are processing t_forget list of a
> > > committing transaction. On this forget list are among other buffers also
> > > bitmaps and buffers that are freed by this transaction. Now if bitmap
> > > buffer is processed first, we switch to new bitmap (hence the buffer
> > > freed by this transaction is again available for allocation). If we
> > > reschedule on that cond_resched_lock() and in the meantime someone
> > > allocates the buffer, we later fail with that assertion failure.
> > 
> > I assume you're referring to the cond_resched_lock() at line 799.
> > 
> > I don't really follow your description here (it seems to be missing bits),
> > but that loop drops that lock in other places - why wouldn't those places
> > also be vulnerable?
>   Yes, I meant that cond_resched_lock(). But you are right that the
> problem is not in cond_resched_lock() itself. It is in the pure fact
> that we can process bitmap buffer in t_forget list before all the
> buffers transaction has freed. Explanation is below.
>   When we are freeing some blocks we do journal_get_undo_access() on the
> bitmap buffer. That copies current state of the bitmap into
> b_committed_data. When we allocate any new blocks we check that the
> block is free in both b_committed_data (old state) and in b_data (new
> state). That makes sure that we do not reallocate any freed block before
> transaction that frees it is committed. But unlocking j_list_lock when
> processing t_forget list in journal_commit_transaction() can actually
> break that assumption in some cases:
>   If we commit the transaction we write the bitmap buffer into the log
> and then file it into BJ_Forget list. We also write the freed data
> buffer (e.g. zeroed indirect block) and file it into BJ_Forget list.
> Then we start processing BJ_Forget list. First we find the bitmap
> buffer. In line 748 we free the b_committed_data stored by
> journal_get_undo_access(). From now on the allocation code in
> ext3_test_allocatable() will no longer find b_committed_data and hence
> it will assume the block is freed. *BUT* we still have the freed buffer
> in t_forget list. If it happens that we reallocate the block before
> we process the buffer in t_forget list, it will have b_next_transaction
> set and we fail at the assertion failure in line 793.
>   I hope it is clearer now.
  I've got no answer but I still think my argument was sound ;) Below
is a patch also with a verbose comment what it fixes and how. Andrew,
can you put it in -mm, please? Currently it's only basically tested
but some guys from IBM, who are able to trigger that assertion, should
give it a good beating soon...

							Bye
								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs


Fix possible assertion failure in journal_commit_transaction() on
jh->b_next_transaction == NULL (when we are processing BJ_Forget list and
buffer is not jbddirty). !jbddirty buffers can be placed on BJ_Forget list
for example by journal_forget() or by __dispose_buffer() - generally such
buffer means that it has been freed by this transaction. Freed buffers should
not be reallocated until the transaction has committed (that's why we have
the assertion there) but they *can* be reallocated when the transaction
has already been committed to disk and we are just processing the BJ_Forget
list (as soon as we remove b_committed_data from the bitmap bh, ext3 will
be able to reallocate buffers freed by the committing transaction). So
we have to also count with the case that the buffer has been reallocated and
b_next_transaction has been already set.
  And one more subtle point: it can happen that we manage to reallocate the
buffer and also mark it jbddirty. Then we also add the freed buffer to
the checkpoint list of the committing trasaction. But that should do no harm.

Signed-off-by: Jan Kara <jack@suse.cz>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.16/fs/jbd/commit.c linux-2.6.16-1-realloc_freed_fix/fs/jbd/commit.c
--- linux-2.6.16/fs/jbd/commit.c	2006-01-15 00:20:12.000000000 +0100
+++ linux-2.6.16-1-realloc_freed_fix/fs/jbd/commit.c	2006-04-20 10:32:15.000000000 +0200
@@ -790,11 +790,22 @@ restart_loop:
 			jbd_unlock_bh_state(bh);
 		} else {
 			J_ASSERT_BH(bh, !buffer_dirty(bh));
-			J_ASSERT_JH(jh, jh->b_next_transaction == NULL);
-			__journal_unfile_buffer(jh);
-			jbd_unlock_bh_state(bh);
-			journal_remove_journal_head(bh);  /* needs a brelse */
-			release_buffer_page(bh);
+			/* The buffer on BJ_Forget list and not jbddirty means
+			 * it has been freed by this transaction and hence it
+			 * could not have been reallocated until this
+			 * transaction has committed. *BUT* it could be
+			 * reallocated once we have written all the data to
+			 * disk and before we process the buffer on BJ_Forget
+			 * list. */
+			JBUFFER_TRACE(jh, "refile or unfile freed buffer");
+			__journal_refile_buffer(jh);
+			if (!jh->b_transaction) {
+				jbd_unlock_bh_state(bh);
+				 /* needs a brelse */
+				journal_remove_journal_head(bh);
+				release_buffer_page(bh);
+			} else
+				jbd_unlock_bh_state(bh);
 		}
 		cond_resched_lock(&journal->j_list_lock);
 	}

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Bug in journal_commit_transaction?
  2006-04-15 21:06   ` Jan Kara
  2006-04-20  8:47     ` Jan Kara
@ 2006-04-20 14:28     ` Stephen C. Tweedie
  1 sibling, 0 replies; 9+ messages in thread
From: Stephen C. Tweedie @ 2006-04-20 14:28 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andrew Morton, linux-fsdevel, Stephen Tweedie

Hi,

On Sat, 2006-04-15 at 23:06 +0200, Jan Kara wrote:

>   When we are freeing some blocks we do journal_get_undo_access() on the
> bitmap buffer. That copies current state of the bitmap into
> b_committed_data. When we allocate any new blocks we check that the
> block is free in both b_committed_data (old state) and in b_data (new
> state). That makes sure that we do not reallocate any freed block before
> transaction that frees it is committed. 

Right; it's a critical data integrity guarantee over crashes.

> But unlocking j_list_lock when
> processing t_forget list in journal_commit_transaction() can actually
> break that assumption in some cases:

Yes and no --- the main goal of the b_committed_data copy is still
preserved in this case, because by this point we've written the commit
block so it's actually OK to reallocate the buffers freed in that
transaction now.  As far as recovery is concerned, we *have* committed
by this point.  It's just that we haven't finished the in-memory cleanup
of buffers belonging to the committed transaction.

So we never break the primary assumption that the b_committed_data copy
is there to protect.  But we might well confuse other parts of the
internal error-checking logic.

>   If we commit the transaction we write the bitmap buffer into the log
> and then file it into BJ_Forget list. We also write the freed data
> buffer (e.g. zeroed indirect block) and file it into BJ_Forget list.
> Then we start processing BJ_Forget list. First we find the bitmap
> buffer. In line 748 we free the b_committed_data stored by
> journal_get_undo_access(). From now on the allocation code in
> ext3_test_allocatable() will no longer find b_committed_data and hence
> it will assume the block is freed. *BUT* we still have the freed buffer
> in t_forget list. 

Right.......

> If it happens that we reallocate the block before
> we process the buffer in t_forget list, it will have b_next_transaction
> set and we fail at the assertion failure in line 793.
>   I hope it is clearer now.

Much clearer, thanks: it wasn't clear in your first description that you
were talking about an interaction between *two* different buffers on the
forget list.  This sounds like a perfectly plausible explanation.

--Stephen

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Bug in journal_commit_transaction?
  2006-04-20  8:47     ` Jan Kara
@ 2006-04-20 16:00       ` Stephen C. Tweedie
  2006-04-20 17:51         ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Stephen C. Tweedie @ 2006-04-20 16:00 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andrew Morton, linux-fsdevel, Stephen Tweedie

Hi,

On Thu, 2006-04-20 at 10:47 +0200, Jan Kara wrote:

>   I've got no answer but I still think my argument was sound ;) Below
> is a patch also with a verbose comment what it fixes and how.

Looks good to me, but it complicates the logic for this case and removes
some useful debug checks for conditions which are almost always true.

Wouldn't it be better and safer just to make those almost-always true
conditions always true?  We can do that pretty simply, just by
processing the t_forget list in two passes, always doing the bitmaps
last.  (Though this will need care, as the list is potentially moving
under our feet...)

Actually, scratch that, because journal_unmap_buffer() can return a
buffer to the committing transaction's forget list at any time, so we
really cannot guarantee to do all non-bitmaps first: new non-bitmaps
might arrive after we've started doing the bitmaps.  Ugh.

> diff -rupX /home/jack/.kerndiffexclude linux-2.6.16/fs/jbd/commit.c linux-2.6.16-1-realloc_freed_fix/fs/jbd/commit.c
> --- linux-2.6.16/fs/jbd/commit.c	2006-01-15 00:20:12.000000000 +0100
> +++ linux-2.6.16-1-realloc_freed_fix/fs/jbd/commit.c	2006-04-20 10:32:15.000000000 +0200
> @@ -790,11 +790,22 @@ restart_loop:
> +			JBUFFER_TRACE(jh, "refile or unfile freed buffer");
> +			__journal_refile_buffer(jh);
> +			if (!jh->b_transaction) {
> +				jbd_unlock_bh_state(bh);
> +				 /* needs a brelse */
> +				journal_remove_journal_head(bh);
> +				release_buffer_page(bh);
> +			} else
> +				jbd_unlock_bh_state(bh);

Looks good to me.  ACK.

--Stephen

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Bug in journal_commit_transaction?
  2006-04-20 16:00       ` Stephen C. Tweedie
@ 2006-04-20 17:51         ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2006-04-20 17:51 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Andrew Morton, linux-fsdevel

  Hello,

> On Thu, 2006-04-20 at 10:47 +0200, Jan Kara wrote:
> 
> >   I've got no answer but I still think my argument was sound ;) Below
> > is a patch also with a verbose comment what it fixes and how.
> 
> Looks good to me, but it complicates the logic for this case and removes
> some useful debug checks for conditions which are almost always true.
> 
> Wouldn't it be better and safer just to make those almost-always true
> conditions always true?  We can do that pretty simply, just by
> processing the t_forget list in two passes, always doing the bitmaps
> last.  (Though this will need care, as the list is potentially moving
> under our feet...)
> 
> Actually, scratch that, because journal_unmap_buffer() can return a
> buffer to the committing transaction's forget list at any time, so we
> really cannot guarantee to do all non-bitmaps first: new non-bitmaps
> might arrive after we've started doing the bitmaps.  Ugh.
  Thanks for looking into it.  Yes, I agree that keeping that safety
check would be nicer. But I was thinking for some time if we cannot
somehow make sure that bitmaps are processed last but I didn't come up
with anything useful. One possibility might be to do some magic in
do_get_write_access() and get_create_access() if we find out we are
reallocating buffer on forget list but I'm afraid that simple solutions
(e.g. wait for commit) would harm performance which might be even
worse...

								Honza

-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2006-04-20 17:51 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-04-11 13:35 [RFC] Bug in journal_commit_transaction? Jan Kara
2006-04-12  1:27 ` Andrew Morton
2006-04-15 21:06   ` Jan Kara
2006-04-20  8:47     ` Jan Kara
2006-04-20 16:00       ` Stephen C. Tweedie
2006-04-20 17:51         ` Jan Kara
2006-04-20 14:28     ` Stephen C. Tweedie
2006-04-12  3:13 ` Stephen C. Tweedie
2006-04-15 21:08   ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).