linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] fix for consistency errors after crash
@ 2010-06-23 19:27 Amir G.
  2010-07-06 13:00 ` Amir G.
  0 siblings, 1 reply; 5+ messages in thread
From: Amir G. @ 2010-06-23 19:27 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Aneesh Kumar K.V, Ext4 Developers List

Sorry, posting the patch again with a better title.
This bug needs to be fixed also in Ext3.
just need to move the ext3_forget() line down to ext3_free_blocks() line.

Amir.

On Wed, Jun 23, 2010 at 10:01 PM, Amir G.
<amir73il@users.sourceforge.net> wrote:
> Hi,
>
> We have experienced bitmap inconsistencies after crash during file
> delete under heavy load.
> The crash is not file system related and I the following patch in
> ext4_free_branches() fixes the recovery problem.
>
> If the transaction is restarted and there is a crash before the new
> transaction is committed,
> then after recovery, the blocks that this indirect block points to
> have been freed, but the indirect block itself
> has not been freed and may still point to some of the free blocks
> (because of the ext4_forget()).
>
> So ext4_forget() should be called inside ext4_free_blocks() to avoid
> this problem.
> Are there any consequences to this patch that I am not aware of?
>
> Amir.
>
> Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 42272d6..682e2fa 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4458,6 +4458,7 @@ static void ext4_free_branches(handle_t *handle,
> struct inode *inode,
>  {
>        ext4_fsblk_t nr;
>        __le32 *p;
> +       int     flags;
>
>        if (ext4_handle_is_aborted(handle))
>                return;
> @@ -4520,7 +4521,7 @@ static void ext4_free_branches(handle_t *handle,
> struct inode *inode,
>                         * revoke records must be emitted *before* clearing
>                         * this block's bit in the bitmaps.
>                         */
> -                       ext4_forget(handle, 1, inode, bh, bh->b_blocknr);
> +                       flags =
> EXT4_FREE_BLOCKS_METADATA|EXT4_FREE_BLOCKS_FORGET;
>
>                        /*
>                         * Everything below this this pointer has been
> @@ -4546,8 +4547,7 @@ static void ext4_free_branches(handle_t *handle,
> struct inode *inode,
>                                            blocks_for_truncate(inode));
>                        }
>
> -                       ext4_free_blocks(handle, inode, 0, nr, 1,
> -                                        EXT4_FREE_BLOCKS_METADATA);
> +                       ext4_free_blocks(handle, inode, 0, nr, 1, flags);
>
>                        if (parent_bh) {
>                                /*
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] fix for consistency errors after crash
  2010-06-23 19:27 [PATCH] fix for consistency errors after crash Amir G.
@ 2010-07-06 13:00 ` Amir G.
  2010-07-06 16:00   ` Eric Sandeen
                     ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Amir G. @ 2010-07-06 13:00 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Ric Wheeler, Ext4 Developers List

Hi Eric,

I've seen you guys had some open RH bugs on ext3, who all share in
common the "bit already free" error.

This bug I reported can explain many different problems in ext[34].

Essentially, every time there is a kernel crash (or hard reboot)
during delete/truncate of a large file,
it may result in "bit already clear" error after reboot.

The problem is very simple and so is the fix.
I proved the problem with 100% recreation chances using a small patch,
instead of running statistical stress tests.
All I did was to add a print and 10 seconds delay after transaction
restart in ext3_free_branches and reboot > 5 seconds after the
transaction restarts, so that kjournald will have time to commit the
old transaction.
After the reboot, I always get "bit already clear" errors, because the
"half large truncate" transaction is not handled properly.

I did not get any response from ext4 guys so far and since this bug
dates back to ext3,
I was hoping you guys could take a look and put your weight on pushing
the fix upstream.

Thanks,
Amir.

On Wed, Jun 23, 2010 at 9:27 PM, Amir G. <amir73il@users.sourceforge.net> wrote:
> Sorry, posting the patch again with a better title.
> This bug needs to be fixed also in Ext3.
> just need to move the ext3_forget() line down to ext3_free_blocks() line.
>
> Amir.
>
> On Wed, Jun 23, 2010 at 10:01 PM, Amir G.
> <amir73il@users.sourceforge.net> wrote:
>> Hi,
>>
>> We have experienced bitmap inconsistencies after crash during file
>> delete under heavy load.
>> The crash is not file system related and I the following patch in
>> ext4_free_branches() fixes the recovery problem.
>>
>> If the transaction is restarted and there is a crash before the new
>> transaction is committed,
>> then after recovery, the blocks that this indirect block points to
>> have been freed, but the indirect block itself
>> has not been freed and may still point to some of the free blocks
>> (because of the ext4_forget()).
>>
>> So ext4_forget() should be called inside ext4_free_blocks() to avoid
>> this problem.
>> Are there any consequences to this patch that I am not aware of?
>>
>> Amir.
>>
>> Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 42272d6..682e2fa 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4458,6 +4458,7 @@ static void ext4_free_branches(handle_t *handle,
>> struct inode *inode,
>>  {
>>        ext4_fsblk_t nr;
>>        __le32 *p;
>> +       int     flags;
>>
>>        if (ext4_handle_is_aborted(handle))
>>                return;
>> @@ -4520,7 +4521,7 @@ static void ext4_free_branches(handle_t *handle,
>> struct inode *inode,
>>                         * revoke records must be emitted *before* clearing
>>                         * this block's bit in the bitmaps.
>>                         */
>> -                       ext4_forget(handle, 1, inode, bh, bh->b_blocknr);
>> +                       flags =
>> EXT4_FREE_BLOCKS_METADATA|EXT4_FREE_BLOCKS_FORGET;
>>
>>                        /*
>>                         * Everything below this this pointer has been
>> @@ -4546,8 +4547,7 @@ static void ext4_free_branches(handle_t *handle,
>> struct inode *inode,
>>                                            blocks_for_truncate(inode));
>>                        }
>>
>> -                       ext4_free_blocks(handle, inode, 0, nr, 1,
>> -                                        EXT4_FREE_BLOCKS_METADATA);
>> +                       ext4_free_blocks(handle, inode, 0, nr, 1, flags);
>>
>>                        if (parent_bh) {
>>                                /*
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] fix for consistency errors after crash
  2010-07-06 13:00 ` Amir G.
@ 2010-07-06 16:00   ` Eric Sandeen
  2010-07-12 19:37   ` Jan Kara
  2010-07-23 13:29   ` Ted Ts'o
  2 siblings, 0 replies; 5+ messages in thread
From: Eric Sandeen @ 2010-07-06 16:00 UTC (permalink / raw)
  To: Amir G.; +Cc: Ric Wheeler, Ext4 Developers List

Amir G. wrote:
> Hi Eric,
> 
> I've seen you guys had some open RH bugs on ext3, who all share in
> common the "bit already free" error.
> 
> This bug I reported can explain many different problems in ext[34].
> 
> Essentially, every time there is a kernel crash (or hard reboot)
> during delete/truncate of a large file,
> it may result in "bit already clear" error after reboot.
> 
> The problem is very simple and so is the fix.
> I proved the problem with 100% recreation chances using a small patch,
> instead of running statistical stress tests.
> All I did was to add a print and 10 seconds delay after transaction
> restart in ext3_free_branches and reboot > 5 seconds after the
> transaction restarts, so that kjournald will have time to commit the
> old transaction.
> After the reboot, I always get "bit already clear" errors, because the
> "half large truncate" transaction is not handled properly.
> 
> I did not get any response from ext4 guys so far and since this bug
> dates back to ext3,
> I was hoping you guys could take a look and put your weight on pushing
> the fix upstream.

Hi Amir, I really do appreciate the effort, the patch, and the ping.  :)

I'll have to set aside some time to give it a hard look, but linking
it back to existing bugs of mine raises that priority, thanks.  :)

-Eric

> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] fix for consistency errors after crash
  2010-07-06 13:00 ` Amir G.
  2010-07-06 16:00   ` Eric Sandeen
@ 2010-07-12 19:37   ` Jan Kara
  2010-07-23 13:29   ` Ted Ts'o
  2 siblings, 0 replies; 5+ messages in thread
From: Jan Kara @ 2010-07-12 19:37 UTC (permalink / raw)
  To: Amir G.; +Cc: Eric Sandeen, Ric Wheeler, Ext4 Developers List

[-- Attachment #1: Type: text/plain, Size: 1377 bytes --]

  Hi,

> I've seen you guys had some open RH bugs on ext3, who all share in
> common the "bit already free" error.
> 
> This bug I reported can explain many different problems in ext[34].
> 
> Essentially, every time there is a kernel crash (or hard reboot)
> during delete/truncate of a large file,
> it may result in "bit already clear" error after reboot.
> 
> The problem is very simple and so is the fix.
> I proved the problem with 100% recreation chances using a small patch,
> instead of running statistical stress tests.
> All I did was to add a print and 10 seconds delay after transaction
> restart in ext3_free_branches and reboot > 5 seconds after the
> transaction restarts, so that kjournald will have time to commit the
> old transaction.
> After the reboot, I always get "bit already clear" errors, because the
> "half large truncate" transaction is not handled properly.
> 
> I did not get any response from ext4 guys so far and since this bug
> dates back to ext3,
> I was hoping you guys could take a look and put your weight on pushing
> the fix upstream.
  Thanks for a ping. Your analysis and the fix looks correct to me. Attached is
a fix of the problem for ext3 which I'll merge if noone objects. BTW: I've also
updated a comment a bit so you might want to include than in an ext4 patch as
well.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

[-- Attachment #2: 0001-ext3-Avoid-filesystem-corruption-after-a-crash-under.patch --]
[-- Type: text/x-diff, Size: 3409 bytes --]

>From 6c91cac7b34f60bf89cb58a9fe753918628d22c4 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Mon, 12 Jul 2010 21:04:31 +0200
Subject: [PATCH] ext3: Avoid filesystem corruption after a crash under heavy delete load

It can happen that ext3_free_branches calls ext3_forget() for an indirect
block in an earlier transaction than ext3_free_blocks() for that block.
Thus if we crash before a transaction with ext3_free_blocks() is committed,
we will see indirect block pointing to already freed blocks and complain
during orphan list cleanup.

The fix is simple: Make sure ext3_forget() is called in the same transaction
as ext3_free_blocks().

This is a backport of an ext4 fix by Amir G. <amir73il@users.sourceforge.net>

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/inode.c |   46 +++++++++++++++++++++++++---------------------
 1 files changed, 25 insertions(+), 21 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index a786db4..436e5bb 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -2270,27 +2270,6 @@ static void ext3_free_branches(handle_t *handle, struct inode *inode,
 					   depth);
 
 			/*
-			 * We've probably journalled the indirect block several
-			 * times during the truncate.  But it's no longer
-			 * needed and we now drop it from the transaction via
-			 * journal_revoke().
-			 *
-			 * That's easy if it's exclusively part of this
-			 * transaction.  But if it's part of the committing
-			 * transaction then journal_forget() will simply
-			 * brelse() it.  That means that if the underlying
-			 * block is reallocated in ext3_get_block(),
-			 * unmap_underlying_metadata() will find this block
-			 * and will try to get rid of it.  damn, damn.
-			 *
-			 * If this block has already been committed to the
-			 * journal, a revoke record will be written.  And
-			 * revoke records must be emitted *before* clearing
-			 * this block's bit in the bitmaps.
-			 */
-			ext3_forget(handle, 1, inode, bh, bh->b_blocknr);
-
-			/*
 			 * Everything below this this pointer has been
 			 * released.  Now let this top-of-subtree go.
 			 *
@@ -2313,6 +2292,31 @@ static void ext3_free_branches(handle_t *handle, struct inode *inode,
 				truncate_restart_transaction(handle, inode);
 			}
 
+			/*
+			 * We've probably journalled the indirect block several
+			 * times during the truncate.  But it's no longer
+			 * needed and we now drop it from the transaction via
+			 * journal_revoke().
+			 *
+			 * That's easy if it's exclusively part of this
+			 * transaction.  But if it's part of the committing
+			 * transaction then journal_forget() will simply
+			 * brelse() it.  That means that if the underlying
+			 * block is reallocated in ext3_get_block(),
+			 * unmap_underlying_metadata() will find this block
+			 * and will try to get rid of it.  damn, damn. Thus
+			 * we don't allow a block to be reallocated until
+			 * a transaction freeing it has fully committed.
+			 *
+			 * We also have to make sure journal replay after a
+			 * crash does not overwrite non-journaled data blocks
+			 * with old metadata when the block got reallocated for
+			 * data.  Thus we have to store a revoke record for a
+			 * block in the same transaction in which we free the
+			 * block.
+			 */
+			ext3_forget(handle, 1, inode, bh, bh->b_blocknr);
+
 			ext3_free_blocks(handle, inode, nr, 1);
 
 			if (parent_bh) {
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] fix for consistency errors after crash
  2010-07-06 13:00 ` Amir G.
  2010-07-06 16:00   ` Eric Sandeen
  2010-07-12 19:37   ` Jan Kara
@ 2010-07-23 13:29   ` Ted Ts'o
  2 siblings, 0 replies; 5+ messages in thread
From: Ted Ts'o @ 2010-07-23 13:29 UTC (permalink / raw)
  To: Amir G.; +Cc: Eric Sandeen, Ric Wheeler, Ext4 Developers List

This is the version of the patch which I ultimately will be including
into the ext4 tree.  I removed the flag variable and cleaned up the
comment to make the code a bit clearer.

						- Ted

>From ee8973f1ff39e44d6f74cd8d456584fed4411663 Mon Sep 17 00:00:00 2001
From: Amir G <amir73il@users.sourceforge.net>
Date: Fri, 23 Jul 2010 09:27:04 -0400
Subject: [PATCH] ext4: Fix block bitmap inconsistencies after a crash when deleting files

We have experienced bitmap inconsistencies after crash during file
delete under heavy load.  The crash is not file system related and I
the following patch in ext4_free_branches() fixes the recovery
problem.

If the transaction is restarted and there is a crash before the new
transaction is committed, then after recovery, the blocks that this
indirect block points to have been freed, but the indirect block
itself has not been freed and may still point to some of the free
blocks (because of the ext4_forget()).

So ext4_forget() should be called inside ext4_free_blocks() to avoid
this problem.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/inode.c |   35 +++++++++++++----------------------
 1 files changed, 13 insertions(+), 22 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 755ba86..699d1d0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4490,27 +4490,6 @@ static void ext4_free_branches(handle_t *handle, struct inode *inode,
 					depth);
 
 			/*
-			 * We've probably journalled the indirect block several
-			 * times during the truncate.  But it's no longer
-			 * needed and we now drop it from the transaction via
-			 * jbd2_journal_revoke().
-			 *
-			 * That's easy if it's exclusively part of this
-			 * transaction.  But if it's part of the committing
-			 * transaction then jbd2_journal_forget() will simply
-			 * brelse() it.  That means that if the underlying
-			 * block is reallocated in ext4_get_block(),
-			 * unmap_underlying_metadata() will find this block
-			 * and will try to get rid of it.  damn, damn.
-			 *
-			 * If this block has already been committed to the
-			 * journal, a revoke record will be written.  And
-			 * revoke records must be emitted *before* clearing
-			 * this block's bit in the bitmaps.
-			 */
-			ext4_forget(handle, 1, inode, bh, bh->b_blocknr);
-
-			/*
 			 * Everything below this this pointer has been
 			 * released.  Now let this top-of-subtree go.
 			 *
@@ -4534,8 +4513,20 @@ static void ext4_free_branches(handle_t *handle, struct inode *inode,
 					    blocks_for_truncate(inode));
 			}
 
+			/*
+			 * The forget flag here is critical because if
+			 * we are journaling (and not doing data
+			 * journaling), we have to make sure a revoke
+			 * record is written to prevent the journal
+			 * replay from overwriting the (former)
+			 * indirect block if it gets reallocated as a
+			 * data block.  This must happen in the same
+			 * transaction where the data blocks are
+			 * actually freed.
+			 */
 			ext4_free_blocks(handle, inode, 0, nr, 1,
-					 EXT4_FREE_BLOCKS_METADATA);
+					 EXT4_FREE_BLOCKS_METADATA|
+					 EXT4_FREE_BLOCKS_FORGET);
 
 			if (parent_bh) {
 				/*
-- 
1.7.0.4


^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2010-07-23 13:29 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-06-23 19:27 [PATCH] fix for consistency errors after crash Amir G.
2010-07-06 13:00 ` Amir G.
2010-07-06 16:00   ` Eric Sandeen
2010-07-12 19:37   ` Jan Kara
2010-07-23 13:29   ` Ted Ts'o

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).