data corrupting bug in 2.4.20 ext3, data=journal

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* data corrupting bug in 2.4.20 ext3, data=journal
@ 2002-12-01  8:11 Andrew Morton
  2002-12-01  8:52 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Andrew Morton @ 2002-12-01  8:11 UTC (permalink / raw)
  To: lkml, ext3-users@redhat.com

In 2.4.20-pre5 an optimisation was made to the ext3 fsync function
which can very easily cause file data corruption at unmount time.  This
was first reported by Nick Piggin on November 29th (one day after 2.4.20 was
released, and three months after the bug was merged.  Unfortunate timing)

This only affects filesystems which were mounted with the `data=journal'
option.  Or files which are operating under `chattr -j'.  So most people
are unaffected.  The problem is not present in 2.5 kernels.

The symptoms are that any file data which was written within the thirty
seconds prior to the unmount may not make it to disk.   A workaround is
to run `sync' before unmounting.

The optimisation was intended to avoid writing out and waiting on the
inode's buffers when the subsequent commit would do that anyway. This
optimisation was applied to both data=journal and data=ordered modes.
But it is only valid for data=ordered mode.

In data=journal mode the data is left dirty in memory and the unmount
will silently discard it.

The fix is to only apply the optimisation to inodes which are operating
under data=ordered.

--- linux-akpm/fs/ext3/fsync.c~ext3-fsync-fix	Sat Nov 30 23:37:33 2002
+++ linux-akpm-akpm/fs/ext3/fsync.c	Sat Nov 30 23:39:30 2002
@@ -63,10 +63,12 @@ int ext3_sync_file(struct file * file, s
 	 */
 	ret = fsync_inode_buffers(inode);

-	/* In writeback mode, we need to force out data buffers too.  In
-	 * the other modes, ext3_force_commit takes care of forcing out
-	 * just the right data blocks. */
-	if (test_opt(inode->i_sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)
+	/*
+	 * If the inode is under ordered-data writeback it is not necessary to
+	 * sync its data buffers here - commit will do that, with potentially
+	 * better IO merging
+	 */
+	if (!ext3_should_order_data(inode))
 		ret |= fsync_inode_data_buffers(inode);

 	ext3_force_commit(inode->i_sb);

_

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: data corrupting bug in 2.4.20 ext3, data=journal
  2002-12-01  8:11 data corrupting bug in 2.4.20 ext3, data=journal Andrew Morton
@ 2002-12-01  8:52 ` Andrew Morton
  2002-12-01 12:41 ` Nick Piggin
  2002-12-02  8:26 ` Nick Piggin
  2 siblings, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2002-12-01  8:52 UTC (permalink / raw)
  To: lkml, ext3-users@redhat.com

Andrew Morton wrote:
> 
> ...
> The fix is to only apply the optimisation to inodes which are operating
> under data=ordered.
> 

That "fix" didn't fix it.  Sorry about that.

Please avoid ext3/data=journal until it is sorted out.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: data corrupting bug in 2.4.20 ext3, data=journal
  2002-12-01  8:11 data corrupting bug in 2.4.20 ext3, data=journal Andrew Morton
  2002-12-01  8:52 ` Andrew Morton
@ 2002-12-01 12:41 ` Nick Piggin
  2002-12-02  7:17   ` Andrew Morton
  2002-12-02  8:26 ` Nick Piggin
  2 siblings, 1 reply; 8+ messages in thread
From: Nick Piggin @ 2002-12-01 12:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, ext3-users@redhat.com

Andrew Morton wrote:

>In 2.4.20-pre5 an optimisation was made to the ext3 fsync function
>which can very easily cause file data corruption at unmount time.  This
>was first reported by Nick Piggin on November 29th (one day after 2.4.20 was
>released, and three months after the bug was merged.  Unfortunate timing)
>
In fact it was reported on lkml on 18th July IIRC before 2.4.19 was
released if that is any help to you. 2.4.19 and 2.4.20 are affected
and I haven't tested previous releases. I was going to re-report it
sometime, but Alan brought it to light just the other day.

Nick


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: data corrupting bug in 2.4.20 ext3, data=journal
  2002-12-01 12:41 ` Nick Piggin
@ 2002-12-02  7:17   ` Andrew Morton
  2002-12-02 12:15     ` Stephen C. Tweedie
  0 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2002-12-02  7:17 UTC (permalink / raw)
  To: Nick Piggin; +Cc: lkml, ext3-users@redhat.com

Nick Piggin wrote:
> 
> Andrew Morton wrote:
> 
> >In 2.4.20-pre5 an optimisation was made to the ext3 fsync function
> >which can very easily cause file data corruption at unmount time.  This
> >was first reported by Nick Piggin on November 29th (one day after 2.4.20 was
> >released, and three months after the bug was merged.  Unfortunate timing)
> >
> In fact it was reported on lkml on 18th July IIRC before 2.4.19 was
> released if that is any help to you. 2.4.19 and 2.4.20 are affected
> and I haven't tested previous releases. I was going to re-report it
> sometime, but Alan brought it to light just the other day.
> 

Are you sure?  I can't make it happen on 2.4.19.  And disabling the new
BH_Freed logic (which went into 2.4.20-pre5) makes it go away.


--- linux-akpm/fs/jbd/commit.c~a	Sun Dec  1 23:10:12 2002
+++ linux-akpm-akpm/fs/jbd/commit.c	Sun Dec  1 23:10:27 2002
@@ -695,7 +695,7 @@ skip_commit: /* The journal should be un
 		 * use in a different page. */
 		if (__buffer_state(bh, Freed)) {
 			clear_bit(BH_Freed, &bh->b_state);
-			clear_bit(BH_JBDDirty, &bh->b_state);
+//			clear_bit(BH_JBDDirty, &bh->b_state);
 		}
 			
 		if (buffer_jdirty(bh)) {

_

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: data corrupting bug in 2.4.20 ext3, data=journal
  2002-12-02  7:17   ` Andrew Morton
@ 2002-12-02 12:15     ` Stephen C. Tweedie
  2002-12-02 15:37       ` Stephen C. Tweedie
  2002-12-02 17:45       ` Andrew Morton
  0 siblings, 2 replies; 8+ messages in thread
From: Stephen C. Tweedie @ 2002-12-02 12:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, lkml, ext3 users list

On Mon, 2002-12-02 at 07:17, Andrew Morton wrote:

> Are you sure?  I can't make it happen on 2.4.19.  And disabling the new
> BH_Freed logic (which went into 2.4.20-pre5) makes it go away.
> 
> --- linux-akpm/fs/jbd/commit.c~a	Sun Dec  1 23:10:12 2002
> +++ linux-akpm-akpm/fs/jbd/commit.c	Sun Dec  1 23:10:27 2002
> @@ -695,7 +695,7 @@ skip_commit: /* The journal should be un
> -			clear_bit(BH_JBDDirty, &bh->b_state);
> +//			clear_bit(BH_JBDDirty, &bh->b_state);

Argh.

That's not the right fix --- it reintroduces the bug that BH_Freed was
introduced to solve in the first place. 

The problem is that ext3 is expecting that truncate_inode_pages() (and
hence ext3_flushpage) is only called during a truncate.  That's what the
function is named for, after all, and it's the hint we need to indicate
that future writeback on the data we're discarding should be disabled
(so that we don't get old data written on top of new data should the
block get deallocated.)

But kill_supers() eventually calls truncate_inode_pages() too when we're
doing the invalidate_inodes().  And ext3 is reacting just the way it
would for a normal truncate --- the data still gets written to the
journal (correct, if we reboot before the truncate commits then the old
data is preserved in the journal) but is not queued for writeback.

The solution is to set BH_Freed in ext3_flushpage IFF we're being called
from the truncate, but to avoid it if we're in an umount.  I'm not sure
of the best way to do that right now, but there are some trivial but
hacky methods possible (eg. see if we're in a nested transaction; if so,
it's a truncate, if not, it's a umount.)  MS_ACTIVE might be a possible
flag to test, but I'll need to double-check whether that is 100% safe
--- we can't afford to skip the BH_Freed setting if we're in a truncate
and the filesystem is not yet completely quiesced.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: data corrupting bug in 2.4.20 ext3, data=journal
  2002-12-02 12:15     ` Stephen C. Tweedie
@ 2002-12-02 15:37       ` Stephen C. Tweedie
  2002-12-02 17:45       ` Andrew Morton
  1 sibling, 0 replies; 8+ messages in thread
From: Stephen C. Tweedie @ 2002-12-02 15:37 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Andrew Morton, Nick Piggin, lkml, ext3 users list

[-- Attachment #1: Type: text/plain, Size: 1722 bytes --]

On Mon, 2002-12-02 at 12:15, Stephen C. Tweedie wrote:

> The problem is that ext3 is expecting that truncate_inode_pages() (and
> hence ext3_flushpage) is only called during a truncate.  That's what the
> function is named for, after all, and it's the hint we need to indicate
> that future writeback on the data we're discarding should be disabled
> (so that we don't get old data written on top of new data should the
> block get deallocated.)
> 
> But kill_supers() eventually calls truncate_inode_pages() too when we're
> doing the invalidate_inodes().

But we've already called fsync_super() at this point.  If that completes
synchronously, then the buffers will already be out of the journal and
queued for writeback when we get here, and clearing BH_JBDDirty won't
have any dire consequences.

Indeed, loading the ext3 module with "do_sync_supers=1" cures the
symptoms.

However, doing every superblock write synchronously is a non-starter, as
that requires a journal commit in the ext3 universe, and so this would
essentially force bdflush to serialise all of its filesystem flushes
(which is a real problem once you've got a significant number of
filesystems mounted.)  But the VFS simply doesn't have any clean way of
telling foo_write_super() whether the call needs to be completed
synchronously or asynchronously.

There *is* a totally unclean way, though.  kill_super() sets sb->s_root
to NULL before calling its final fsync_super(), and ext3 can easily
check that in ext3_write_super().  It's a nasty, messy dependency, but
should work for 2.4 at least.  For 2.5 we probably want to look at
passing an async flag down into the write_super.

The attached patch seems to fix things for me.

Cheers,
 Stephen

[-- Attachment #2: Type: text/plain, Size: 598 bytes --]

--- linux-2.4-ext3merge/fs/ext3/super.c.=K0027=.orig	2002-12-02 15:35:13.000000000 +0000
+++ linux-2.4-ext3merge/fs/ext3/super.c	2002-12-02 15:35:14.000000000 +0000
@@ -1640,7 +1640,12 @@
 	sb->s_dirt = 0;
 	target = log_start_commit(EXT3_SB(sb)->s_journal, NULL);

-	if (do_sync_supers) {
+	/*
+	 * Tricky --- if we are unmounting, the write really does need
+	 * to be synchronous.  We can detect that by looking for NULL in
+	 * sb->s_root.
+	 */
+	if (do_sync_supers || !sb->s_root) {
 		unlock_super(sb);
 		log_wait_commit(EXT3_SB(sb)->s_journal, target);
 		lock_super(sb);

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: data corrupting bug in 2.4.20 ext3, data=journal
  2002-12-02 12:15     ` Stephen C. Tweedie
  2002-12-02 15:37       ` Stephen C. Tweedie
@ 2002-12-02 17:45       ` Andrew Morton
  1 sibling, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2002-12-02 17:45 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Nick Piggin, lkml, ext3 users list

"Stephen C. Tweedie" wrote:
> 
> ...
> The problem is that ext3 is expecting that truncate_inode_pages() (and
> hence ext3_flushpage) is only called during a truncate.

That's OK, because there shouldn't be any dirty data attached to the
inodes at that time.  But there is, because the commit which write_super()
started hasn't finished yet:

static int do_sync_supers = 0;
...
        target = log_start_commit(EXT3_SB(sb)->s_journal, NULL);

        if (do_sync_supers) {
                unlock_super(sb);
                log_wait_commit(EXT3_SB(sb)->s_journal, target);

We need to do a full sync at unmount.

I assume that in other journalling modes the buffers are dirty
anyway, so sync_buffers() gets them.

And indeed enabling do_sync_supers fixes it up in both 2.4 and 2.5 (2.5
doesn't have sync_buffers(), but sync_inodes_sb() gets everything).

But are we sure that ->write_super() will always be called?

int fsync_super(struct super_block *sb)
{
 	...
        if (sb->s_dirt && sb->s_op && sb->s_op->write_super)
                sb->s_op->write_super(sb);

I suspect that if s_dirt is not set, and we have dirty inodes,
write_super is not called and nothing will force a commit anywhere
in the unmount process.  Which could explain the similar failure
in 2.4.19-rc1 which Nick reports.

We need to be able to distinguish between a periodic flush of the
superblock and a real sync.  The write_super overload has always
been uncomfortable.

Google for "write_super is not for syncing" ;)  I think Chris's
patch is the way to fix all this up.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: data corrupting bug in 2.4.20 ext3, data=journal
  2002-12-01  8:11 data corrupting bug in 2.4.20 ext3, data=journal Andrew Morton
  2002-12-01  8:52 ` Andrew Morton
  2002-12-01 12:41 ` Nick Piggin
@ 2002-12-02  8:26 ` Nick Piggin
  2 siblings, 0 replies; 8+ messages in thread
From: Nick Piggin @ 2002-12-02  8:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, ext3-users@redhat.com

Andrew Morton wrote:

>Nick Piggin wrote:
>> 
>> Andrew Morton wrote:
>> 
>> >In 2.4.20-pre5 an optimisation was made to the ext3 fsync function
>> >which can very easily cause file data corruption at unmount time.  This
>> >was first reported by Nick Piggin on November 29th (one day after 2.4.20 was
>> >released, and three months after the bug was merged.  Unfortunate timing)
>> >
>> In fact it was reported on lkml on 18th July IIRC before 2.4.19 was
>> released if that is any help to you. 2.4.19 and 2.4.20 are affected
>> and I haven't tested previous releases. I was going to re-report it
>> sometime, but Alan brought it to light just the other day.
>> 
>
>Are you sure?  I can't make it happen on 2.4.19.  And disabling the new
>BH_Freed logic (which went into 2.4.20-pre5) makes it go away.
>
>
>--- linux-akpm/fs/jbd/commit.c~a	Sun Dec  1 23:10:12 2002
>+++ linux-akpm-akpm/fs/jbd/commit.c	Sun Dec  1 23:10:27 2002
>@@ -695,7 +695,7 @@ skip_commit: /* The journal should be un
> 		 * use in a different page. */
> 		if (__buffer_state(bh, Freed)) {
> 			clear_bit(BH_Freed, &bh->b_state);
>-			clear_bit(BH_JBDDirty, &bh->b_state);
>+//			clear_bit(BH_JBDDirty, &bh->b_state);
> 		}
> 			
> 		if (buffer_jdirty(bh)) {
>
I reported the bug for 2.4.19-rc1 and 2 but I can't remember if I tested 
2.4.19
when it was released. It has an external journal on a seperate disk. I can't
really do any testing with the machine unfortunately.

Regards,
Nick


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2002-12-02 17:37 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-12-01  8:11 data corrupting bug in 2.4.20 ext3, data=journal Andrew Morton
2002-12-01  8:52 ` Andrew Morton
2002-12-01 12:41 ` Nick Piggin
2002-12-02  7:17   ` Andrew Morton
2002-12-02 12:15     ` Stephen C. Tweedie
2002-12-02 15:37       ` Stephen C. Tweedie
2002-12-02 17:45       ` Andrew Morton
2002-12-02  8:26 ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox