* data corrupting bug in 2.4.20 ext3, data=journal
@ 2002-12-01 8:11 Andrew Morton
2002-12-01 8:52 ` Andrew Morton
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Andrew Morton @ 2002-12-01 8:11 UTC (permalink / raw)
To: lkml, ext3-users@redhat.com
In 2.4.20-pre5 an optimisation was made to the ext3 fsync function
which can very easily cause file data corruption at unmount time. This
was first reported by Nick Piggin on November 29th (one day after 2.4.20 was
released, and three months after the bug was merged. Unfortunate timing)
This only affects filesystems which were mounted with the `data=journal'
option. Or files which are operating under `chattr -j'. So most people
are unaffected. The problem is not present in 2.5 kernels.
The symptoms are that any file data which was written within the thirty
seconds prior to the unmount may not make it to disk. A workaround is
to run `sync' before unmounting.
The optimisation was intended to avoid writing out and waiting on the
inode's buffers when the subsequent commit would do that anyway. This
optimisation was applied to both data=journal and data=ordered modes.
But it is only valid for data=ordered mode.
In data=journal mode the data is left dirty in memory and the unmount
will silently discard it.
The fix is to only apply the optimisation to inodes which are operating
under data=ordered.
--- linux-akpm/fs/ext3/fsync.c~ext3-fsync-fix Sat Nov 30 23:37:33 2002
+++ linux-akpm-akpm/fs/ext3/fsync.c Sat Nov 30 23:39:30 2002
@@ -63,10 +63,12 @@ int ext3_sync_file(struct file * file, s
*/
ret = fsync_inode_buffers(inode);
- /* In writeback mode, we need to force out data buffers too. In
- * the other modes, ext3_force_commit takes care of forcing out
- * just the right data blocks. */
- if (test_opt(inode->i_sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)
+ /*
+ * If the inode is under ordered-data writeback it is not necessary to
+ * sync its data buffers here - commit will do that, with potentially
+ * better IO merging
+ */
+ if (!ext3_should_order_data(inode))
ret |= fsync_inode_data_buffers(inode);
ext3_force_commit(inode->i_sb);
_
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: data corrupting bug in 2.4.20 ext3, data=journal 2002-12-01 8:11 data corrupting bug in 2.4.20 ext3, data=journal Andrew Morton @ 2002-12-01 8:52 ` Andrew Morton 2002-12-01 12:41 ` Nick Piggin 2002-12-02 8:26 ` Nick Piggin 2 siblings, 0 replies; 8+ messages in thread From: Andrew Morton @ 2002-12-01 8:52 UTC (permalink / raw) To: lkml, ext3-users@redhat.com Andrew Morton wrote: > > ... > The fix is to only apply the optimisation to inodes which are operating > under data=ordered. > That "fix" didn't fix it. Sorry about that. Please avoid ext3/data=journal until it is sorted out. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: data corrupting bug in 2.4.20 ext3, data=journal 2002-12-01 8:11 data corrupting bug in 2.4.20 ext3, data=journal Andrew Morton 2002-12-01 8:52 ` Andrew Morton @ 2002-12-01 12:41 ` Nick Piggin 2002-12-02 7:17 ` Andrew Morton 2002-12-02 8:26 ` Nick Piggin 2 siblings, 1 reply; 8+ messages in thread From: Nick Piggin @ 2002-12-01 12:41 UTC (permalink / raw) To: Andrew Morton; +Cc: lkml, ext3-users@redhat.com Andrew Morton wrote: >In 2.4.20-pre5 an optimisation was made to the ext3 fsync function >which can very easily cause file data corruption at unmount time. This >was first reported by Nick Piggin on November 29th (one day after 2.4.20 was >released, and three months after the bug was merged. Unfortunate timing) > In fact it was reported on lkml on 18th July IIRC before 2.4.19 was released if that is any help to you. 2.4.19 and 2.4.20 are affected and I haven't tested previous releases. I was going to re-report it sometime, but Alan brought it to light just the other day. Nick ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: data corrupting bug in 2.4.20 ext3, data=journal 2002-12-01 12:41 ` Nick Piggin @ 2002-12-02 7:17 ` Andrew Morton 2002-12-02 12:15 ` Stephen C. Tweedie 0 siblings, 1 reply; 8+ messages in thread From: Andrew Morton @ 2002-12-02 7:17 UTC (permalink / raw) To: Nick Piggin; +Cc: lkml, ext3-users@redhat.com Nick Piggin wrote: > > Andrew Morton wrote: > > >In 2.4.20-pre5 an optimisation was made to the ext3 fsync function > >which can very easily cause file data corruption at unmount time. This > >was first reported by Nick Piggin on November 29th (one day after 2.4.20 was > >released, and three months after the bug was merged. Unfortunate timing) > > > In fact it was reported on lkml on 18th July IIRC before 2.4.19 was > released if that is any help to you. 2.4.19 and 2.4.20 are affected > and I haven't tested previous releases. I was going to re-report it > sometime, but Alan brought it to light just the other day. > Are you sure? I can't make it happen on 2.4.19. And disabling the new BH_Freed logic (which went into 2.4.20-pre5) makes it go away. --- linux-akpm/fs/jbd/commit.c~a Sun Dec 1 23:10:12 2002 +++ linux-akpm-akpm/fs/jbd/commit.c Sun Dec 1 23:10:27 2002 @@ -695,7 +695,7 @@ skip_commit: /* The journal should be un * use in a different page. */ if (__buffer_state(bh, Freed)) { clear_bit(BH_Freed, &bh->b_state); - clear_bit(BH_JBDDirty, &bh->b_state); +// clear_bit(BH_JBDDirty, &bh->b_state); } if (buffer_jdirty(bh)) { _ ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: data corrupting bug in 2.4.20 ext3, data=journal 2002-12-02 7:17 ` Andrew Morton @ 2002-12-02 12:15 ` Stephen C. Tweedie 2002-12-02 15:37 ` Stephen C. Tweedie 2002-12-02 17:45 ` Andrew Morton 0 siblings, 2 replies; 8+ messages in thread From: Stephen C. Tweedie @ 2002-12-02 12:15 UTC (permalink / raw) To: Andrew Morton; +Cc: Nick Piggin, lkml, ext3 users list On Mon, 2002-12-02 at 07:17, Andrew Morton wrote: > Are you sure? I can't make it happen on 2.4.19. And disabling the new > BH_Freed logic (which went into 2.4.20-pre5) makes it go away. > > --- linux-akpm/fs/jbd/commit.c~a Sun Dec 1 23:10:12 2002 > +++ linux-akpm-akpm/fs/jbd/commit.c Sun Dec 1 23:10:27 2002 > @@ -695,7 +695,7 @@ skip_commit: /* The journal should be un > - clear_bit(BH_JBDDirty, &bh->b_state); > +// clear_bit(BH_JBDDirty, &bh->b_state); Argh. That's not the right fix --- it reintroduces the bug that BH_Freed was introduced to solve in the first place. The problem is that ext3 is expecting that truncate_inode_pages() (and hence ext3_flushpage) is only called during a truncate. That's what the function is named for, after all, and it's the hint we need to indicate that future writeback on the data we're discarding should be disabled (so that we don't get old data written on top of new data should the block get deallocated.) But kill_supers() eventually calls truncate_inode_pages() too when we're doing the invalidate_inodes(). And ext3 is reacting just the way it would for a normal truncate --- the data still gets written to the journal (correct, if we reboot before the truncate commits then the old data is preserved in the journal) but is not queued for writeback. The solution is to set BH_Freed in ext3_flushpage IFF we're being called from the truncate, but to avoid it if we're in an umount. I'm not sure of the best way to do that right now, but there are some trivial but hacky methods possible (eg. see if we're in a nested transaction; if so, it's a truncate, if not, it's a umount.) MS_ACTIVE might be a possible flag to test, but I'll need to double-check whether that is 100% safe --- we can't afford to skip the BH_Freed setting if we're in a truncate and the filesystem is not yet completely quiesced. Cheers, Stephen ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: data corrupting bug in 2.4.20 ext3, data=journal 2002-12-02 12:15 ` Stephen C. Tweedie @ 2002-12-02 15:37 ` Stephen C. Tweedie 2002-12-02 17:45 ` Andrew Morton 1 sibling, 0 replies; 8+ messages in thread From: Stephen C. Tweedie @ 2002-12-02 15:37 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Andrew Morton, Nick Piggin, lkml, ext3 users list [-- Attachment #1: Type: text/plain, Size: 1722 bytes --] On Mon, 2002-12-02 at 12:15, Stephen C. Tweedie wrote: > The problem is that ext3 is expecting that truncate_inode_pages() (and > hence ext3_flushpage) is only called during a truncate. That's what the > function is named for, after all, and it's the hint we need to indicate > that future writeback on the data we're discarding should be disabled > (so that we don't get old data written on top of new data should the > block get deallocated.) > > But kill_supers() eventually calls truncate_inode_pages() too when we're > doing the invalidate_inodes(). But we've already called fsync_super() at this point. If that completes synchronously, then the buffers will already be out of the journal and queued for writeback when we get here, and clearing BH_JBDDirty won't have any dire consequences. Indeed, loading the ext3 module with "do_sync_supers=1" cures the symptoms. However, doing every superblock write synchronously is a non-starter, as that requires a journal commit in the ext3 universe, and so this would essentially force bdflush to serialise all of its filesystem flushes (which is a real problem once you've got a significant number of filesystems mounted.) But the VFS simply doesn't have any clean way of telling foo_write_super() whether the call needs to be completed synchronously or asynchronously. There *is* a totally unclean way, though. kill_super() sets sb->s_root to NULL before calling its final fsync_super(), and ext3 can easily check that in ext3_write_super(). It's a nasty, messy dependency, but should work for 2.4 at least. For 2.5 we probably want to look at passing an async flag down into the write_super. The attached patch seems to fix things for me. Cheers, Stephen [-- Attachment #2: Type: text/plain, Size: 598 bytes --] --- linux-2.4-ext3merge/fs/ext3/super.c.=K0027=.orig 2002-12-02 15:35:13.000000000 +0000 +++ linux-2.4-ext3merge/fs/ext3/super.c 2002-12-02 15:35:14.000000000 +0000 @@ -1640,7 +1640,12 @@ sb->s_dirt = 0; target = log_start_commit(EXT3_SB(sb)->s_journal, NULL); - if (do_sync_supers) { + /* + * Tricky --- if we are unmounting, the write really does need + * to be synchronous. We can detect that by looking for NULL in + * sb->s_root. + */ + if (do_sync_supers || !sb->s_root) { unlock_super(sb); log_wait_commit(EXT3_SB(sb)->s_journal, target); lock_super(sb); ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: data corrupting bug in 2.4.20 ext3, data=journal 2002-12-02 12:15 ` Stephen C. Tweedie 2002-12-02 15:37 ` Stephen C. Tweedie @ 2002-12-02 17:45 ` Andrew Morton 1 sibling, 0 replies; 8+ messages in thread From: Andrew Morton @ 2002-12-02 17:45 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Nick Piggin, lkml, ext3 users list "Stephen C. Tweedie" wrote: > > ... > The problem is that ext3 is expecting that truncate_inode_pages() (and > hence ext3_flushpage) is only called during a truncate. That's OK, because there shouldn't be any dirty data attached to the inodes at that time. But there is, because the commit which write_super() started hasn't finished yet: static int do_sync_supers = 0; ... target = log_start_commit(EXT3_SB(sb)->s_journal, NULL); if (do_sync_supers) { unlock_super(sb); log_wait_commit(EXT3_SB(sb)->s_journal, target); We need to do a full sync at unmount. I assume that in other journalling modes the buffers are dirty anyway, so sync_buffers() gets them. And indeed enabling do_sync_supers fixes it up in both 2.4 and 2.5 (2.5 doesn't have sync_buffers(), but sync_inodes_sb() gets everything). But are we sure that ->write_super() will always be called? int fsync_super(struct super_block *sb) { ... if (sb->s_dirt && sb->s_op && sb->s_op->write_super) sb->s_op->write_super(sb); I suspect that if s_dirt is not set, and we have dirty inodes, write_super is not called and nothing will force a commit anywhere in the unmount process. Which could explain the similar failure in 2.4.19-rc1 which Nick reports. We need to be able to distinguish between a periodic flush of the superblock and a real sync. The write_super overload has always been uncomfortable. Google for "write_super is not for syncing" ;) I think Chris's patch is the way to fix all this up. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: data corrupting bug in 2.4.20 ext3, data=journal 2002-12-01 8:11 data corrupting bug in 2.4.20 ext3, data=journal Andrew Morton 2002-12-01 8:52 ` Andrew Morton 2002-12-01 12:41 ` Nick Piggin @ 2002-12-02 8:26 ` Nick Piggin 2 siblings, 0 replies; 8+ messages in thread From: Nick Piggin @ 2002-12-02 8:26 UTC (permalink / raw) To: Andrew Morton; +Cc: lkml, ext3-users@redhat.com Andrew Morton wrote: >Nick Piggin wrote: >> >> Andrew Morton wrote: >> >> >In 2.4.20-pre5 an optimisation was made to the ext3 fsync function >> >which can very easily cause file data corruption at unmount time. This >> >was first reported by Nick Piggin on November 29th (one day after 2.4.20 was >> >released, and three months after the bug was merged. Unfortunate timing) >> > >> In fact it was reported on lkml on 18th July IIRC before 2.4.19 was >> released if that is any help to you. 2.4.19 and 2.4.20 are affected >> and I haven't tested previous releases. I was going to re-report it >> sometime, but Alan brought it to light just the other day. >> > >Are you sure? I can't make it happen on 2.4.19. And disabling the new >BH_Freed logic (which went into 2.4.20-pre5) makes it go away. > > >--- linux-akpm/fs/jbd/commit.c~a Sun Dec 1 23:10:12 2002 >+++ linux-akpm-akpm/fs/jbd/commit.c Sun Dec 1 23:10:27 2002 >@@ -695,7 +695,7 @@ skip_commit: /* The journal should be un > * use in a different page. */ > if (__buffer_state(bh, Freed)) { > clear_bit(BH_Freed, &bh->b_state); >- clear_bit(BH_JBDDirty, &bh->b_state); >+// clear_bit(BH_JBDDirty, &bh->b_state); > } > > if (buffer_jdirty(bh)) { > I reported the bug for 2.4.19-rc1 and 2 but I can't remember if I tested 2.4.19 when it was released. It has an external journal on a seperate disk. I can't really do any testing with the machine unfortunately. Regards, Nick ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2002-12-02 17:37 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2002-12-01 8:11 data corrupting bug in 2.4.20 ext3, data=journal Andrew Morton 2002-12-01 8:52 ` Andrew Morton 2002-12-01 12:41 ` Nick Piggin 2002-12-02 7:17 ` Andrew Morton 2002-12-02 12:15 ` Stephen C. Tweedie 2002-12-02 15:37 ` Stephen C. Tweedie 2002-12-02 17:45 ` Andrew Morton 2002-12-02 8:26 ` Nick Piggin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox