From: Theodore Tso <tytso@mit.edu>
To: Valerie Henson <val_henson@linux.intel.com>
Cc: linux-fsdevel@vger.kernel.org, Can Sar <csar@stanford.edu>,
Junfeng Yang <junfeng@gmail.com>,
Dawson Engler <engler@csl.stanford.edu>
Subject: Re: Fix(es) for ext2 fsync bug
Date: Thu, 15 Feb 2007 09:20:21 -0500
Message-ID: <20070215142020.GA9930@thunk.org>
In-Reply-To: <20070214195453.GB7521@nifty>
On Wed, Feb 14, 2007 at 11:54:54AM -0800, Valerie Henson wrote:
> Background: The eXplode file system checker found a bug in ext2 fsync
> behavior. Do the following: truncate file A, create file B which
> reallocates one of A's old indirect blocks, fsync file B. If you then
> crash before file A's metadata is all written out, fsck will complete
> the truncate for file A... thereby deleting file B's data. So fsync
> file B doesn't guarantee data is on disk after a crash. Details:
It's actually not the case that fsck will complete the truncate for
file A. The problem is that while e2fsck is processing indirect
blocks in pass 1, the block which is marked as file A's indirect block
(but which actually contains file B's data) gets "fixed" when e2fsck
sees block numbers which look like illegal block numbers. So this
ends up corrupting file B's data.
This is actually a legal end result, BTW, since POSIX states that the
result of fsync() is undefined if the system crashes. Technically
fsync() did actually guarantee that file B's data is "on disk"; the
problem is that e2fsck would corrupt the data afterwards. Ironically,
fsync()'ing file B actually makes it more likely that it might get
corrupted afterwards, since normally filesystem metadata gets sync'ed
out on 5 second intervals, while data gets sync'ed out at 30 second
intervals.
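To make the pass 1 behavior described above concrete, here is a minimal
sketch in C. It is not the e2fsprogs code: the function names, the
range check, and the choice to clear a bad entry to zero are all
assumptions made for illustration, and on-disk byte order is ignored.
It shows what happens when a block holding file B's data is scanned as
if it were an array of block numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-entry check applied to an indirect
 * block. Zero means "hole"; anything else must lie inside the filesystem. */
static int block_is_legal(uint32_t blk, uint32_t first_data_block,
                          uint32_t blocks_count)
{
    return blk == 0 || (blk >= first_data_block && blk < blocks_count);
}

/* Treat `entries` as an indirect block and clear every entry that is
 * out of range. Returns how many entries were "fixed". If the block
 * really holds file data, every cleared entry is four bytes of that
 * data overwritten with zeros. */
static size_t sanitize_indirect_block(uint32_t *entries, size_t n,
                                      uint32_t first_data_block,
                                      uint32_t blocks_count)
{
    size_t fixed = 0;

    assert(entries != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!block_is_legal(entries[i], first_data_block, blocks_count)) {
            entries[i] = 0;
            fixed++;
        }
    }
    return fixed;
}
```

Ordinary file contents, read four bytes at a time, are very likely to
exceed the block count of the filesystem, so most of file B's block
would be zeroed by a pass like this one.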
> * Rearrange order of duplicate block checking and fixing file size in
> fsck. Not sure how hard this is. (Ted?)
It's not a matter of changing when we deal with fixing the file size,
as described above. At fsck time, we would need to keep backup
copies of any indirect blocks that get modified for whatever reason,
and then in pass 1D, when we clone a block that has been claimed by
multiple inodes, the inodes which claim the block as a data block
should get a copy of the block before it was modified by e2fsck.
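A rough sketch of that bookkeeping, in C, might look like the
following. Everything here is hypothetical (the fixed-size table, the
1024-byte block size, and the names backup_before_modify and
clone_source); e2fsck has no such interface today. The idea is only
that the first modification of a block saves its pristine contents, and
that the clone step prefers the saved copy over what is currently in
the block.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE  1024   /* assumed block size for the sketch */
#define MAX_BACKUPS 16     /* arbitrary small limit */

struct block_backup {
    uint32_t blk;
    unsigned char data[BLOCK_SIZE];
};

struct backup_store {
    struct block_backup slots[MAX_BACKUPS];
    size_t used;
};

/* Return the saved pristine contents of blk, or NULL if none was saved. */
static const unsigned char *backup_lookup(const struct backup_store *s,
                                          uint32_t blk)
{
    for (size_t i = 0; i < s->used; i++)
        if (s->slots[i].blk == blk)
            return s->slots[i].data;
    return NULL;
}

/* Call before modifying an indirect block. Only the first call for a
 * given block saves anything, so the stored copy is always the
 * contents as found on disk. Returns 0 on success, -1 if the table
 * is full. */
static int backup_before_modify(struct backup_store *s, uint32_t blk,
                                const unsigned char *data)
{
    assert(s != NULL && data != NULL);
    if (backup_lookup(s, blk) != NULL)
        return 0;
    if (s->used == MAX_BACKUPS)
        return -1;
    s->slots[s->used].blk = blk;
    memcpy(s->slots[s->used].data, data, BLOCK_SIZE);
    s->used++;
    return 0;
}

/* Contents to hand to an inode that claims blk as a data block when
 * the block is cloned: the pristine copy if one exists, otherwise the
 * block as it currently stands. */
static const unsigned char *clone_source(const struct backup_store *s,
                                         uint32_t blk,
                                         const unsigned char *current)
{
    const unsigned char *saved = backup_lookup(s, blk);
    return saved != NULL ? saved : current;
}
```

A real implementation would also need to bound memory use on badly
damaged filesystems, which this sketch sidesteps with a fixed table.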
> * Keep a set of "still allocated on disk" block bitmaps that gets
> flushed whenever a sync happens. Don't allocate these blocks.
> Journaling file systems already have to do this.
A list would be more efficient, as others have pointed out. That
would work, although the hard part is knowing when entries could be
removed from the list. The machinery for knowing when metadata has been updated
isn't present in ext2, and that's a fair amount of complexity. You
could clear the list/bitmap after the 5 second metadata flush command
has been kicked off, or if you associate a data block with the inode
that previously owned it, you could clear the entry when that inode's
dirty bit has been cleared, but that doesn't completely get rid of the
race unless you tie it to when the write has completed (and this
assumes write barriers to make sure the block was actually flushed to
the media).
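As an illustration of the list variant, here is a small C sketch. It
is not ext2 code: the structure, the fixed capacity, the bool array
standing in for the in-memory block bitmap, and all function names are
assumptions. It shows the three pieces discussed above: recording a
block when it is freed in memory, refusing to allocate it while it is
on the list, and emptying the list once the metadata is believed to be
on disk. It does nothing about the race mentioned above, since it has
no way to know that the write actually completed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 64   /* arbitrary capacity for the sketch */

/* Blocks freed in memory that may still be referenced on disk. */
struct pending_frees {
    uint32_t blk[MAX_PENDING];
    size_t count;
};

/* Record a freed block. Returns false when the list is full, in which
 * case a caller would have to flush metadata before continuing. */
static bool pending_add(struct pending_frees *p, uint32_t blk)
{
    assert(p != NULL);
    if (p->count == MAX_PENDING)
        return false;
    p->blk[p->count++] = blk;
    return true;
}

static bool pending_contains(const struct pending_frees *p, uint32_t blk)
{
    for (size_t i = 0; i < p->count; i++)
        if (p->blk[i] == blk)
            return true;
    return false;
}

/* Return the first block in [first, last) that is free in the
 * in-memory map and not on the pending list, or 0 if there is none. */
static uint32_t alloc_block(const bool *in_use, uint32_t first,
                            uint32_t last, const struct pending_frees *p)
{
    for (uint32_t b = first; b < last; b++)
        if (!in_use[b] && !pending_contains(p, b))
            return b;
    return 0;
}

/* Call once the metadata flush is considered complete. */
static void pending_clear_after_flush(struct pending_frees *p)
{
    p->count = 0;
}
```

With this in place, a block released by truncating file A cannot be
handed to file B until pending_clear_after_flush() has run, which is
the property the proposal is after.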
Another very heavyweight approach would be to simply force a full sync
of the filesystem whenever fsync() is called. Not pretty, and without
the proper write ordering, the race is still potentially there.
I'd say that the best way to handle this is in fsck, but quite frankly
it's a relatively low priority "bug" to handle, since a much simpler
workaround is to tell people to use ext3 instead.
Regards,
- Ted