Re: topics for the file system mini-summit

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Andreas Dilger <adilger@clusterfs.com>
To: "Stephen C. Tweedie" <sct@redhat.com>
Cc: Ric Wheeler <ric@emc.com>, Matthew Wilcox <matthew@wil.cx>,
	linux-fsdevel@vger.kernel.org
Subject: Re: topics for the file system mini-summit
Date: Wed, 7 Jun 2006 12:55:23 -0600	[thread overview]
Message-ID: <20060607185523.GE5964@schatzie.adilger.int> (raw)
In-Reply-To: <1149675046.5520.13.camel@sisko.sctweedie.blueyonder.co.uk>

On Jun 07, 2006  11:10 +0100, Stephen C. Tweedie wrote:
> On Mon, 2006-05-29 at 10:09 -0600, Andreas Dilger wrote:
> > This is one thing that we have been thinking of for ext3.  Instead of a
> > filesystem-wide "error" bit we could move this per-group to only mark
> > the block or inode bitmaps in error if they have a checksum failure.
> > This would prevent allocations from that group to avoid further potential
> > corruption of the filesystem metadata.
> 
> Trouble is, individual files can span multiple groups easily.  And one
> of the common failure modes is failure in the indirect tree.  What
> action do you take if you detect that?

Return an IO error for that part of the file?  We already refuse to
free file blocks that overlap with filesystem metadata, but have no
way to know whether the rest of the blocks are valid or not.

> There is fundamentally a large difference between the class of errors
> that can arise due to EIO --- simple loss of a block of data --- and
> those which can arise from actual corrupt data/metadata.  If we detect
> the latter and attempt to soldier on regardless, then we have no idea
> what inconsistencies we are allowing to be propagated through the
> filesystem.  

Recall that one of the other goals is to add checksumming to the extent
tree metadata (if it isn't already covered by the inode checksum).  Even
today, the fact that the extent format has a magic allows some types of
corruption to be detected.  The structure is also somewhat verifiable 
(e.g. logical extent offsets are increasing, logical_offset + length is
non-overlapping with next logical offset, etc) even without checksums.

The proposed ext3_extent_tail would also contain an inode+generation
back-reference and the checksum would depend on the physical block
location so if one extent index block were incorrectly written in the
place of another, or the higher-level reference were corrupted this
would also be detectable.

        struct ext3_extent_tail {
		__u64   et_inum;
		__u32   et_igeneration;
		__u32   et_checksum;
	}

> That can easily end up corrupting files far from the actual error.  Say
> an indirect block is corrupted; we delete that file, and end up freeing
> a block belonging to some other file on a distant block group.  Ooops.
> Once that other block gets reallocated and overwritten, we have
> corrupted that other file.

Oh, I totally agree with that, which is another reason why I've proposed
the "block mapped extent" several times.  It would be referenced from
an extent index block or inode, would start with an extent header to
verify that this is at least semi-plausible block pointers, and can
optionally have an ext3_extent_tail to validate the block data itself.

The block-mapped extent is useful for fragmented files or files with
lots of small holes in them.  Concievably it would also be possible
to quickly remap old block-mapped (indirect tree) files to bm-extent
files if this was desirable.

> *That* is why taking the fs down/readonly on failure is the safe option.

And wait 17 years for e2fsck to complete?  While I agree it is the
safest option, sometimes it is necessary to just block off parts of the
filesystem from writes and soldier on until the system can be taken down
safely.

> The inclusion of checksums would certainly allow us to harden things.
> In the above scenario, failure of the checksum test would allow us to
> discard corrupt indirect blocks before we could allow any harm to come
> to other disk blocks.  But that only works for cases where the checksum
> notices the problem; if we're talking about possible OS bugs, memory
> corruption etc. then it is quite possible to get corruption in the in-
> memory copy, which gets properly checksummed and written to disk, so you
> can't rely on that catching all cases.

I agree, we can't ever handle everything unless we get checksums from the
top of linux to the bottom (maybe stored in the page table?), but we can
at least do the best we can.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

next prev parent reply	other threads:[~2006-06-07 18:55 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-05-25 21:44 topics for the file system mini-summit Ric Wheeler
2006-05-26 16:48 ` Andreas Dilger
2006-05-27  0:49   ` Ric Wheeler
2006-05-27 14:18     ` Andreas Dilger
2006-05-28  1:44       ` Ric Wheeler
2006-05-29  0:11 ` Matthew Wilcox
2006-05-29  2:07   ` Ric Wheeler
2006-05-29 16:09     ` Andreas Dilger
2006-05-29 19:29       ` Ric Wheeler
2006-05-30  6:14         ` Andreas Dilger
2006-06-07 10:10       ` Stephen C. Tweedie
2006-06-07 14:03         ` Andi Kleen
2006-06-07 18:55         ` Andreas Dilger [this message]
2006-06-01  2:19 ` Valerie Henson
2006-06-01  2:42   ` Matthew Wilcox
2006-06-01  3:24     ` Valerie Henson
2006-06-01 12:45       ` Matthew Wilcox
2006-06-01 12:53         ` Arjan van de Ven
2006-06-01 20:06         ` Russell Cattelan
2006-06-02 11:27         ` Nathan Scott
2006-06-01  5:36   ` Andreas Dilger
2006-06-03 13:50   ` Ric Wheeler
2006-06-03 14:13     ` Arjan van de Ven
2006-06-03 15:07       ` Ric Wheeler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20060607185523.GE5964@schatzie.adilger.int \
    --to=adilger@clusterfs.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=matthew@wil.cx \
    --cc=ric@emc.com \
    --cc=sct@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).