Re: topics for the file system mini-summit

linux-fsdevel.vger.kernel.org archive mirror

From: Andreas Dilger <adilger@clusterfs.com>
To: Ric Wheeler <ric@emc.com>
Cc: Matthew Wilcox <matthew@wil.cx>, linux-fsdevel@vger.kernel.org
Subject: Re: topics for the file system mini-summit
Date: Tue, 30 May 2006 00:14:29 -0600
Message-ID: <20060530061429.GE5964@schatzie.adilger.int>
In-Reply-To: <447B4B87.7040403@emc.com>

On May 29, 2006  15:29 -0400, Ric Wheeler wrote:
> Andreas Dilger wrote:
> >Instead of a filesystem-wide "error" bit we could move this per-group to
> >only mark the block or inode bitmaps in error if they have a checksum
> >failure.  This would prevent allocations from that group to avoid further
> >potential corruption of the filesystem metadata.
> >
> >Once an error is detected then a filesystem service thread or a userspace
> >helper would walk the inode table (starting in the current group, which
> >is most likely to hold the relevant data) recreating the respective bitmap
> >table and keeping a "valid bit" bitmap as well.  Once all of the bits
> >in the bitmap are marked valid then we can start using this group again.
>
> That is a neat idea - would you lose complete access to the impacted 
> group, or have you thought about "best effort" read-only while under repair?

I think we would only need to prevent new allocation from the group if the
bitmap is corrupted.  The extent format already has a magic number to give
a very quick sanity check (unlike indirect blocks which can be filled with
random garbage on large filesystems and still appear valid).  We are looking
at adding checksums in the extent metadata and could also do extra internal
consistency checks to validate this metadata (e.g. sequential ordering of
logical offsets, non-overlapping logical offsets, proper parent->child
logical offset hierarchy, etc).
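
As an illustration of the internal consistency checks listed above, here is a
minimal sketch in Python.  It is not ext3/ext4 code: the Extent type, the
check_node() name, and the [lo, hi) parent range are all hypothetical, and a
real implementation would work on the on-disk extent headers instead.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Extent:
    logical: int  # first logical block covered by this extent
    length: int   # number of blocks covered


def check_node(extents: List[Extent], lo: int, hi: int) -> List[str]:
    """Return a list of problems found in one node's extents.

    [lo, hi) is the logical range that the parent index entry claims
    this node covers (an assumption of this sketch).
    """
    problems = []
    prev_end = None
    for i, ext in enumerate(extents):
        if ext.length <= 0:
            problems.append(f"extent {i}: non-positive length")
            continue
        end = ext.logical + ext.length
        # Parent->child hierarchy: the child must stay inside the parent range.
        if ext.logical < lo or end > hi:
            problems.append(f"extent {i}: outside parent range")
        # Sequential ordering and non-overlap against everything seen so far.
        if prev_end is not None and ext.logical < prev_end:
            problems.append(f"extent {i}: out of order or overlaps a previous extent")
        prev_end = end if prev_end is None else max(prev_end, end)
    return problems
```

A node that passes returns an empty list; any entry in the list would be a
reason to distrust the extent metadata before freeing blocks based on it.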

So, we are mostly safe from the "incorrect block free" side, and just need
to worry about the "block is marked free in a corrupt bitmap but is really in
use, so don't reallocate it" problem.
Allowing unlinks in a group also allows the "valid" bitmap to be updated
when the bits are cleared, so this is beneficial to the end goal of getting
an all-valid block bitmap.  We could even get more fancy and allow blocks
marked valid to be used for allocations, but that is more complex than I like.
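
The rebuild scheme discussed above can be modelled with a toy example.  This is
a sketch under stated assumptions, not a proposed implementation: the
GroupRebuild class and its method names are hypothetical, and finish_scan()
assumes that any block never referenced during a complete inode table walk can
be treated as free.

```python
class GroupRebuild:
    """Toy model of one block group whose on-disk bitmap failed its checksum."""

    def __init__(self, nblocks: int):
        self.in_use = [False] * nblocks  # block bitmap being recreated
        self.valid = [False] * nblocks   # True once the matching bit is trusted

    def mark_in_use(self, block: int) -> None:
        # The inode table walker found a reference to this block.
        self.in_use[block] = True
        self.valid[block] = True

    def unlink(self, block: int) -> None:
        # Frees stay allowed during repair; clearing a bit also makes it valid.
        self.in_use[block] = False
        self.valid[block] = True

    def finish_scan(self) -> None:
        # Assumption: after a full walk, unreferenced blocks are free.
        for b, ok in enumerate(self.valid):
            if not ok:
                self.in_use[b] = False
                self.valid[b] = True

    def can_allocate(self) -> bool:
        # The simple policy: no allocation until every bit is valid.
        return all(self.valid)
```

The "more fancy" variant would replace can_allocate() with a per-block test
that permits allocation of blocks that are both valid and free.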

> One thing that has worked very well for us is that we keep a digital 
> signature of each user object (MD5, SHAX hash, etc) so we can validate 
> that what we wrote is what got read back.  This also provides a very 
> powerful sanity check after getting hit by failing media or severe file 
> system corruption since whatever we do manage to salvage (which might 
> not be all files) can be validated.

Yes, we've looked at this also for Lustre (we can already do checksums
from the client memory down to the server disk), but the problem of
consistency in the face of write/truncate/append and a crash is complex.
There's also the issue of whether to do partial-file checksums (in order
to allow more efficient updates) or full-file checksums.
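
To make the partial-file versus full-file trade-off concrete, here is a small
sketch.  The 4096-byte chunk size, the use of CRC32, and the function names are
assumptions chosen for illustration; it covers only an in-place overwrite that
does not change the file size, and it does not address the crash consistency
problem mentioned above.

```python
import zlib
from typing import List

CHUNK = 4096  # hypothetical chunk size


def chunk_checksums(data: bytes, chunk: int = CHUNK) -> List[int]:
    """One CRC32 per fixed-size chunk of the file contents."""
    return [zlib.crc32(data[off:off + chunk]) for off in range(0, len(data), chunk)]


def update_after_overwrite(sums: List[int], data: bytes, offset: int,
                           length: int, chunk: int = CHUNK) -> List[int]:
    """Recompute only the chunks touched by an in-place overwrite.

    `data` is the file content after the write; `length` must be at
    least 1 and the write must lie within the existing file.
    """
    first = offset // chunk
    last = (offset + length - 1) // chunk
    updated = list(sums)
    for i in range(first, last + 1):
        updated[i] = zlib.crc32(data[i * chunk:(i + 1) * chunk])
    return updated
```

With a full-file checksum the same small write would require re-reading the
entire file, which is the efficiency argument for per-chunk checksums.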

I believe at one point there was work on a checksum loop device, but this
also has potential consistency problems in the face of a crash.

> For general purpose read/write work loads, I wonder if it would make 
> sense to compute and store such a checksum or signature on close (say in 
> an extended attribute)?  It might be useful to use another of those 
> special attributes (like immutable attribute) to indicate that this file 
> is important enough to digitally sign on close.

Hmm, good idea.  If a file is immutable, that makes it fairly certain it
won't be modified any time soon, so it is a good candidate for checksumming.
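
A userspace approximation of "checksum on close, stored in an extended
attribute" might look like the sketch below.  The attribute name
user.checksum.sha256 and the function names are hypothetical, SHA-256 is just
one possible choice of hash, and os.setxattr()/os.getxattr() are available on
Linux only.  The attribute would most likely need to be written before the
immutable flag is set, since immutable files normally reject attribute changes.

```python
import hashlib
import os
from typing import Iterable, Iterator

XATTR_NAME = "user.checksum.sha256"  # hypothetical attribute name


def digest_chunks(chunks: Iterable[bytes]) -> str:
    """Hex SHA-256 over a sequence of byte chunks."""
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()


def read_chunks(path: str, size: int = 1 << 20) -> Iterator[bytes]:
    """Yield the file contents in blocks of at most `size` bytes."""
    with open(path, "rb") as f:
        while True:
            block = f.read(size)
            if not block:
                break
            yield block


def sign_file(path: str) -> str:
    """Compute the digest and store it in an extended attribute."""
    value = digest_chunks(read_chunks(path))
    os.setxattr(path, XATTR_NAME, value.encode("ascii"))
    return value


def verify_file(path: str) -> bool:
    """Compare the stored digest with one computed from the current contents."""
    stored = os.getxattr(path, XATTR_NAME).decode("ascii")
    return stored == digest_chunks(read_chunks(path))
```

After a recovery, verify_file() on each salvaged file would show whether what
was read back matches what was originally written.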

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
