From: Matt Mackall <mpm@selenic.com>
To: "Jörn Engel" <joern@lazybastard.org>
Cc: linux-fsdevel@vger.kernel.org
Subject: Re: [RFC] TileFS - a proposal for scalable integrity checking
Date: Sun, 29 Apr 2007 07:57:18 -0500
Message-ID: <20070429125717.GF11115@waste.org>
In-Reply-To: <20070429122112.GA30608@lazybastard.org>

On Sun, Apr 29, 2007 at 02:21:13PM +0200, Jörn Engel wrote:
> On Sat, 28 April 2007 17:05:22 -0500, Matt Mackall wrote:
> > 
> > This is a relatively simple scheme for making a filesystem with
> > incremental online consistency checks of both data and metadata.
> > Overhead can be well under 1% disk space and CPU overhead may also be
> > very small, while greatly improving filesystem integrity.
> 
> I like it a lot.  Until now it appears to solve more problems and cause
> fewer new problems than ChunkFS.

Thanks. I think this is a somewhat more direct solution than ChunkFS,
but a) I haven't followed ChunkFS closely and b) I haven't been
thinking about fsck very long, so this is still just presented as
fodder for discussion.

> > [...]
> > 
> >  Divide disk into a bunch of tiles. For each tile, allocate a one
> >  block tile header that contains (inode, checksum) pairs for each
> >  block in the tile. Unused blocks get marked inode -1, filesystem
> >  metadata blocks -2. The first element contains a last-clean
> >  timestamp, a clean flag and a checksum for the block itself. For 4K
> >  blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB),
> >  with ~.2% overhead.
> 
> You should add a 64bit fpos field.  That allows you to easily check for
> addressing errors.

Describe the scenario where this manifests, please.

It just occurred to me that my approach is analogous to object-based
rmap on the filesystem. I think the fpos proposal makes it more like
the original per-pte rmap. That is not to say the same lessons apply,
as I'm not yet clear on what you're proposing.

Ooh.. I also just realized the tile approach allows much easier
defragging/shrinking of large filesystems because finding the
associated inode for blocks you want to move is fast.
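
A minimal sketch of that reverse lookup, assuming the layout from the
original proposal (4K blocks, one header block per tile, a 32-bit inode
and a 32-bit CRC per entry). The names owner_of and blocks_to_relocate,
the little-endian byte order, and the in-memory headers dict are
hypothetical stand-ins for reading the tile header block from disk, and
slot 0 is simplified to an ordinary unused entry:

```python
import struct

BLOCK_SIZE = 4096                 # bytes per block, as in the proposal
ENTRY = struct.Struct("<iI")      # (signed 32-bit inode, 32-bit CRC); byte order is an assumption
ENTRIES_PER_TILE = BLOCK_SIZE // ENTRY.size   # 512 entries fit in one header block
TILE_BYTES = ENTRIES_PER_TILE * BLOCK_SIZE    # 2 MiB per tile, header included
UNUSED, METADATA = -1, -2         # special inode values from the proposal


def locate(block_no):
    """Map an absolute block number to (tile index, slot within the tile)."""
    return divmod(block_no, ENTRIES_PER_TILE)


def owner_of(headers, block_no):
    """Return the inode recorded for block_no by unpacking one header entry."""
    tile, slot = locate(block_no)
    inode, _crc = ENTRY.unpack_from(headers[tile], slot * ENTRY.size)
    return inode


def blocks_to_relocate(headers, first_block, last_block):
    """List (block, inode) pairs for in-use data blocks in a range."""
    found = []
    for block_no in range(first_block, last_block + 1):
        inode = owner_of(headers, block_no)
        if inode not in (UNUSED, METADATA):
            found.append((block_no, inode))
    return found


def make_header(owners):
    """Build a header block for the demo; unlisted slots are marked unused."""
    buf = bytearray(BLOCK_SIZE)
    for slot in range(ENTRIES_PER_TILE):
        ENTRY.pack_into(buf, slot * ENTRY.size, owners.get(slot, UNUSED), 0)
    return bytes(buf)


# Two demo tiles: blocks 1-3 of tile 0 and block 5 of tile 1 are in use.
headers = {0: make_header({1: 12, 2: 12, 3: 40}), 1: make_header({5: 77})}
print(owner_of(headers, 3))
print(blocks_to_relocate(headers, 0, 1023))
```

Finding the owner of a block costs one header read and one 8-byte
unpack, with no walk over the inode tables. The same constants give the
space overhead: one header block per 512 blocks is about 0.2%.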

> >  [Note that CRCs are optional so we can cut the overhead in half. I
> >  choose CRCs here because they're capable of catching the vast
> >  majority of accidental corruptions at a small cost and mostly serve
> >  to protect against errors not caught by on-disk ECC (eg cable noise,
> >  kernel bugs, cosmic rays). Replacing CRCs with a stronger hash like
> >  SHA-n is perfectly doable.]
> 
> The checksum cannot protect against a maliciously prepared medium
> anyway, so crypto makes little sense.

In a past life, I wrote a device mapper layer that kept a
cryptographic hash per cluster of the underlying device, with a
top-level digital signature of said hashes. That gets you pretty
close to tamper-proof, in theory. Practice of course is a different
matter, so don't try this at home.
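
For the general shape only, here is a rough sketch of per-cluster
hashes sealed by one top-level tag. It is not the device mapper layer
described above. HMAC-SHA256 with a shared key is swapped in for the
digital signature because the Python standard library has no public-key
signing, and CLUSTER_SIZE and all function names are hypothetical:

```python
import hashlib
import hmac

CLUSTER_SIZE = 4096  # hypothetical cluster size; the message does not give one


def cluster_hashes(device):
    """Hash each fixed-size cluster of the device image."""
    return [
        hashlib.sha256(device[off:off + CLUSTER_SIZE]).digest()
        for off in range(0, len(device), CLUSTER_SIZE)
    ]


def seal(hashes, key):
    """Authenticate the whole hash table (HMAC stands in for a signature)."""
    return hmac.new(key, b"".join(hashes), hashlib.sha256).digest()


def verify_cluster(device, index, hashes, tag, key):
    """Check the table against the tag, then the cluster against the table."""
    if not hmac.compare_digest(seal(hashes, key), tag):
        return False
    start = index * CLUSTER_SIZE
    actual = hashlib.sha256(device[start:start + CLUSTER_SIZE]).digest()
    return hmac.compare_digest(actual, hashes[index])


device = bytes(range(256)) * 64          # a 16 KiB demo image, four clusters
key = b"demo key"                        # placeholder secret for the demo
hashes = cluster_hashes(device)
tag = seal(hashes, key)

tampered = bytearray(device)
tampered[CLUSTER_SIZE + 10] ^= 0xFF      # corrupt one byte in cluster 1
print(verify_cluster(device, 1, hashes, tag, key))
print(verify_cluster(bytes(tampered), 1, hashes, tag, key))
```

An attacker who can rewrite both a cluster and its stored hash still
fails the top-level check unless they also hold the key, which is the
property a plain CRC per block cannot give.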

As it happens, this earlier system was the inspiration for the tile
idea, the integrity parts of which have been kicking around in my head
since I heard ZFS was tracking checksums.

> Crc32 can provably (if you trust those who did the proof) detect all
> 1, 2 and 3-bit errors and has a 1:2^32 chance of missing any
> remaining errors. That is fairly hard to improve on.

Indeed.
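
The single-bit part of that claim is cheap to check by brute force. A
small sketch, assuming the CRC-32 that zlib implements (the thread does
not say which CRC-32 polynomial is meant) and a 4K block; the helper
names are made up for illustration:

```python
import os
import zlib

BLOCK_SIZE = 4096  # bytes, matching the 4K blocks discussed above


def flip_bit(block, bit):
    """Return a copy of block with one bit inverted."""
    damaged = bytearray(block)
    damaged[bit // 8] ^= 1 << (bit % 8)
    return bytes(damaged)


def undetected_single_bit_flips(block):
    """Count single-bit corruptions whose CRC-32 equals the original's."""
    good = zlib.crc32(block)
    return sum(
        1 for bit in range(len(block) * 8)
        if zlib.crc32(flip_bit(block, bit)) == good
    )


block = os.urandom(BLOCK_SIZE)
print(undetected_single_bit_flips(block))
```

This tries all 32768 single-bit flips of one random block and counts
the ones the checksum misses. It says nothing about 2-bit and 3-bit
errors, which would need the proof or a much longer search.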
 
> >  Every time we write to a tile, we must mark the tile dirty. To cut
> >  down time to find dirty tiles, the clean bits can be collected into a
> >  smaller set of blocks, one clean bitmap block per 64GB of data.
> 
> You can and possibly should organize this as a tree, similar to a file.
> One bit at the lowest level marks a tile as dirty.  One bit for each
> indirect block pointer marks some tiles behind the pointer as dirty.
> That scales logarithmically to any filesystem size.

Right. 3 levels take us to 512TB, etc.
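
A toy sketch of such a tree, to make the idea concrete. The DirtyTree
class, its use of Python sets in place of on-disk bitmap blocks, and
the small fanout in the demo are all assumptions for illustration; only
the 4K block and 2MB tile sizes come from the proposal:

```python
BITS_PER_BLOCK = 4096 * 8          # one 4K bitmap block holds 32768 bits
TILE_BYTES = 2 * 1024 * 1024       # 2 MiB tiles, from the proposal
# One leaf bitmap block therefore covers 32768 * 2 MiB = 64 GiB.


class DirtyTree:
    """Multi-level dirty bitmap: a set bit means 'something below is dirty'."""

    def __init__(self, levels, fanout):
        self.fanout = fanout
        # levels[0] is the per-tile level; higher levels summarize groups.
        self.levels = [set() for _ in range(levels)]

    def mark_dirty(self, tile):
        index = tile
        for level in self.levels:
            if index in level:
                break  # this bit is set, so its ancestors already are too
            level.add(index)
            index //= self.fanout

    def mark_clean(self, tile):
        index = tile
        for level in self.levels:
            level.discard(index)
            first = index - index % self.fanout
            if any(i in level for i in range(first, first + self.fanout)):
                break  # a sibling is still dirty, so the parent bit stays set
            index //= self.fanout

    def dirty_tiles(self):
        """Yield dirty tiles, descending only into subtrees whose bit is set."""
        def walk(depth, index):
            if depth == 0:
                yield index
                return
            below = self.levels[depth - 1]
            first = index * self.fanout
            for child in range(first, first + self.fanout):
                if child in below:
                    yield from walk(depth - 1, child)

        top = len(self.levels) - 1
        for index in sorted(self.levels[top]):
            yield from walk(top, index)


tree = DirtyTree(levels=3, fanout=8)   # tiny fanout, just for the demo
for tile in (5, 70, 71):
    tree.mark_dirty(tile)
print(list(tree.dirty_tiles()))
tree.mark_clean(70)
tree.mark_clean(71)
print(list(tree.dirty_tiles()))
```

Marking a tile dirty touches at most one bit per level, and a scan for
dirty tiles skips every subtree whose summary bit is clear, which is
where the logarithmic scaling comes from.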

-- 
Mathematics is the supreme nostalgia of our time.
