From: Matt Mackall <mpm@selenic.com>
To: linux-fsdevel@vger.kernel.org
Subject: [RFC] TileFS - a proposal for scalable integrity checking
Date: Sat, 28 Apr 2007 17:05:22 -0500 [thread overview]
Message-ID: <20070428220522.GN11166@waste.org> (raw)
This is a relatively simple scheme for making a filesystem with
incremental online consistency checks of both data and metadata.
Overhead can be well under 1% disk space and CPU overhead may also be
very small, while greatly improving filesystem integrity.
Some things we need to check during fsck:
all directories point to in-use inodes
all in-use inodes are referred to by directories
all inodes in use are marked in use
all free inodes are marked free
all inodes point to in-use blocks
all inode refcounts are correct
all inode block counts are correct
free inode count is correct
no blocks are used twice
all used blocks are marked as used
all free blocks are marked as free
optional: all block contents are correct
Layout:
Start with a conventional ext2/3-like filesystem.
Divide disk into a bunch of tiles. For each tile, allocate a one
block tile header that contains (inode, checksum) pairs for each
block in the tile. Unused blocks get marked inode -1, filesystem
metadata blocks -2. The first element contains a last-clean
timestamp, a clean flag and a checksum for the block itself. For 4K
blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB),
with ~.2% overhead.
[Note that CRCs are optional so we can cut the overhead in half. I
choose CRCs here because they're capable of catching the vast
majority of accidental corruptions at a small cost and mostly serve
to protect against errors not caught by on-disk ECC (eg cable noise,
kernel bugs, cosmic rays). Replacing CRCs with a stronger hash like
SHA-n is perfectly doable.]
Every time we write to a tile, we must mark the tile dirty. To cut
down time to find dirty tiles, the clean bits can be collected into a
smaller set of blocks, one clean bitmap block per 64GB of data.
Inodes have a backpointer to a directory that links them. Hardlinked
files have two extra inode pointers in the directory structure, to
the previous and next directories containing the link. Hardlinked
inodes have a checksum of the members of that list.
Checking a tile:
Read the tile
If clean and current, we're done.
Check the tile header checksum
Check the checksum on each block in the tile
Check that metadata blocks are metadata
Check that inodes in tile agree with inode bitmap
Check that data blocks in tile agree with block bitmap
Check for each inode covering tile according to header:
Marked in-use
Blocks are owned by the inode
Blocks incorrectly owned put on list for tree sweep or orphaned
Block count is right
Traverse backpointer linked list, check link count and checksum
Inodes with incorrect refcount on tree sweep list or orphaned
If directory:
Points to in-use inodes
Mark tile clean
If all tiles are simultaneously clean, fs is clean.
Tree sweeps and orphans:
For damaged backpointers, we may want to walk the directory tree to
repair the links. Memory usage here is bounded by the size of the
orphan list and we may be able to trim the walk when examining files
or directories inside of clean tiles.
Alternately, we can move orphaned blocks and inodes to lost+found/
and forgo the more expensive tree walk.
Performance implications:
Amount of RAM needed to check an FS is tiny
Tile headers are very near their blocks, so usually don't require a seek
Caching overhead is relatively small
Most actual tile header writes can be folded in the journal
Hardlinks are a bit of a pain, but somewhat rare
We can cache checks of inodes covering multiple tiles
Huge file deletes have to hit even more blocks than ext2/3
It's antithetical to using extents, as is any system with block CRCs
Keeping the average read-mostly filesystem fully-checked should be easy
--
Mathematics is the supreme nostalgia of our time.
next reply other threads:[~2007-04-28 22:17 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-04-28 22:05 Matt Mackall [this message]
2007-04-29 12:21 ` [RFC] TileFS - a proposal for scalable integrity checking Jörn Engel
2007-04-29 12:57 ` Matt Mackall
2007-04-29 15:47 ` Jörn Engel
2007-05-09 5:56 ` Valerie Henson
2007-05-09 10:12 ` Jörn Engel
2007-04-29 15:58 ` Jörn Engel
2007-04-29 16:24 ` Matt Mackall
2007-04-29 16:34 ` Andi Kleen
2007-04-29 16:05 ` Jörn Engel
2007-04-29 16:09 ` Matt Mackall
2007-04-29 23:23 ` Theodore Tso
2007-04-30 1:40 ` Matt Mackall
2007-04-30 17:26 ` Theodore Tso
2007-04-30 17:59 ` Matt Mackall
2007-05-02 13:18 ` Jörn Engel
2007-05-02 13:32 ` Jörn Engel
2007-05-02 15:37 ` Matt Mackall
2007-05-02 16:35 ` Jörn Engel
2007-05-09 7:56 ` Valerie Henson
2007-05-09 11:16 ` Nikita Danilov
2007-05-09 18:56 ` Valerie Henson
2007-05-09 19:19 ` Nikita Danilov
2007-05-09 17:06 ` Matt Mackall
2007-05-09 18:59 ` Valerie Henson
2007-05-09 19:51 ` Matt Mackall
2007-05-10 0:03 ` Jörn Engel
2007-05-11 9:46 ` Valerie Henson
2007-05-11 15:55 ` Matt Mackall
2007-05-09 19:01 ` Valerie Henson
2007-05-09 20:05 ` Matt Mackall
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20070428220522.GN11166@waste.org \
--to=mpm@selenic.com \
--cc=linux-fsdevel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).