From: Bob Peterson <rpeterso@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [gfs2-utils PATCH 24/47] fsck.gfs2: Rework the "undo" functions
Date: Mon, 20 May 2013 09:08:28 -0400 (EDT) [thread overview]
Message-ID: <583844991.17343873.1369055308863.JavaMail.root@redhat.com> (raw)
In-Reply-To: <1368717857.2680.48.camel@menhir>
----- Original Message -----
| > Hi,
| >
| > One thing to bear in mind is that the fsck blockmap is supposed to
| > represent the correct state of all blocks in the on-disk bitmap.
| > The job of pass1 is to build the blockmap, which starts out entirely
| > "free".
| > As the metadata is traversed, the blocks are filled in with the appropriate
| > type.
| >
| Yes, but this cannot be true, as we don't actually always know the true
| correct state of the blocks, so sometimes, blocks will need to be marked
| as unknown until more evidence is available, as I think your example
| shows.
While pass1 is running, we don't know the correct state of every block,
but that's what pass1 is trying to determine. At the end of pass1, we should
know the correct state of every block, except those with duplicate references.
You are correct about the "unknown" state, which is called "free" in the blockmap.
The bitmap and blockmap are two different things, with two different purposes.
The bitmap is the state of the block on the media, and "free" means "free".
The blockmap is what we know about each block, based on reference, and "free" means
the state is "unknown" and so far it has not been referenced by anything.
|
| >
| > At this point, it doesn't make sense to continue marking 0x1025 and beyond
| > as referenced data blocks, because that will only make matters worse.
| >
| I'm not convinced. In that case you have a single reference to a single
| out of range block. All that we need to do there is to reset the pointer
| to 0 (unallocated) and thats it. There is no need to stop processing the
| remainder of the block unless there is some reason to believe that this
| was not just a one off issue.
Actually, the way we have to approach this is a lot more complex.
The major problem here is, and probably always will be, duplicate block
references. If there's a badly corrupt dinode that references 1000 blocks,
and that file should have been deleted, but it's not, due to a bitmap problem,
and those 1000 blocks have been reallocated to 200 different healthy files,
many different problems arise, depending on whether you've processed that
corrupt file before or after the other 200 (and maybe 100 before, and 100 after),
so that some of the blocks are marked as duplicate references and some aren't.
| > Now we've got a problem: Before we knew 0x1000 was corrupt, we marked all
| > its references in the blockmap. We can't just delete the corrupt dinode
| > because most of its blocks are in-use by that other dinode.
| >
| I think the confusion seems to stem from doing things in the wrong
| order. What we should be doing is verifying the tree structure of the
| filesystem first, and then after that is done, updating the bitmaps to
| match, in case there is a mismatch between the actual fs structure and
| the bitmaps.
That's exactly what we do. Pass1 verifies the file system tree structure, by
building its blockmap. Pass5 updates the bitmaps to match the blockmap built
by pass1.
| Also, we can use the initial state of the bitmaps in order to find more
| objects to look at (i.e. in case of unlinked inodes) and also use
| pointers in the filesystem in the same way (in case of directory entries
| which point to non-inode blocks) for example. But neither of those
| things requires undoing anything so far as I can see.
|
| So what we ought to have is something like this:
|
| - Start looking at bitmaps to find initial set of things to check
| - Start checking inodes, adding additional blocks to list of things to
| check as we go
| - Once finished, check bitmaps against list to ensure consistency
| against the tree
Yes, this would work ideally, but only for metadata blocks. You still need
some way to determine the state of the bitmap versus the references in order
to ensure there aren't duplicate block references. Unless, of course, the
list includes the data blocks as well as metadata, but then it would take an
enormous amount of memory to accomplish: at least 8X more memory than the
entire file system size. If you don't include the data blocks, you still
need some means to ensure the same blocks aren't referenced multiple times,
and any proposed solution would have to take into account the fact that
blocks can be multiple-referenced as either metadata or data from multiple
sources. Using a 2-bit bitmap and a corresponding 8-bit blockmap requires
much less memory, which is what we have today.
Even if the list includes only metadata, I'm concerned about the amount
of memory it would take, especially for common use cases like email, where
the file system contains many terabytes of tiny files.
| Obviously that is rather simplified since there are a few extras we need
| to deal with some corner cases, but that should be the core of it. If
| each rgrp is checked as we go, then it should be possible to do with
| just one pass through the fs.
|
| And I know that we will not be able to do that immediately, but that is
| the kind of structure that we probably should be working towards over
| time. So this isn't really something for this patch set, but something
| that we should be looking into in due course,
|
| Steve.
Bob Peterson
Red Hat File Systems
next prev parent reply other threads:[~2013-05-20 13:08 UTC|newest]
Thread overview: 59+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-05-14 16:21 [Cluster-devel] [gfs2-utils PATCH 01/47] libgfs2: externalize dir_split_leaf Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 02/47] libgfs2: allow dir_split_leaf to receive a leaf buffer Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 03/47] libgfs2: let dir_split_leaf receive a "broken" lindex Bob Peterson
2013-05-15 16:01 ` Steven Whitehouse
2013-05-20 16:02 ` Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 04/47] fsck.gfs2: Move function find_free_blk to util.c Bob Peterson
2013-05-15 16:04 ` Steven Whitehouse
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 05/47] fsck.gfs2: Split out function to make sure lost+found exists Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 06/47] fsck.gfs2: Check for formal inode mismatch when adding to lost+found Bob Peterson
2013-05-15 16:08 ` Steven Whitehouse
2013-05-17 12:47 ` Bob Peterson
2013-05-17 12:55 ` Steven Whitehouse
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 07/47] fsck.gfs2: shorten some debug messages in lost+found Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 08/47] fsck.gfs2: Move basic directory entry checks to separate function Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 09/47] fsck.gfs2: Add formal inode check to basic dirent checks Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 10/47] fsck.gfs2: Add new function to check dir hash tables Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 11/47] fsck.gfs2: Special case '..' when processing bad formal inode number Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 12/47] fsck.gfs2: Move function to read directory hash table to util.c Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 13/47] fsck.gfs2: Misc cleanups Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 14/47] fsck.gfs2: Verify dirent hash values correspond to proper leaf block Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 15/47] fsck.gfs2: re-read hash table if directory height or depth changes Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 16/47] fsck.gfs2: fix leaf blocks, don't try to patch the hash table Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 17/47] fsck.gfs2: check leaf depth when validating leaf blocks Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 18/47] fsck.gfs2: small cleanups Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 19/47] fsck.gfs2: reprocess inodes when blocks are added Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 20/47] fsck.gfs2: Remove redundant leaf depth check Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 21/47] fsck.gfs2: link dinodes that only have extended attribute problems Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 22/47] fsck.gfs2: Add clarifying message to duplicate processing Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 23/47] fsck.gfs2: separate function to calculate metadata block header size Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 24/47] fsck.gfs2: Rework the "undo" functions Bob Peterson
2013-05-16 13:27 ` Steven Whitehouse
2013-05-16 13:49 ` Bob Peterson
2013-05-16 14:02 ` Steven Whitehouse
2013-05-16 15:02 ` Bob Peterson
2013-05-16 15:24 ` Steven Whitehouse
2013-05-20 13:08 ` Bob Peterson [this message]
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 25/47] fsck.gfs2: Check for interrupt when resolving duplicates Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 26/47] fsck.gfs2: Consistent naming of struct duptree variables Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 27/47] fsck.gfs2: Keep proper counts when duplicates are found Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 28/47] fsck.gfs2: print metadata block reference on data errors Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 29/47] fsck.gfs2: print block count values when fixing them Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 30/47] fsck.gfs2: Do not invalidate metablocks of dinodes with invalid mode Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 31/47] fsck.gfs2: Log when unrecoverable data block errors are encountered Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 32/47] fsck.gfs2: don't remove buffers from the list when errors are found Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 33/47] fsck.gfs2: Don't flag GFS1 non-dinode blocks as duplicates Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 34/47] fsck.gfs2: externalize check_leaf Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 35/47] fsck.gfs2: pass2: check leaf blocks when fixing hash table Bob Peterson
2013-05-14 16:21 ` [Cluster-devel] [gfs2-utils PATCH 36/47] fsck.gfs2: standardize check_metatree return codes Bob Peterson
2013-05-14 16:22 ` [Cluster-devel] [gfs2-utils PATCH 37/47] fsck.gfs2: don't invalidate files with duplicate data block refs Bob Peterson
2013-05-14 16:22 ` [Cluster-devel] [gfs2-utils PATCH 38/47] fsck.gfs2: check for duplicate first references Bob Peterson
2013-05-14 16:22 ` [Cluster-devel] [gfs2-utils PATCH 39/47] fsck.gfs2: When flagging a duplicate reference, show valid or invalid Bob Peterson
2013-05-14 16:22 ` [Cluster-devel] [gfs2-utils PATCH 40/47] fsck.gfs2: major duplicate reference reform Bob Peterson
2013-05-14 16:22 ` [Cluster-devel] [gfs2-utils PATCH 41/47] fsck.gfs2: Remove all bad eattr blocks Bob Peterson
2013-05-14 16:22 ` [Cluster-devel] [gfs2-utils PATCH 42/47] fsck.gfs2: Remove unused variable Bob Peterson
2013-05-14 16:22 ` [Cluster-devel] [gfs2-utils PATCH 43/47] fsck.gfs2: double-check transitions from dinode to data Bob Peterson
2013-05-14 16:22 ` [Cluster-devel] [gfs2-utils PATCH 44/47] fsck.gfs2: Stop "undo" process when error data block is reached Bob Peterson
2013-05-14 16:22 ` [Cluster-devel] [gfs2-utils PATCH 45/47] fsck.gfs2: Don't allocate leaf blocks in pass1 Bob Peterson
2013-05-14 16:22 ` [Cluster-devel] [gfs2-utils PATCH 46/47] fsck.gfs2: take hash table start boundaries into account Bob Peterson
2013-05-14 16:22 ` [Cluster-devel] [gfs2-utils PATCH 47/47] fsck.gfs2: delete all duplicates from unrecoverable damaged dinodes Bob Peterson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=583844991.17343873.1369055308863.JavaMail.root@redhat.com \
--to=rpeterso@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).