From mboxrd@z Thu Jan 1 00:00:00 1970 From: Steven Whitehouse Date: Thu, 16 May 2013 16:24:17 +0100 Subject: [Cluster-devel] [gfs2-utils PATCH 24/47] fsck.gfs2: Rework the "undo" functions In-Reply-To: <439495569.10483675.1368716527959.JavaMail.root@redhat.com> References: <143016273297ab30fb85274ea801de7df7fd8726.1368548305.git.rpeterso@redhat.com> <1368710854.2680.28.camel@menhir> <1405920527.10229516.1368712143340.JavaMail.root@redhat.com> <1368712925.2680.33.camel@menhir> <439495569.10483675.1368716527959.JavaMail.root@redhat.com> Message-ID: <1368717857.2680.48.camel@menhir> List-Id: To: cluster-devel.redhat.com MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Hi, On Thu, 2013-05-16 at 11:02 -0400, Bob Peterson wrote: > ----- Original Message ----- > | Yes, but the undo side of things worries me... it is very easy to get > | tied in knots doing that. The question is what is "damage it can't > | recover from"? this is a bit vague and doesn't really explain what is > | going on here. > | > | I don't yet understand why we'd need to run through each inodes metadata > | tree more than once in this case, > | > | Steve. > > Hi, > > One thing to bear in mind is that the fsck blockmap is supposed to > represent the correct state of all blocks in the on-disk bitmap. > The job of pass1 is to build the blockmap, which starts out entirely "free". > As the metadata is traversed, the blocks are filled in with the appropriate > type. > Yes, but this cannot be true, as we don't actually always know the true correct state of the blocks, so sometimes, blocks will need to be marked as unknown until more evidence is available, as I think your example shows. > The job of pass5 is to synchronize the on-disk bitmap to the blockmap. > > So we must ensure that the blockmap is accurate at ALL times after pass1. > One of the primary checks pass1 does is to make sure that a block is "free" > in the blockmap before changing its designation, otherwise it's a duplicate > block reference that must be resolved in pass1b. > > Here's an example: Suppose you have a file with di_height==2, two levels of > indirection. Suppose the dinode is layed out something like this: > > dinode indirect data > ------ -------- ------ > 0x1000 - dinode > ---> 0x1001 > ---> 0x1002 > ---> 0x1003 > ... > ---> 0x1010 > ---> 0x1011 > ---> 0x1012 > ---> 0x1013 > ... > ---> 0x1020 > ---> 0x1021 > ---> 0x1022 > ---> 0x1023 > ---> 0x7777777777777777777 > ---> 0x1025 > ... > ---> 0x1030 > > Now let's further suppose that this file was supposed to be deleted, > and many of its blocks were in fact reused by a newer, valid dinode, > but somehow, the bitmap was corrupted into saying this dinode is still > alive (a dinode, not free or unlinked). > > For the sake of argument, say that second dinode appears later in the > bitmap, so pass1 gets to corrupt dinode 0x1000 before it gets to the > valid dinode that correctly references the blocks. > > As it traverses the metadata tree, it builds an array of lists, one for > each height. Each item in the linked list corresponds to a metadata block. > So pass1 traverses the array, marks down in its blockmap that block 0x1000 is > dinode, blocks 0x1001, 0x1011, and 0x1021 are metadata blocks. Then it > processes the data block pointers within the metadata blocks, marking > 0x1002, 0x1003, all the way up to 1023 as "data" blocks. > When it hits the block 0x7777777777777777777, it determines that's out > of range for the device, and therefore the data file has an unrecoverable > data block error. > > At this point, it doesn't make sense to continue marking 0x1025 and beyond > as referenced data blocks, because that will only make matters worse. > I'm not convinced. In that case you have a single reference to a single out of range block. All that we need to do there is to reset the pointer to 0 (unallocated) and thats it. There is no need to stop processing the remainder of the block unless there is some reason to believe that this was not just a one off issue. > Now we've got a problem: Before we knew 0x1000 was corrupt, we marked all > its references in the blockmap. We can't just delete the corrupt dinode > because most of its blocks are in-use by that other dinode. > I think the confusion seems to stem from doing things in the wrong order. What we should be doing is verifying the tree structure of the filesystem first, and then after that is done, updating the bitmaps to match, in case there is a mismatch between the actual fs structure and the bitmaps. Also, we can use the initial state of the bitmaps in order to find more objects to look at (i.e. in case of unlinked inodes) and also use pointers in the filesystem in the same way (in case of directory entries which point to non-inode blocks) for example. But neither of those things requires undoing anything so far as I can see. So what we ought to have is something like this: - Start looking at bitmaps to find initial set of things to check - Start checking inodes, adding additional blocks to list of things to check as we go - Once finished, check bitmaps against list to ensure consistency against the tree Obviously that is rather simplified since there are a few extras we need to deal with some corner cases, but that should be the core of it. If each rgrp is checked as we go, then it should be possible to do with just one pass through the fs. And I know that we will not be able to do that immediately, but that is the kind of structure that we probably should be working towards over time. So this isn't really something for this patch set, but something that we should be looking into in due course, Steve.