David Chinner wrote:
> On Fri, Aug 31, 2007 at 02:01:37PM +1000, Mark Goodwin wrote:
>> Lachlan McIlroy wrote:
>>> Timothy Shimmin wrote:
>>>> Timothy Shimmin wrote:
>>>>>>>  But I'm not sure this is an error...
>>>>>>>  Hmmmm...I'm a bit confused.
>>>>>>>  So you are _almost_ combining an error check with a flushiter check?
>>>>>>>  If one buffer is an inode magic# and the other isn't then we
>>>>>>>  have an error right - and could report it - but we are not doing
>>>>>>>  that here.
>>>>>> Not exactly.  If what's on disk is not an inode but the log item is
>>>>>> then that could be because we haven't written the inode to disk yet
>>>>>> and we need to perform recovery.
>>>>> Yeah, I was thinking about that afterward.
>>>>> The item's format which gives the blk# for the buf to read could
>>>>> be a block which hasn't been used for an inode yet.
>>>>>
>>>> Well, if what's on disk is not an inode but some other data
>>>> and it happens to have the inode magic# which is remotely possible,
>>>> then we are making a bad assumption.
>>>> i.e. if we're not sure what the block/buffer should be, then testing the
>>>> MAGIC# isn't a guarantee it's an inode then.
>>>> Well not for the freeing of inode clusters case I would assume.
>>>> Or am I missing something?
>>> I don't think you're missing anything!
>>>
>>> You're right though - a magic number check is no guarantee.  In the same
>>> vein, adding a generation number check isn't much better.
>> Would unlink have to invalidate the on-disk inode magic number? Or only
>> when the whole cluster is freed?
> 
> An unlinked inode is only detectable by the mode parameter being zero.
> The rest of the inode will look valid.
> 
> To detect the difference between a newly allocated inode *chunk*
> that has been written to and a stale inode chunk that we have
> just allocated and not written to yet, you need to walk every inode
> in the chunk and determine if the mode parameter is zero in every
> inode.
> 
> If the mode is zero for all inodes and there are generation numbers
> that are not zero, then you've detected a stale buffer and you should
> replay the inode cluster buffer initialisation.
> 

Thanks for this info, Dave.  I looked into it and came up with a solution
that looks at the on-disk inode buffer and determines if it has been
written to since being logged.  It iterates through all the inodes and
checks each one as follows:

- if the magic number is wrong the buffer is stale
- if the mode is non-zero then the buffer is newer than the log
- if the mode is zero and the generation count is non-zero then the
   buffer is stale

If the end result is a stale buffer then the buffer is replayed, otherwise
it is skipped.  I added a new flag that gets logged with a new inode
cluster so that we can distinguish a buffer of inodes from something else.
This fix is passing all the tests we have.  Is this a better approach
than the last fix?
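
For reference, the per-inode checks boil down to roughly the sketch below.
This is illustrative only and is not the actual patch: the struct, the field
names and the function name are made up, the fields are shown in host byte
order rather than on-disk format, and the way mixed results are combined (any
non-zero mode means the buffer is newer; an all-zero buffer is skipped) is an
assumption based on Dave's description above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inode magic "IN"; the SKETCH_ prefix marks this as an illustrative name. */
#define SKETCH_DINODE_MAGIC 0x494e

/* Hypothetical, simplified view of the fields the check looks at. */
struct sketch_dinode {
	uint16_t magic;	/* inode magic number */
	uint16_t mode;	/* zero for a free or unlinked inode */
	uint32_t gen;	/* generation count */
};

/*
 * Return true if the buffer looks stale and should be replayed,
 * false if it looks newer than the log and replay should be skipped.
 */
static bool
sketch_inode_buf_is_stale(const struct sketch_dinode *inodes, size_t count)
{
	bool saw_gen = false;

	for (size_t i = 0; i < count; i++) {
		/* Wrong magic number: not an inode, so the buffer is stale. */
		if (inodes[i].magic != SKETCH_DINODE_MAGIC)
			return true;

		/* An inode in use: the buffer was written after being logged. */
		if (inodes[i].mode != 0)
			return false;

		/* Free inode with a used generation count: remember it. */
		if (inodes[i].gen != 0)
			saw_gen = true;
	}

	/*
	 * Every mode was zero.  Stale if any generation count was non-zero.
	 * Assumption: a buffer with all modes and generations zero is skipped.
	 */
	return saw_gen;
}
```

The caller would replay the logged buffer when this returns true and skip it
otherwise.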

Lachlan