[PATCH] log replay should not overwrite newer ondisk inodes

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] log replay should not overwrite newer ondisk inodes
@ 2007-08-30  2:12 Lachlan McIlroy
  2007-08-30  4:31 ` Timothy Shimmin
  2007-09-04 23:05 ` Shailendra Tripathi
  0 siblings, 2 replies; 21+ messages in thread
From: Lachlan McIlroy @ 2007-08-30  2:12 UTC (permalink / raw)
  To: xfs-dev; +Cc: xfs-oss

[-- Attachment #1: Type: text/plain, Size: 555 bytes --]

Log replay of clustered inodes currently ignores the flushiter
field in the inode that is used to determine if the on-disk inode
is more up to date than the copy in the log.  As a result during
log replay the newer inode is being overwritten with an older
version and file size updates are being lost.

I haven't handled the case of the flushiter counter overflowing
but that shouldn't be a problem in this case.  The log buffer
contains newly created inodes so their flushiter values will be 0
and the on-disk inodes should not be much greater.

Lachlan

[-- Attachment #2: xfs_log_recover.diff --]
[-- Type: text/x-patch, Size: 1278 bytes --]

--- fs/xfs/xfs_log_recover.c_1.322	2007-08-27 17:45:45.000000000 +1000
+++ fs/xfs/xfs_log_recover.c	2007-08-30 11:50:44.000000000 +1000
@@ -1866,6 +1866,27 @@ xlog_recover_do_inode_buffer(
 }
 
 /*
+ * Check if we need to recover an inode from a buffer
+ */
+int
+xfs_recover_inode(
+	char	*dest,
+	char	*src)
+{
+	xfs_dinode_t	*dip = (xfs_dinode_t *)dest;
+	xfs_dinode_t	*dilp = (xfs_dinode_t*)src;
+
+	if ((be16_to_cpu(dip->di_core.di_magic) == XFS_DINODE_MAGIC) &&
+		(be16_to_cpu(dilp->di_core.di_magic) == XFS_DINODE_MAGIC) &&
+		(be16_to_cpu(dilp->di_core.di_flushiter) <
+			be16_to_cpu(dip->di_core.di_flushiter))) {
+		return 1;
+	}
+
+	return 0;
+}
+
+/*
  * Perform a 'normal' buffer recovery.  Each logged region of the
  * buffer should be copied over the corresponding region in the
  * given buffer.  The bitmap in the buf log format structure indicates
@@ -1917,6 +1938,13 @@ xlog_recover_do_reg_buffer(
 					       -1, 0, XFS_QMOPT_DOWARN,
 					       "dquot_buf_recover");
 		}
+		/*
+		 * Sanity check if this is an inode buffer.
+		 */
+		if (!error)
+			error = xfs_recover_inode(xfs_buf_offset(bp,
+					(uint)bit << XFS_BLI_SHIFT),
+					item->ri_buf[i].i_addr);
 		if (!error)
 			memcpy(xfs_buf_offset(bp,
 				(uint)bit << XFS_BLI_SHIFT),	/* dest */

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-08-30  2:12 [PATCH] log replay should not overwrite newer ondisk inodes Lachlan McIlroy
@ 2007-08-30  4:31 ` Timothy Shimmin
  2007-08-30  4:50   ` Lachlan McIlroy
  2007-08-30 14:02   ` David Chinner
  2007-09-04 23:05 ` Shailendra Tripathi
  1 sibling, 2 replies; 21+ messages in thread
From: Timothy Shimmin @ 2007-08-30  4:31 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: xfs-dev, xfs-oss

Lachlan McIlroy wrote:
> Log replay of clustered inodes currently ignores the flushiter
> field in the inode that is used to determine if the on-disk inode
> is more up to date than the copy in the log.  As a result during
> log replay the newer inode is being overwritten with an older
> version and file size updates are being lost.
> 
> I haven't handled the case of the flushiter counter overflowing
> but that shouldn't be a problem in this case.  The log buffer
> contains newly created inodes so their flushiter values will be 0
> and the on-disk inodes should not be much greater.
> 
> Lachlan
> 

Still would want to understand why blf_flags doesn't have
XFS_BLI_INODE_ALLOC_BUF set and so we could test that
- I didn't understand Dave's (dgc) response about that earlier.
But anyway...:)


Just some nit-picks:

* instead of a comment and then calling xfs_recover_inode
   why don't we just give the function a better name.
   When I see the name I think it is going to recover the inode
   but instead it is just a flush_iter check and matching
   the magic#s.

* My style preference is for option 2 over option 1:
   1.
     if (a) {
        return 1;
     }
     return 0;

   2.
     return a;

   But I understand the predicate is long and sometimes people like
   to add tracing/debug for the alternate paths in which case (1)
   is handy for that. I just like simplicity.

* Also it is funny to return a boolean for error in the call.
   Although, I see that is consistent with the rest of the function.
   But I'm not sure this is an error...
   Hmmmm...I'm a bit confused.
   So you are _almost_ combining an error check with a flushiter check?
   If one buffer is an inode magic# and the other isn't then we
   have an error right - and could report it - but we are not doing that here.
   If we have an inode buf and the flushiter on the item is older than the ondisk buffer
   one then it is not an error but we just don't want to recover it.



--Tim


--- fs/xfs/xfs_log_recover.c_1.322	2007-08-27 17:45:45.000000000 +1000
+++ fs/xfs/xfs_log_recover.c	2007-08-30 11:50:44.000000000 +1000
@@ -1866,6 +1866,27 @@ xlog_recover_do_inode_buffer(
  }

  /*
+ * Check if we need to recover an inode from a buffer
+ */
+int
+xfs_recover_inode(
+	char	*dest,
+	char	*src)
+{
+	xfs_dinode_t	*dip = (xfs_dinode_t *)dest;
+	xfs_dinode_t	*dilp = (xfs_dinode_t*)src;
+
+	if ((be16_to_cpu(dip->di_core.di_magic) == XFS_DINODE_MAGIC) &&
+		(be16_to_cpu(dilp->di_core.di_magic) == XFS_DINODE_MAGIC) &&
+		(be16_to_cpu(dilp->di_core.di_flushiter) <
+			be16_to_cpu(dip->di_core.di_flushiter))) {
+		return 1;
+	}
+
+	return 0;
+}
+
+/*
   * Perform a 'normal' buffer recovery.  Each logged region of the
   * buffer should be copied over the corresponding region in the
   * given buffer.  The bitmap in the buf log format structure indicates
@@ -1917,6 +1938,13 @@ xlog_recover_do_reg_buffer(
  					       -1, 0, XFS_QMOPT_DOWARN,
  					       "dquot_buf_recover");
  		}
+		/*
+		 * Sanity check if this is an inode buffer.
+		 */
+		if (!error)
+			error = xfs_recover_inode(xfs_buf_offset(bp,
+					(uint)bit << XFS_BLI_SHIFT),
+					item->ri_buf[i].i_addr);
  		if (!error)
  			memcpy(xfs_buf_offset(bp,
  				(uint)bit << XFS_BLI_SHIFT),	/* dest */

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-08-30  4:31 ` Timothy Shimmin
@ 2007-08-30  4:50   ` Lachlan McIlroy
  2007-08-30  8:29     ` Timothy Shimmin
  2007-08-30 14:02   ` David Chinner
  1 sibling, 1 reply; 21+ messages in thread
From: Lachlan McIlroy @ 2007-08-30  4:50 UTC (permalink / raw)
  To: Timothy Shimmin; +Cc: xfs-dev, xfs-oss

Timothy Shimmin wrote:
> Lachlan McIlroy wrote:
>> Log replay of clustered inodes currently ignores the flushiter
>> field in the inode that is used to determine if the on-disk inode
>> is more up to date than the copy in the log.  As a result during
>> log replay the newer inode is being overwritten with an older
>> version and file size updates are being lost.
>>
>> I haven't handled the case of the flushiter counter overflowing
>> but that shouldn't be a problem in this case.  The log buffer
>> contains newly created inodes so their flushiter values will be 0
>> and the on-disk inodes should not be much greater.
>>
>> Lachlan
>>
> 
> Still would want to understand why blf_flags doesn't have
> XFS_BLI_INODE_ALLOC_BUF set and so we could test that
> - I didn't understand Dave's (dgc) response about that earlier.
> But anyway...:)
> 
> 
> Just some nit-picks:
> 
> * instead of a comment and then calling xfs_recover_inode
>   why don't we just give the function a better name.
>   When I see the name I think it is going to recover the inode
>   but instead it is just a flush_iter check and matching
>   the magic#s.
What would you like the function to be called?

> 
> * My style preference is for option 2 over option 1:
>   1.
>     if (a) {
>        return 1;
>     }
>     return 0;
> 
>   2.
>     return a;
> 
>   But I understand the predicate is long and sometimes people like
>   to add tracing/debug for the alternate paths in which case (1)
>   is handy for that. I just like simplicity.
I can change that.  It just looks odd when the body of the function
is after the return statement.

> 
> * Also it is funny to return a boolean for error in the call.
>   Although, I see that is consistent with the rest of the function.
Yep, keeping consistent with coding standards.

>   But I'm not sure this is an error...
>   Hmmmm...I'm a bit confused.
>   So you are _almost_ combining an error check with a flushiter check?
>   If one buffer is an inode magic# and the other isn't then we
>   have an error right - and could report it - but we are not doing that 
> here.
Not exactly.  If what's on disk is not an inode but the log item is
then that could be because we haven't written the inode to disk yet
and we need to perform recovery.  If what's on disk is an inode but
the log item is not an inode then what we are about to recover is not
an inode and we abort the check.  If neither on-disk nor log item are
inodes then we've got no business continuing the check.

>   If we have an inode buf and the flushiter on the item is older than 
> the ondisk buffer
>   one then it is not an error but we just don't want to recover it.
Exactly.

> 
> 
> 
> --Tim
> 
> 
> --- fs/xfs/xfs_log_recover.c_1.322    2007-08-27 17:45:45.000000000 +1000
> +++ fs/xfs/xfs_log_recover.c    2007-08-30 11:50:44.000000000 +1000
> @@ -1866,6 +1866,27 @@ xlog_recover_do_inode_buffer(
>  }
> 
>  /*
> + * Check if we need to recover an inode from a buffer
> + */
> +int
> +xfs_recover_inode(
> +    char    *dest,
> +    char    *src)
> +{
> +    xfs_dinode_t    *dip = (xfs_dinode_t *)dest;
> +    xfs_dinode_t    *dilp = (xfs_dinode_t*)src;
> +
> +    if ((be16_to_cpu(dip->di_core.di_magic) == XFS_DINODE_MAGIC) &&
> +        (be16_to_cpu(dilp->di_core.di_magic) == XFS_DINODE_MAGIC) &&
> +        (be16_to_cpu(dilp->di_core.di_flushiter) <
> +            be16_to_cpu(dip->di_core.di_flushiter))) {
> +        return 1;
> +    }
> +
> +    return 0;
> +}
> +
> +/*
>   * Perform a 'normal' buffer recovery.  Each logged region of the
>   * buffer should be copied over the corresponding region in the
>   * given buffer.  The bitmap in the buf log format structure indicates
> @@ -1917,6 +1938,13 @@ xlog_recover_do_reg_buffer(
>                             -1, 0, XFS_QMOPT_DOWARN,
>                             "dquot_buf_recover");
>          }
> +        /*
> +         * Sanity check if this is an inode buffer.
> +         */
> +        if (!error)
> +            error = xfs_recover_inode(xfs_buf_offset(bp,
> +                    (uint)bit << XFS_BLI_SHIFT),
> +                    item->ri_buf[i].i_addr);
>          if (!error)
>              memcpy(xfs_buf_offset(bp,
>                  (uint)bit << XFS_BLI_SHIFT),    /* dest */
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-08-30  4:50   ` Lachlan McIlroy
@ 2007-08-30  8:29     ` Timothy Shimmin
  2007-08-30  8:51       ` Timothy Shimmin
  2007-08-31  2:14       ` Lachlan McIlroy
  0 siblings, 2 replies; 21+ messages in thread
From: Timothy Shimmin @ 2007-08-30  8:29 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: xfs-dev, xfs-oss

Lachlan McIlroy wrote:
> Timothy Shimmin wrote:
>> Lachlan McIlroy wrote:
>>> Log replay of clustered inodes currently ignores the flushiter
>>> field in the inode that is used to determine if the on-disk inode
>>> is more up to date than the copy in the log.  As a result during
>>> log replay the newer inode is being overwritten with an older
>>> version and file size updates are being lost.
>>>
>>> I haven't handled the case of the flushiter counter overflowing
>>> but that shouldn't be a problem in this case.  The log buffer
>>> contains newly created inodes so their flushiter values will be 0
>>> and the on-disk inodes should not be much greater.
>>>
>>> Lachlan
>>>
>>
>> Still would want to understand why blf_flags doesn't have
>> XFS_BLI_INODE_ALLOC_BUF set and so we could test that
>> - I didn't understand Dave's (dgc) response about that earlier.
>> But anyway...:)
>>
>>
>> Just some nit-picks:
>>
>> * instead of a comment and then calling xfs_recover_inode
>>   why don't we just give the function a better name.
>>   When I see the name I think it is going to recover the inode
>>   but instead it is just a flush_iter check and matching
>>   the magic#s.
> What would you like the function to be called?
> 
Dunno :)
xfs_flushiter_check

>>
>> * My style preference is for option 2 over option 1:
>>   1.
>>     if (a) {
>>        return 1;
>>     }
>>     return 0;
>>
>>   2.
>>     return a;
>>
>>   But I understand the predicate is long and sometimes people like
>>   to add tracing/debug for the alternate paths in which case (1)
>>   is handy for that. I just like simplicity.
> I can change that.  It just looks odd when the body of the function
> is after the return statement.
> 
It doesn't look funny to me :-)
But I'm not really fussed - do what you prefer.

>>
>> * Also it is funny to return a boolean for error in the call.
>>   Although, I see that is consistent with the rest of the function.
> Yep, keeping consistent with coding standards.
Well consistent with that function at least :)

> 
>>   But I'm not sure this is an error...
>>   Hmmmm...I'm a bit confused.
>>   So you are _almost_ combining an error check with a flushiter check?
>>   If one buffer is an inode magic# and the other isn't then we
>>   have an error right - and could report it - but we are not doing 
>> that here.
> Not exactly.  If what's on disk is not an inode but the log item is
> then that could be because we haven't written the inode to disk yet
> and we need to perform recovery.
Yeah, I was thinking about that afterward.
The item's format which gives the blk# for the buf to read could
be a block which hasn't been used for an inode yet.

OOI,
Say the inode block was freed in the past and we freshly allocate one using
this inode buf item in replay, so that the old block has the inode magic#
but it has an old large flushiter. We then say it is newer when
really it is older.
We could check for zero mode or different gen#.
Or perhaps we reset the flushiter when we free the inode or
do we go thru a different replay path.


 > If what's on disk is an inode but
> the log item is not an inode then what we are about to recover is not
> an inode and we abort the check.
Well yes we are only interested in inode buf items.

> If neither on-disk nor log item are
> inodes then we've got no business continuing the check.
Well as above it aint even an inode in the log.

> 
>>   If we have an inode buf and the flushiter on the item is older than 
>> the ondisk buffer
>>   one then it is not an error but we just don't want to recover it.
> Exactly.
> 
So its not an error then yet we are calling it that.
So perhaps we could have:

if (!error && !old_inode_item) {
....

--Tim






>> --- fs/xfs/xfs_log_recover.c_1.322    2007-08-27 17:45:45.000000000 +1000
>> +++ fs/xfs/xfs_log_recover.c    2007-08-30 11:50:44.000000000 +1000
>> @@ -1866,6 +1866,27 @@ xlog_recover_do_inode_buffer(
>>  }
>>
>>  /*
>> + * Check if we need to recover an inode from a buffer
>> + */
>> +int
>> +xfs_recover_inode(
>> +    char    *dest,
>> +    char    *src)
>> +{
>> +    xfs_dinode_t    *dip = (xfs_dinode_t *)dest;
>> +    xfs_dinode_t    *dilp = (xfs_dinode_t*)src;
>> +
>> +    if ((be16_to_cpu(dip->di_core.di_magic) == XFS_DINODE_MAGIC) &&
>> +        (be16_to_cpu(dilp->di_core.di_magic) == XFS_DINODE_MAGIC) &&
>> +        (be16_to_cpu(dilp->di_core.di_flushiter) <
>> +            be16_to_cpu(dip->di_core.di_flushiter))) {
>> +        return 1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/*
>>   * Perform a 'normal' buffer recovery.  Each logged region of the
>>   * buffer should be copied over the corresponding region in the
>>   * given buffer.  The bitmap in the buf log format structure indicates
>> @@ -1917,6 +1938,13 @@ xlog_recover_do_reg_buffer(
>>                             -1, 0, XFS_QMOPT_DOWARN,
>>                             "dquot_buf_recover");
>>          }
>> +        /*
>> +         * Sanity check if this is an inode buffer.
>> +         */
>> +        if (!error)
>> +            error = xfs_recover_inode(xfs_buf_offset(bp,
>> +                    (uint)bit << XFS_BLI_SHIFT),
>> +                    item->ri_buf[i].i_addr);
>>          if (!error)
>>              memcpy(xfs_buf_offset(bp,
>>                  (uint)bit << XFS_BLI_SHIFT),    /* dest */
>>
>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-08-30  8:29     ` Timothy Shimmin
@ 2007-08-30  8:51       ` Timothy Shimmin
  2007-08-31  2:22         ` Lachlan McIlroy
  2007-08-31  2:14       ` Lachlan McIlroy
  1 sibling, 1 reply; 21+ messages in thread
From: Timothy Shimmin @ 2007-08-30  8:51 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: xfs-dev, xfs-oss

Timothy Shimmin wrote:
>>>   But I'm not sure this is an error...
>>>   Hmmmm...I'm a bit confused.
>>>   So you are _almost_ combining an error check with a flushiter check?
>>>   If one buffer is an inode magic# and the other isn't then we
>>>   have an error right - and could report it - but we are not doing 
>>> that here.
>> Not exactly.  If what's on disk is not an inode but the log item is
>> then that could be because we haven't written the inode to disk yet
>> and we need to perform recovery.
> Yeah, I was thinking about that afterward.
> The item's format which gives the blk# for the buf to read could
> be a block which hasn't been used for an inode yet.
> 
Well, if what's on disk is not an inode but some other data
and it happens to have the inode magic# which is remotely possible,
then we are making a bad assumption.
i.e. if we're not sure what the block/buffer should be, then testing the
MAGIC# isn't a guarantee it's an inode then.
Well not for the freeing of inode clusters case I would assume.
Or am I missing something?

--Tim

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-08-30  4:31 ` Timothy Shimmin
  2007-08-30  4:50   ` Lachlan McIlroy
@ 2007-08-30 14:02   ` David Chinner
  1 sibling, 0 replies; 21+ messages in thread
From: David Chinner @ 2007-08-30 14:02 UTC (permalink / raw)
  To: Timothy Shimmin; +Cc: Lachlan McIlroy, xfs-dev, xfs-oss

On Thu, Aug 30, 2007 at 02:31:11PM +1000, Timothy Shimmin wrote:
> Lachlan McIlroy wrote:
> >Log replay of clustered inodes currently ignores the flushiter
> >field in the inode that is used to determine if the on-disk inode
> >is more up to date than the copy in the log.  As a result during
> >log replay the newer inode is being overwritten with an older
> >version and file size updates are being lost.
> >
> >I haven't handled the case of the flushiter counter overflowing
> >but that shouldn't be a problem in this case.  The log buffer
> >contains newly created inodes so their flushiter values will be 0
> >and the on-disk inodes should not be much greater.
> >
> >Lachlan
> >
> 
> Still would want to understand why blf_flags doesn't have
> XFS_BLI_INODE_ALLOC_BUF set and so we could test that
> - I didn't understand Dave's (dgc) response about that earlier.

I never commented on why or why not that flag isn't set in the log
item. FWICT, it's not set because it is purely an in-memory flag.
State that gets logged gets put in bli_format.blf_flags (e.g.
XFS_BLI_FORMAT) whereas XFS_BLI_INODE_ALLOC_BUF is only ever
placed in bli_flags....

The function of that flag is to prevent the logged inode buffer from being
moved forward in the AIL so that it is always replayed at the time of
allocation so that subsequent transactions that modify the inodes and inode
buffers have a correctly allocated inode buffer to apply changes to during
recovery. And if the tail of the log moves past the allocation transaction,
it is guaranteed that the inodes can be read from disk so that recovery
has a correctly allocated inode buffer to apply changes to....

That being said, the flag could probably be propagated into the blf_flags
for (only) the allocation transaction if it makes discovery of this type
of buffer during recovery more reliable....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-08-30  8:29     ` Timothy Shimmin
  2007-08-30  8:51       ` Timothy Shimmin
@ 2007-08-31  2:14       ` Lachlan McIlroy
  1 sibling, 0 replies; 21+ messages in thread
From: Lachlan McIlroy @ 2007-08-31  2:14 UTC (permalink / raw)
  To: Timothy Shimmin; +Cc: xfs-dev, xfs-oss

Timothy Shimmin wrote:
> Lachlan McIlroy wrote:
>> Timothy Shimmin wrote:
>>> Lachlan McIlroy wrote:
>>>> Log replay of clustered inodes currently ignores the flushiter
>>>> field in the inode that is used to determine if the on-disk inode
>>>> is more up to date than the copy in the log.  As a result during
>>>> log replay the newer inode is being overwritten with an older
>>>> version and file size updates are being lost.
>>>>
>>>> I haven't handled the case of the flushiter counter overflowing
>>>> but that shouldn't be a problem in this case.  The log buffer
>>>> contains newly created inodes so their flushiter values will be 0
>>>> and the on-disk inodes should not be much greater.
>>>>
>>>> Lachlan
>>>>
>>>
>>> Still would want to understand why blf_flags doesn't have
>>> XFS_BLI_INODE_ALLOC_BUF set and so we could test that
>>> - I didn't understand Dave's (dgc) response about that earlier.
>>> But anyway...:)
>>>
>>>
>>> Just some nit-picks:
>>>
>>> * instead of a comment and then calling xfs_recover_inode
>>>   why don't we just give the function a better name.
>>>   When I see the name I think it is going to recover the inode
>>>   but instead it is just a flush_iter check and matching
>>>   the magic#s.
>> What would you like the function to be called?
>>
> Dunno :)
> xfs_flushiter_check
> 
>>>
>>> * My style preference is for option 2 over option 1:
>>>   1.
>>>     if (a) {
>>>        return 1;
>>>     }
>>>     return 0;
>>>
>>>   2.
>>>     return a;
>>>
>>>   But I understand the predicate is long and sometimes people like
>>>   to add tracing/debug for the alternate paths in which case (1)
>>>   is handy for that. I just like simplicity.
>> I can change that.  It just looks odd when the body of the function
>> is after the return statement.
>>
> It doesn't look funny to me :-)
> But I'm not really fussed - do what you prefer.
> 
>>>
>>> * Also it is funny to return a boolean for error in the call.
>>>   Although, I see that is consistent with the rest of the function.
>> Yep, keeping consistent with coding standards.
> Well consistent with that function at least :)
> 
>>
>>>   But I'm not sure this is an error...
>>>   Hmmmm...I'm a bit confused.
>>>   So you are _almost_ combining an error check with a flushiter check?
>>>   If one buffer is an inode magic# and the other isn't then we
>>>   have an error right - and could report it - but we are not doing 
>>> that here.
>> Not exactly.  If what's on disk is not an inode but the log item is
>> then that could be because we haven't written the inode to disk yet
>> and we need to perform recovery.
> Yeah, I was thinking about that afterward.
> The item's format which gives the blk# for the buf to read could
> be a block which hasn't been used for an inode yet.
> 
> OOI,
> Say the inode block was freed in the past and we freshly allocate one using
> this inode buf item in replay, so that the old block has the inode magic#
> but it has an old large flushiter. We then say it is newer when
> really it is older.
Yeah that could be a problem.  Knew this flushiter thing was a hack - if
we logged every inode change we wouldn't have to worry what is on disk we
could just play back the log.

> We could check for zero mode or different gen#.
That might work, thanks.  I suppose the existing flushiter check will need
this extra sanity too.

> Or perhaps we reset the flushiter when we free the inode or
> do we go thru a different replay path.
Not sure how this would work.

> 
> 
>  > If what's on disk is an inode but
>> the log item is not an inode then what we are about to recover is not
>> an inode and we abort the check.
> Well yes we are only interested in inode buf items.
> 
>> If neither on-disk nor log item are
>> inodes then we've got no business continuing the check.
> Well as above it aint even an inode in the log.
> 
>>
>>>   If we have an inode buf and the flushiter on the item is older than 
>>> the ondisk buffer
>>>   one then it is not an error but we just don't want to recover it.
>> Exactly.
>>
> So its not an error then yet we are calling it that.
> So perhaps we could have:
> 
> if (!error && !old_inode_item) {
> ....
> 
> --Tim
> 
> 
> 
> 
> 
> 
>>> --- fs/xfs/xfs_log_recover.c_1.322    2007-08-27 17:45:45.000000000 
>>> +1000
>>> +++ fs/xfs/xfs_log_recover.c    2007-08-30 11:50:44.000000000 +1000
>>> @@ -1866,6 +1866,27 @@ xlog_recover_do_inode_buffer(
>>>  }
>>>
>>>  /*
>>> + * Check if we need to recover an inode from a buffer
>>> + */
>>> +int
>>> +xfs_recover_inode(
>>> +    char    *dest,
>>> +    char    *src)
>>> +{
>>> +    xfs_dinode_t    *dip = (xfs_dinode_t *)dest;
>>> +    xfs_dinode_t    *dilp = (xfs_dinode_t*)src;
>>> +
>>> +    if ((be16_to_cpu(dip->di_core.di_magic) == XFS_DINODE_MAGIC) &&
>>> +        (be16_to_cpu(dilp->di_core.di_magic) == XFS_DINODE_MAGIC) &&
>>> +        (be16_to_cpu(dilp->di_core.di_flushiter) <
>>> +            be16_to_cpu(dip->di_core.di_flushiter))) {
>>> +        return 1;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +/*
>>>   * Perform a 'normal' buffer recovery.  Each logged region of the
>>>   * buffer should be copied over the corresponding region in the
>>>   * given buffer.  The bitmap in the buf log format structure indicates
>>> @@ -1917,6 +1938,13 @@ xlog_recover_do_reg_buffer(
>>>                             -1, 0, XFS_QMOPT_DOWARN,
>>>                             "dquot_buf_recover");
>>>          }
>>> +        /*
>>> +         * Sanity check if this is an inode buffer.
>>> +         */
>>> +        if (!error)
>>> +            error = xfs_recover_inode(xfs_buf_offset(bp,
>>> +                    (uint)bit << XFS_BLI_SHIFT),
>>> +                    item->ri_buf[i].i_addr);
>>>          if (!error)
>>>              memcpy(xfs_buf_offset(bp,
>>>                  (uint)bit << XFS_BLI_SHIFT),    /* dest */
>>>
>>>
> 
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-08-30  8:51       ` Timothy Shimmin
@ 2007-08-31  2:22         ` Lachlan McIlroy
  2007-08-31  4:01           ` Mark Goodwin
  0 siblings, 1 reply; 21+ messages in thread
From: Lachlan McIlroy @ 2007-08-31  2:22 UTC (permalink / raw)
  To: Timothy Shimmin; +Cc: xfs-dev, xfs-oss

Timothy Shimmin wrote:
> Timothy Shimmin wrote:
>>>>   But I'm not sure this is an error...
>>>>   Hmmmm...I'm a bit confused.
>>>>   So you are _almost_ combining an error check with a flushiter check?
>>>>   If one buffer is an inode magic# and the other isn't then we
>>>>   have an error right - and could report it - but we are not doing 
>>>> that here.
>>> Not exactly.  If what's on disk is not an inode but the log item is
>>> then that could be because we haven't written the inode to disk yet
>>> and we need to perform recovery.
>> Yeah, I was thinking about that afterward.
>> The item's format which gives the blk# for the buf to read could
>> be a block which hasn't been used for an inode yet.
>>
> Well, if what's on disk is not an inode but some other data
> and it happens to have the inode magic# which is remotely possible,
> then we are making a bad assumption.
> i.e. if we're not sure what the block/buffer should be, then testing the
> MAGIC# isn't a guarantee it's an inode then.
> Well not for the freeing of inode clusters case I would assume.
> Or am I missing something?
I don't think you're missing anything!

You're right though - a magic number check is no guarantee.  On the same
vein, adding a generation number check isn't much better.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-08-31  2:22         ` Lachlan McIlroy
@ 2007-08-31  4:01           ` Mark Goodwin
  2007-08-31 15:48             ` David Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Mark Goodwin @ 2007-08-31  4:01 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: Timothy Shimmin, xfs-dev, xfs-oss



Lachlan McIlroy wrote:
> Timothy Shimmin wrote:
>> Timothy Shimmin wrote:
>>>>>   But I'm not sure this is an error...
>>>>>   Hmmmm...I'm a bit confused.
>>>>>   So you are _almost_ combining an error check with a flushiter check?
>>>>>   If one buffer is an inode magic# and the other isn't then we
>>>>>   have an error right - and could report it - but we are not doing 
>>>>> that here.
>>>> Not exactly.  If what's on disk is not an inode but the log item is
>>>> then that could be because we haven't written the inode to disk yet
>>>> and we need to perform recovery.
>>> Yeah, I was thinking about that afterward.
>>> The item's format which gives the blk# for the buf to read could
>>> be a block which hasn't been used for an inode yet.
>>>
>> Well, if what's on disk is not an inode but some other data
>> and it happens to have the inode magic# which is remotely possible,
>> then we are making a bad assumption.
>> i.e. if we're not sure what the block/buffer should be, then testing the
>> MAGIC# isn't a guarantee it's an inode then.
>> Well not for the freeing of inode clusters case I would assume.
>> Or am I missing something?
> I don't think you're missing anything!
> 
> You're right though - a magic number check is no guarantee.  On the same
> vein, adding a generation number check isn't much better.

unlink will have to invalidate the on-disk inode magic number? Or only
when the whole cluster is free'd?

-- Mark

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-08-31  4:01           ` Mark Goodwin
@ 2007-08-31 15:48             ` David Chinner
  2007-09-02 22:50               ` Vlad Apostolov
  2007-09-07  2:03               ` Lachlan McIlroy
  0 siblings, 2 replies; 21+ messages in thread
From: David Chinner @ 2007-08-31 15:48 UTC (permalink / raw)
  To: Mark Goodwin; +Cc: Lachlan McIlroy, Timothy Shimmin, xfs-dev, xfs-oss

On Fri, Aug 31, 2007 at 02:01:37PM +1000, Mark Goodwin wrote:
> Lachlan McIlroy wrote:
> >Timothy Shimmin wrote:
> >>Timothy Shimmin wrote:
> >>>>>  But I'm not sure this is an error...
> >>>>>  Hmmmm...I'm a bit confused.
> >>>>>  So you are _almost_ combining an error check with a flushiter check?
> >>>>>  If one buffer is an inode magic# and the other isn't then we
> >>>>>  have an error right - and could report it - but we are not doing 
> >>>>>that here.
> >>>>Not exactly.  If what's on disk is not an inode but the log item is
> >>>>then that could be because we haven't written the inode to disk yet
> >>>>and we need to perform recovery.
> >>>Yeah, I was thinking about that afterward.
> >>>The item's format which gives the blk# for the buf to read could
> >>>be a block which hasn't been used for an inode yet.
> >>>
> >>Well, if what's on disk is not an inode but some other data
> >>and it happens to have the inode magic# which is remotely possible,
> >>then we are making a bad assumption.
> >>i.e. if we're not sure what the block/buffer should be, then testing the
> >>MAGIC# isn't a guarantee it's an inode then.
> >>Well not for the freeing of inode clusters case I would assume.
> >>Or am I missing something?
> >I don't think you're missing anything!
> >
> >You're right though - a magic number check is no guarantee.  On the same
> >vein, adding a generation number check isn't much better.
> 
> unlink will have to invalidate the on-disk inode magic number? Or only
> when the whole cluster is free'd?

An unlinked inode is only detectable by the mode parameter being zero.
The rest of the inode will look valid.

To detect the difference between a newly allocated inode *chunk*
that has been written to and a stale inode chunk that we have
just allocated and not written to yet, you need to walk every inode
in the chunk and determine if the mode parameter is zero in every
inode.

If the mode is zero for all inodes and there are generation numbers
that are not zero, then you've detected a stale buffer and you should
replay the inode cluster buffer initialisation.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-08-31 15:48             ` David Chinner
@ 2007-09-02 22:50               ` Vlad Apostolov
  2007-09-03  8:49                 ` David Chinner
  2007-09-07  2:03               ` Lachlan McIlroy
  1 sibling, 1 reply; 21+ messages in thread
From: Vlad Apostolov @ 2007-09-02 22:50 UTC (permalink / raw)
  To: David Chinner
  Cc: Mark Goodwin, Lachlan McIlroy, Timothy Shimmin, xfs-dev, xfs-oss

David Chinner wrote:
> An unlinked inode is only detectable by the mode parameter being zero.
> The rest of the inode will look valid.
What about the nlink count? Can't we detect unlinked inode by nlink == 0?

Regards,
Vlad

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-09-02 22:50               ` Vlad Apostolov
@ 2007-09-03  8:49                 ` David Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: David Chinner @ 2007-09-03  8:49 UTC (permalink / raw)
  To: Vlad Apostolov
  Cc: David Chinner, Mark Goodwin, Lachlan McIlroy, Timothy Shimmin,
	xfs-dev, xfs-oss

On Mon, Sep 03, 2007 at 08:50:51AM +1000, Vlad Apostolov wrote:
> David Chinner wrote:
> >An unlinked inode is only detectable by the mode parameter being zero.
> >The rest of the inode will look valid.
> What about the nlink count? Can't we detect unlinked inode by nlink == 0?

A file that is open but unlinked must still remain valid on disk
until the last reference goes away. This inode will have a link
count of zero, but it is still a valid, referenced inode. Hence
detection of an inode with a link count of zero does not mean it
has been fully removed yet - a zero link count and a non-zero
mode indicates the inode should be onthe unlinked inode list.

See xfs_ifree() - the mode is zeroed only after the inode has been
pulled from the AGI unlinked list and marked free in the AGI btree.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-08-30  2:12 [PATCH] log replay should not overwrite newer ondisk inodes Lachlan McIlroy
  2007-08-30  4:31 ` Timothy Shimmin
@ 2007-09-04 23:05 ` Shailendra Tripathi
  2007-09-04 23:49   ` David Chinner
  2007-09-05  1:19   ` Timothy Shimmin
  1 sibling, 2 replies; 21+ messages in thread
From: Shailendra Tripathi @ 2007-09-04 23:05 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: xfs-dev, xfs-oss

Hi,
      Can someone explain how not checking the flushiter can losse 
filesize updates.
Let me the take the case mentioned here in the fix statement:

a. Clustered inode create -  flush iter - X( 0)
b. size update  --> flush iter --> Y

X and Y will always hold as: X <= Y, that is, it is not possible to have 
X >Y (unless size update is non -transactional. As far as I know, size 
update is always transactional.)

There are 2 cases here:
a) log of Y reached to the disk  --> 1) inode with flush iter was 
reached 2) inode didn't make.
b) log of Y didn't reach the disk --> flush_iter Y should have never 
reached disk

In none of cases, I can see the possibility that size update can be lost 
becuase of replaying of the logs in the sequential order. If Log of Y 
didn't reach, does it not make sense to have the flush_iter and size 
correctly set to the last known transaction on the disk. To me, it 
appears unsafe to do as the inode state will not match the state of the 
last known transaction after recovery.

Regards,
Shailendra
Lachlan McIlroy wrote:
> Log replay of clustered inodes currently ignores the flushiter
> field in the inode that is used to determine if the on-disk inode
> is more up to date than the copy in the log.  As a result during
> log replay the newer inode is being overwritten with an older
> version and file size updates are being lost.
>
> I haven't handled the case of the flushiter counter overflowing
> but that shouldn't be a problem in this case.  The log buffer
> contains newly created inodes so their flushiter values will be 0
> and the on-disk inodes should not be much greater.
>
> Lachlan
> ------------------------------------------------------------------------
>
> --- fs/xfs/xfs_log_recover.c_1.322	2007-08-27 17:45:45.000000000 +1000
> +++ fs/xfs/xfs_log_recover.c	2007-08-30 11:50:44.000000000 +1000
> @@ -1866,6 +1866,27 @@ xlog_recover_do_inode_buffer(
>  }
>  
>  /*
> + * Check if we need to recover an inode from a buffer
> + */
> +int
> +xfs_recover_inode(
> +	char	*dest,
> +	char	*src)
> +{
> +	xfs_dinode_t	*dip = (xfs_dinode_t *)dest;
> +	xfs_dinode_t	*dilp = (xfs_dinode_t*)src;
> +
> +	if ((be16_to_cpu(dip->di_core.di_magic) == XFS_DINODE_MAGIC) &&
> +		(be16_to_cpu(dilp->di_core.di_magic) == XFS_DINODE_MAGIC) &&
> +		(be16_to_cpu(dilp->di_core.di_flushiter) <
> +			be16_to_cpu(dip->di_core.di_flushiter))) {
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
>   * Perform a 'normal' buffer recovery.  Each logged region of the
>   * buffer should be copied over the corresponding region in the
>   * given buffer.  The bitmap in the buf log format structure indicates
> @@ -1917,6 +1938,13 @@ xlog_recover_do_reg_buffer(
>  					       -1, 0, XFS_QMOPT_DOWARN,
>  					       "dquot_buf_recover");
>  		}
> +		/*
> +		 * Sanity check if this is an inode buffer.
> +		 */
> +		if (!error)
> +			error = xfs_recover_inode(xfs_buf_offset(bp,
> +					(uint)bit << XFS_BLI_SHIFT),
> +					item->ri_buf[i].i_addr);
>  		if (!error)
>  			memcpy(xfs_buf_offset(bp,
>  				(uint)bit << XFS_BLI_SHIFT),	/* dest */
>   

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-09-04 23:05 ` Shailendra Tripathi
@ 2007-09-04 23:49   ` David Chinner
  2007-09-04 23:51     ` David Chinner
  2007-09-05  1:19   ` Timothy Shimmin
  1 sibling, 1 reply; 21+ messages in thread
From: David Chinner @ 2007-09-04 23:49 UTC (permalink / raw)
  To: Shailendra Tripathi; +Cc: Lachlan McIlroy, xfs-dev, xfs-oss

On Tue, Sep 04, 2007 at 04:05:06PM -0700, Shailendra Tripathi wrote:
> Hi,
>      Can someone explain how not checking the flushiter can losse 
> filesize updates.
> Let me the take the case mentioned here in the fix statement:
> 
> a. Clustered inode create -  flush iter - X( 0)
> b. size update  --> flush iter --> Y
> 
> X and Y will always hold as: X <= Y, that is, it is not possible to have 
> X >Y (unless size update is non -transactional. As far as I know, size 
> update is always transactional.)

Size updates are not transactional. They are accidentally logged in
other transactions (e.g. truncates) but size updates due to data
writes are never logged. When we write an inode, we take the latest
in memory size and write it to disk without logging that change
so we can't rely on log recovery to get it right simply by replaying
transactions.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-09-04 23:49   ` David Chinner
@ 2007-09-04 23:51     ` David Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: David Chinner @ 2007-09-04 23:51 UTC (permalink / raw)
  To: David Chinner; +Cc: Shailendra Tripathi, Lachlan McIlroy, xfs-dev, xfs-oss

On Wed, Sep 05, 2007 at 09:49:26AM +1000, David Chinner wrote:
> On Tue, Sep 04, 2007 at 04:05:06PM -0700, Shailendra Tripathi wrote:
> > Hi,
> >      Can someone explain how not checking the flushiter can losse 
> > filesize updates.
> > Let me the take the case mentioned here in the fix statement:
> > 
> > a. Clustered inode create -  flush iter - X( 0)
> > b. size update  --> flush iter --> Y
> > 
> > X and Y will always hold as: X <= Y, that is, it is not possible to have 
> > X >Y (unless size update is non -transactional. As far as I know, size 
> > update is always transactional.)
> 
> Size updates are not transactional. They are accidentally logged in

s/accidentally/intentionally/

> other transactions (e.g. truncates) but size updates due to data
> writes are never logged. When we write an inode, we take the latest
> in memory size and write it to disk without logging that change
> so we can't rely on log recovery to get it right simply by replaying
> transactions.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group

-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-09-04 23:05 ` Shailendra Tripathi
  2007-09-04 23:49   ` David Chinner
@ 2007-09-05  1:19   ` Timothy Shimmin
  2007-09-05  1:40     ` Lachlan McIlroy
  1 sibling, 1 reply; 21+ messages in thread
From: Timothy Shimmin @ 2007-09-05  1:19 UTC (permalink / raw)
  To: Shailendra Tripathi; +Cc: Lachlan McIlroy, xfs-dev, xfs-oss

Shailendra Tripathi wrote:
> Hi,
>      Can someone explain how not checking the flushiter can losse 
> filesize updates.
> Let me the take the case mentioned here in the fix statement:
> 
> a. Clustered inode create -  flush iter - X( 0)
> b. size update  --> flush iter --> Y
> 
> X and Y will always hold as: X <= Y, that is, it is not possible to have 
> X >Y (unless size update is non -transactional. As far as I know, size 
> update is always transactional.)
> 
> There are 2 cases here:
> a) log of Y reached to the disk  --> 1) inode with flush iter was 
> reached 2) inode didn't make.
> b) log of Y didn't reach the disk --> flush_iter Y should have never 
> reached disk
> 
> In none of cases, I can see the possibility that size update can be lost 
> becuase of replaying of the logs in the sequential order. If Log of Y 
> didn't reach, does it not make sense to have the flush_iter and size 
> correctly set to the last known transaction on the disk. To me, it 
> appears unsafe to do as the inode state will not match the state of the 
> last known transaction after recovery.
> 
> Regards,
> Shailendra

Dave answered this but yes this is a case where we are breaking
the transaction model IMO. And my understanding is that we are doing
this for performance reasons.
One of Lachlan's proposals (IIRC) was to log the size change before the
ondisk size change in xfs_aops.c/xfs_setfilesize()
and thus follow the model but questions were raised about introducing
performance overhead of log traffic in the write path.

--Tim

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-09-05  1:19   ` Timothy Shimmin
@ 2007-09-05  1:40     ` Lachlan McIlroy
  2007-09-05  6:54       ` David Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Lachlan McIlroy @ 2007-09-05  1:40 UTC (permalink / raw)
  To: Timothy Shimmin; +Cc: Shailendra Tripathi, xfs-dev, xfs-oss

Timothy Shimmin wrote:
> Shailendra Tripathi wrote:
>> Hi,
>>      Can someone explain how not checking the flushiter can losse 
>> filesize updates.
>> Let me the take the case mentioned here in the fix statement:
>>
>> a. Clustered inode create -  flush iter - X( 0)
>> b. size update  --> flush iter --> Y
>>
>> X and Y will always hold as: X <= Y, that is, it is not possible to 
>> have X >Y (unless size update is non -transactional. As far as I know, 
>> size update is always transactional.)
>>
>> There are 2 cases here:
>> a) log of Y reached to the disk  --> 1) inode with flush iter was 
>> reached 2) inode didn't make.
>> b) log of Y didn't reach the disk --> flush_iter Y should have never 
>> reached disk
>>
>> In none of cases, I can see the possibility that size update can be 
>> lost becuase of replaying of the logs in the sequential order. If Log 
>> of Y didn't reach, does it not make sense to have the flush_iter and 
>> size correctly set to the last known transaction on the disk. To me, 
>> it appears unsafe to do as the inode state will not match the state of 
>> the last known transaction after recovery.
>>
>> Regards,
>> Shailendra
> 
> Dave answered this but yes this is a case where we are breaking
> the transaction model IMO. And my understanding is that we are doing
> this for performance reasons.
> One of Lachlan's proposals (IIRC) was to log the size change before the
> ondisk size change in xfs_aops.c/xfs_setfilesize()
> and thus follow the model but questions were raised about introducing
> performance overhead of log traffic in the write path.
> 

It would make life easier to do it that way - we wouldn't have to check
the flushiter field of the ondisk inode because we know we will end up
with the same thing by just replaying the log.  But since the addition
of the flushiter stuff pre-dates the NULL files changes there must be
another reason we need it.  Previously the size change would get logged
with an extent allocation transaction but extending a file within the
same last block would not send the new file size to the log.  I think
that may have been the reason for needing the flushiter stuff.  If we
log the file size change in xfs_setfilesize() we may not need the
flushiter stuff at all, hmmm maybe for timestamp updates?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-09-05  1:40     ` Lachlan McIlroy
@ 2007-09-05  6:54       ` David Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: David Chinner @ 2007-09-05  6:54 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: Timothy Shimmin, Shailendra Tripathi, xfs-dev, xfs-oss

On Wed, Sep 05, 2007 at 11:40:22AM +1000, Lachlan McIlroy wrote:
> >Dave answered this but yes this is a case where we are breaking
> >the transaction model IMO. And my understanding is that we are doing
> >this for performance reasons.
> >One of Lachlan's proposals (IIRC) was to log the size change before the
> >ondisk size change in xfs_aops.c/xfs_setfilesize()
> >and thus follow the model but questions were raised about introducing
> >performance overhead of log traffic in the write path.
> 
> It would make life easier to do it that way - we wouldn't have to check
> the flushiter field of the ondisk inode because we know we will end up
> with the same thing by just replaying the log.  But since the addition
> of the flushiter stuff pre-dates the NULL files changes there must be
> another reason we need it.  Previously the size change would get logged
> with an extent allocation transaction but extending a file within the
> same last block would not send the new file size to the log.  I think
> that may have been the reason for needing the flushiter stuff.  If we
> log the file size change in xfs_setfilesize() we may not need the
> flushiter stuff at all, hmmm maybe for timestamp updates?

atime, size updates and inode version format conversion may be written without
a transaction first logging the change.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-08-31 15:48             ` David Chinner
  2007-09-02 22:50               ` Vlad Apostolov
@ 2007-09-07  2:03               ` Lachlan McIlroy
  2007-09-07 14:05                 ` David Chinner
  1 sibling, 1 reply; 21+ messages in thread
From: Lachlan McIlroy @ 2007-09-07  2:03 UTC (permalink / raw)
  To: David Chinner; +Cc: Mark Goodwin, Timothy Shimmin, xfs-dev, xfs-oss

[-- Attachment #1: Type: text/plain, Size: 2919 bytes --]

David Chinner wrote:
> On Fri, Aug 31, 2007 at 02:01:37PM +1000, Mark Goodwin wrote:
>> Lachlan McIlroy wrote:
>>> Timothy Shimmin wrote:
>>>> Timothy Shimmin wrote:
>>>>>>>  But I'm not sure this is an error...
>>>>>>>  Hmmmm...I'm a bit confused.
>>>>>>>  So you are _almost_ combining an error check with a flushiter check?
>>>>>>>  If one buffer is an inode magic# and the other isn't then we
>>>>>>>  have an error right - and could report it - but we are not doing 
>>>>>>> that here.
>>>>>> Not exactly.  If what's on disk is not an inode but the log item is
>>>>>> then that could be because we haven't written the inode to disk yet
>>>>>> and we need to perform recovery.
>>>>> Yeah, I was thinking about that afterward.
>>>>> The item's format which gives the blk# for the buf to read could
>>>>> be a block which hasn't been used for an inode yet.
>>>>>
>>>> Well, if what's on disk is not an inode but some other data
>>>> and it happens to have the inode magic# which is remotely possible,
>>>> then we are making a bad assumption.
>>>> i.e. if we're not sure what the block/buffer should be, then testing the
>>>> MAGIC# isn't a guarantee it's an inode then.
>>>> Well not for the freeing of inode clusters case I would assume.
>>>> Or am I missing something?
>>> I don't think you're missing anything!
>>>
>>> You're right though - a magic number check is no guarantee.  On the same
>>> vein, adding a generation number check isn't much better.
>> unlink will have to invalidate the on-disk inode magic number? Or only
>> when the whole cluster is free'd?
> 
> An unlinked inode is only detectable by the mode parameter being zero.
> The rest of the inode will look valid.
> 
> To detect the difference between a newly allocated inode *chunk*
> that has been written to and a stale inode chunk that we have
> just allocated and not written to yet, you need to walk every inode
> in the chunk and determine if the mode parameter is zero in every
> inode.
> 
> If the mode is zero for all inodes and there are generation numbers
> that are not zero, then you've detected a stale buffer and you should
> replay the inode cluster buffer initialisation.
> 

Thanks for this info Dave.  I looked into it and came up with a solution
that looks at the ondisk inode buffer and determines if it has been
written to since being logged.  It iterates through all the inodes and
checks each one with:

- if the magic number is wrong the buffer is stale
- if the mode is non-zero then the buffer is newer than the log
- if the mode is zero and the generation count is non-zero then the
   buffer is stale

If the end result is a stale buffer then the buffer is replayed otherwise
it is skipped.  I added a new flag that gets logged with a new inode
cluster so that we can identify a buffer of inodes from something else.
This fix is passing all the tests we have.  Is this a better approach
than the last fix?

Lachlan

[-- Attachment #2: xfs_log_recover.diff --]
[-- Type: text/x-patch, Size: 2760 bytes --]

--- fs/xfs/xfs_buf_item.h_1.44	2007-09-04 13:38:24.000000000 +1000
+++ fs/xfs/xfs_buf_item.h	2007-09-06 12:06:39.000000000 +1000
@@ -52,6 +52,11 @@ typedef struct xfs_buf_log_format_t {
 #define	XFS_BLI_UDQUOT_BUF	0x4
 #define XFS_BLI_PDQUOT_BUF	0x8
 #define	XFS_BLI_GDQUOT_BUF	0x10
+/*
+ * This flag indicates that the buffer contains newly allocated
+ * inodes.
+ */
+#define	XFS_BLI_INODE_NEW_BUF	0x20
 
 #define	XFS_BLI_CHUNK		128
 #define	XFS_BLI_SHIFT		7
--- fs/xfs/xfs_log_recover.c_1.322	2007-08-27 17:45:45.000000000 +1000
+++ fs/xfs/xfs_log_recover.c	2007-09-07 10:41:38.000000000 +1000
@@ -1874,6 +1874,7 @@ xlog_recover_do_inode_buffer(
 /*ARGSUSED*/
 STATIC void
 xlog_recover_do_reg_buffer(
+	xfs_mount_t		*mp,
 	xlog_recover_item_t	*item,
 	xfs_buf_t		*bp,
 	xfs_buf_log_format_t	*buf_f)
@@ -1884,6 +1885,30 @@ xlog_recover_do_reg_buffer(
 	unsigned int		*data_map = NULL;
 	unsigned int		map_size = 0;
 	int                     error;
+	int			stale_buf = 1;
+
+	if (buf_f->blf_flags & XFS_BLI_INODE_NEW_BUF) {
+		xfs_dinode_t    *dip;
+		int             inodes_per_buf;
+
+		stale_buf = 0;
+		inodes_per_buf = XFS_BUF_COUNT(bp) >> mp->m_sb.sb_inodelog;
+		for (i = 0; i < inodes_per_buf; i++) {
+			dip = (xfs_dinode_t *)xfs_buf_offset(bp,
+				i * mp->m_sb.sb_inodesize);
+			if (be16_to_cpu(dip->di_core.di_magic) !=
+					XFS_DINODE_MAGIC) {
+				stale_buf = 1;
+				break;
+			}
+			if (be16_to_cpu(dip->di_core.di_mode))
+				break;
+			if (be16_to_cpu(dip->di_core.di_gen)) {
+				stale_buf = 1;
+				break;
+			}
+		}
+	}
 
 	switch (buf_f->blf_type) {
 	case XFS_LI_BUF:
@@ -1917,7 +1942,7 @@ xlog_recover_do_reg_buffer(
 					       -1, 0, XFS_QMOPT_DOWARN,
 					       "dquot_buf_recover");
 		}
-		if (!error)
+		if (!error && stale_buf)
 			memcpy(xfs_buf_offset(bp,
 				(uint)bit << XFS_BLI_SHIFT),	/* dest */
 				item->ri_buf[i].i_addr,		/* source */
@@ -2089,7 +2114,7 @@ xlog_recover_do_dquot_buffer(
 	if (log->l_quotaoffs_flag & type)
 		return;
 
-	xlog_recover_do_reg_buffer(item, bp, buf_f);
+	xlog_recover_do_reg_buffer(mp, item, bp, buf_f);
 }
 
 /*
@@ -2190,7 +2215,7 @@ xlog_recover_do_buffer_trans(
 		  (XFS_BLI_UDQUOT_BUF|XFS_BLI_PDQUOT_BUF|XFS_BLI_GDQUOT_BUF)) {
 		xlog_recover_do_dquot_buffer(mp, log, item, bp, buf_f);
 	} else {
-		xlog_recover_do_reg_buffer(item, bp, buf_f);
+		xlog_recover_do_reg_buffer(mp, item, bp, buf_f);
 	}
 	if (error)
 		return XFS_ERROR(error);
--- fs/xfs/xfs_trans_buf.c_1.126	2007-09-04 13:38:27.000000000 +1000
+++ fs/xfs/xfs_trans_buf.c	2007-09-05 17:37:31.000000000 +1000
@@ -966,6 +966,7 @@ xfs_trans_inode_alloc_buf(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 
 	bip->bli_flags |= XFS_BLI_INODE_ALLOC_BUF;
+	bip->bli_format.blf_flags |= XFS_BLI_INODE_NEW_BUF;
 }
 
 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-09-07  2:03               ` Lachlan McIlroy
@ 2007-09-07 14:05                 ` David Chinner
  2007-09-10  4:43                   ` Lachlan McIlroy
  0 siblings, 1 reply; 21+ messages in thread
From: David Chinner @ 2007-09-07 14:05 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: Mark Goodwin, Timothy Shimmin, xfs-dev, xfs-oss

On Fri, Sep 07, 2007 at 12:03:00PM +1000, Lachlan McIlroy wrote:
> David Chinner wrote:
> >An unlinked inode is only detectable by the mode parameter being zero.
> >The rest of the inode will look valid.
> >
> >To detect the difference between a newly allocated inode *chunk*
> >that has been written to and a stale inode chunk that we have
> >just allocated and not written to yet, you need to walk every inode
> >in the chunk and determine if the mode parameter is zero in every
> >inode.
> >
> >If the mode is zero for all inodes and there are generation numbers
> >that are not zero, then you've detected a stale buffer and you should
> >replay the inode cluster buffer initialisation.
> >
> 
> Thanks for this info Dave.  I looked into it and came up with a solution
> that looks at the ondisk inode buffer and determines if it has been
> written to since being logged.  It iterates through all the inodes and
> checks each one with:
> 
> - if the magic number is wrong the buffer is stale

*nod*

> - if the mode is non-zero then the buffer is newer than the log

*nod*

> - if the mode is zero and the generation count is non-zero then the
>   buffer is stale

On second thoughts, this might be more complex - the buffer is stale
only if all inodes in the *chunk* have this pattern. We may have multiple
buffers to a chunk. e.g. allocate 55 inodes, they span two 8k cluster
buffers. Both meet the second criteria. Now remove the first 32 inodes,
and we have one buffer with zero allocated inodes but non-zero generation
numbers (i.e. thrid state) and one that meets the second criteria.

However, both buffers are more recent than the buffer being replayed.
We could lose generation count changes if we overwrite the buffer with
no inodes in it....

So I think the stale buffer criteria must be that all the inodes in the entire
inode chunk (i.e. across all buffers) must have a zero mode and at least one
of the inodes has a non-zero generation count. Does that make sense?

> If the end result is a stale buffer then the buffer is replayed otherwise
> it is skipped.  I added a new flag that gets logged with a new inode
> cluster so that we can identify a buffer of inodes from something else.
> This fix is passing all the tests we have.  Is this a better approach
> than the last fix?

Definitely, but I think our "stale buffer" detection needs more work.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] log replay should not overwrite newer ondisk inodes
  2007-09-07 14:05                 ` David Chinner
@ 2007-09-10  4:43                   ` Lachlan McIlroy
  0 siblings, 0 replies; 21+ messages in thread
From: Lachlan McIlroy @ 2007-09-10  4:43 UTC (permalink / raw)
  To: David Chinner; +Cc: Mark Goodwin, Timothy Shimmin, xfs-dev, xfs-oss

David Chinner wrote:
> On Fri, Sep 07, 2007 at 12:03:00PM +1000, Lachlan McIlroy wrote:
>> David Chinner wrote:
>>> An unlinked inode is only detectable by the mode parameter being zero.
>>> The rest of the inode will look valid.
>>>
>>> To detect the difference between a newly allocated inode *chunk*
>>> that has been written to and a stale inode chunk that we have
>>> just allocated and not written to yet, you need to walk every inode
>>> in the chunk and determine if the mode parameter is zero in every
>>> inode.
>>>
>>> If the mode is zero for all inodes and there are generation numbers
>>> that are not zero, then you've detected a stale buffer and you should
>>> replay the inode cluster buffer initialisation.
>>>
>> Thanks for this info Dave.  I looked into it and came up with a solution
>> that looks at the ondisk inode buffer and determines if it has been
>> written to since being logged.  It iterates through all the inodes and
>> checks each one with:
>>
>> - if the magic number is wrong the buffer is stale
> 
> *nod*
> 
>> - if the mode is non-zero then the buffer is newer than the log
> 
> *nod*
> 
>> - if the mode is zero and the generation count is non-zero then the
>>   buffer is stale
> 
> On second thoughts, this might be more complex - the buffer is stale
> only if all inodes in the *chunk* have this pattern. We may have multiple
> buffers to a chunk. e.g. allocate 55 inodes, they span two 8k cluster
> buffers. Both meet the second criteria. Now remove the first 32 inodes,
> and we have one buffer with zero allocated inodes but non-zero generation
> numbers (i.e. thrid state) and one that meets the second criteria.
> 
> However, both buffers are more recent than the buffer being replayed.
> We could lose generation count changes if we overwrite the buffer with
> no inodes in it....
> 
> So I think the stale buffer criteria must be that all the inodes in the entire
> inode chunk (i.e. across all buffers) must have a zero mode and at least one
> of the inodes has a non-zero generation count. Does that make sense?

Yeah, that makes sense.  Is inode cluster I/O done in size of chunks then?
So we are trying to determine if a chunk has been written since any of the
buffers were logged.  Is it possible that only some of the buffers are
logged before the chunk is written to disk and the rest of the buffers are
logged after?

> 
>> If the end result is a stale buffer then the buffer is replayed otherwise
>> it is skipped.  I added a new flag that gets logged with a new inode
>> cluster so that we can identify a buffer of inodes from something else.
>> This fix is passing all the tests we have.  Is this a better approach
>> than the last fix?
> 
> Definitely, but I think our "stale buffer" detection needs more work.
> 
> Cheers,
> 
> Dave.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2007-09-10  4:41 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-08-30  2:12 [PATCH] log replay should not overwrite newer ondisk inodes Lachlan McIlroy
2007-08-30  4:31 ` Timothy Shimmin
2007-08-30  4:50   ` Lachlan McIlroy
2007-08-30  8:29     ` Timothy Shimmin
2007-08-30  8:51       ` Timothy Shimmin
2007-08-31  2:22         ` Lachlan McIlroy
2007-08-31  4:01           ` Mark Goodwin
2007-08-31 15:48             ` David Chinner
2007-09-02 22:50               ` Vlad Apostolov
2007-09-03  8:49                 ` David Chinner
2007-09-07  2:03               ` Lachlan McIlroy
2007-09-07 14:05                 ` David Chinner
2007-09-10  4:43                   ` Lachlan McIlroy
2007-08-31  2:14       ` Lachlan McIlroy
2007-08-30 14:02   ` David Chinner
2007-09-04 23:05 ` Shailendra Tripathi
2007-09-04 23:49   ` David Chinner
2007-09-04 23:51     ` David Chinner
2007-09-05  1:19   ` Timothy Shimmin
2007-09-05  1:40     ` Lachlan McIlroy
2007-09-05  6:54       ` David Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox