public inbox for linux-xfs@vger.kernel.org
From: Lachlan McIlroy <lachlan@sgi.com>
To: David Chinner <dgc@sgi.com>
Cc: Mark Goodwin <markgw@sgi.com>, Timothy Shimmin <tes@sgi.com>,
	xfs-dev <xfs-dev@sgi.com>, xfs-oss <xfs@oss.sgi.com>
Subject: Re: [PATCH] log replay should not overwrite newer ondisk inodes
Date: Fri, 07 Sep 2007 12:03:00 +1000
Message-ID: <46E0B154.1000805@sgi.com>
In-Reply-To: <20070831154822.GD734179@sgi.com>

[-- Attachment #1: Type: text/plain, Size: 2919 bytes --]

David Chinner wrote:
> On Fri, Aug 31, 2007 at 02:01:37PM +1000, Mark Goodwin wrote:
>> Lachlan McIlroy wrote:
>>> Timothy Shimmin wrote:
>>>> Timothy Shimmin wrote:
>>>>>>>  But I'm not sure this is an error...
>>>>>>>  Hmmmm...I'm a bit confused.
>>>>>>>  So you are _almost_ combining an error check with a flushiter check?
>>>>>>>  If one buffer is an inode magic# and the other isn't then we
>>>>>>>  have an error right - and could report it - but we are not doing 
>>>>>>> that here.
>>>>>> Not exactly.  If what's on disk is not an inode but the log item is
>>>>>> then that could be because we haven't written the inode to disk yet
>>>>>> and we need to perform recovery.
>>>>> Yeah, I was thinking about that afterward.
>>>>> The item's format which gives the blk# for the buf to read could
>>>>> be a block which hasn't been used for an inode yet.
>>>>>
>>>> Well, if what's on disk is not an inode but some other data
>>>> and it happens to have the inode magic# which is remotely possible,
>>>> then we are making a bad assumption.
>>>> i.e. if we're not sure what the block/buffer should be, then testing the
>>>> MAGIC# isn't a guarantee it's an inode then.
>>>> Well not for the freeing of inode clusters case I would assume.
>>>> Or am I missing something?
>>> I don't think you're missing anything!
>>>
>>> You're right though - a magic number check is no guarantee.  In the same
>>> vein, adding a generation number check isn't much better.
>> unlink will have to invalidate the on-disk inode magic number? Or only
>> when the whole cluster is free'd?
> 
> An unlinked inode is only detectable by the mode parameter being zero.
> The rest of the inode will look valid.
> 
> To detect the difference between a newly allocated inode *chunk*
> that has been written to and a stale inode chunk that we have
> just allocated and not written to yet, you need to walk every inode
> in the chunk and determine if the mode parameter is zero in every
> inode.
> 
> If the mode is zero for all inodes and there are generation numbers
> that are not zero, then you've detected a stale buffer and you should
> replay the inode cluster buffer initialisation.
> 

Thanks for this info, Dave.  I looked into it and came up with a solution
that examines the ondisk inode buffer to determine whether it has been
written to since being logged.  It iterates through all the inodes in the
buffer and applies these checks to each one:

- if the magic number is wrong the buffer is stale
- if the mode is non-zero then the buffer is newer than the log
- if the mode is zero and the generation count is non-zero then the
   buffer is stale

If the end result is a stale buffer then the buffer is replayed; otherwise
it is skipped.  I added a new flag that gets logged with a new inode
cluster so that we can distinguish a buffer of inodes from something else.
This fix passes all the tests we have.  Is this a better approach
than the last fix?
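
For anyone who wants to try the per-inode checks outside the kernel, here is
a minimal standalone sketch of the decision described above.  It is not the
patch itself: the struct sketch_dinode, the function
sketch_inode_buf_is_stale() and the SKETCH_DINODE_MAGIC name are hypothetical,
and the fields are assumed to have already been converted to CPU byte order.
A buffer in which every inode has a valid magic number, zero mode and zero
generation count is treated as not stale, which matches the attached patch.

```c
#include <stddef.h>
#include <stdint.h>

/* XFS on-disk inode magic, "IN"; the SKETCH_ name is hypothetical. */
#define SKETCH_DINODE_MAGIC	0x494e

/*
 * Hypothetical, simplified view of one on-disk inode.  Only the three
 * fields the checks look at are kept, already in CPU byte order.
 */
struct sketch_dinode {
	uint16_t	magic;
	uint16_t	mode;
	uint32_t	gen;
};

/*
 * Return 1 if the on-disk buffer is stale and the logged image should be
 * replayed over it, 0 if the buffer should be left alone.
 */
int
sketch_inode_buf_is_stale(const struct sketch_dinode *inodes, size_t count)
{
	size_t	i;

	for (i = 0; i < count; i++) {
		/* Wrong magic number: the buffer is stale. */
		if (inodes[i].magic != SKETCH_DINODE_MAGIC)
			return 1;
		/* Non-zero mode: the buffer is newer than the log. */
		if (inodes[i].mode != 0)
			return 0;
		/* Zero mode, non-zero generation count: stale. */
		if (inodes[i].gen != 0)
			return 1;
	}
	/* Every inode is valid with zero mode and zero generation count. */
	return 0;
}
```

As in the patch, the walk stops at the first inode that decides the
question, so one in-use inode is enough to skip replay for the whole buffer.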

Lachlan

[-- Attachment #2: xfs_log_recover.diff --]
[-- Type: text/x-patch, Size: 2760 bytes --]

--- fs/xfs/xfs_buf_item.h_1.44	2007-09-04 13:38:24.000000000 +1000
+++ fs/xfs/xfs_buf_item.h	2007-09-06 12:06:39.000000000 +1000
@@ -52,6 +52,11 @@ typedef struct xfs_buf_log_format_t {
 #define	XFS_BLI_UDQUOT_BUF	0x4
 #define XFS_BLI_PDQUOT_BUF	0x8
 #define	XFS_BLI_GDQUOT_BUF	0x10
+/*
+ * This flag indicates that the buffer contains newly allocated
+ * inodes.
+ */
+#define	XFS_BLI_INODE_NEW_BUF	0x20
 
 #define	XFS_BLI_CHUNK		128
 #define	XFS_BLI_SHIFT		7
--- fs/xfs/xfs_log_recover.c_1.322	2007-08-27 17:45:45.000000000 +1000
+++ fs/xfs/xfs_log_recover.c	2007-09-07 10:41:38.000000000 +1000
@@ -1874,6 +1874,7 @@ xlog_recover_do_inode_buffer(
 /*ARGSUSED*/
 STATIC void
 xlog_recover_do_reg_buffer(
+	xfs_mount_t		*mp,
 	xlog_recover_item_t	*item,
 	xfs_buf_t		*bp,
 	xfs_buf_log_format_t	*buf_f)
@@ -1884,6 +1885,30 @@ xlog_recover_do_reg_buffer(
 	unsigned int		*data_map = NULL;
 	unsigned int		map_size = 0;
 	int                     error;
+	int			stale_buf = 1;
+
+	if (buf_f->blf_flags & XFS_BLI_INODE_NEW_BUF) {
+		xfs_dinode_t    *dip;
+		int             inodes_per_buf;
+
+		stale_buf = 0;
+		inodes_per_buf = XFS_BUF_COUNT(bp) >> mp->m_sb.sb_inodelog;
+		for (i = 0; i < inodes_per_buf; i++) {
+			dip = (xfs_dinode_t *)xfs_buf_offset(bp,
+				i * mp->m_sb.sb_inodesize);
+			if (be16_to_cpu(dip->di_core.di_magic) !=
+					XFS_DINODE_MAGIC) {
+				stale_buf = 1;
+				break;
+			}
+			if (be16_to_cpu(dip->di_core.di_mode))
+				break;
+			if (be32_to_cpu(dip->di_core.di_gen)) {
+				stale_buf = 1;
+				break;
+			}
+		}
+	}
 
 	switch (buf_f->blf_type) {
 	case XFS_LI_BUF:
@@ -1917,7 +1942,7 @@ xlog_recover_do_reg_buffer(
 					       -1, 0, XFS_QMOPT_DOWARN,
 					       "dquot_buf_recover");
 		}
-		if (!error)
+		if (!error && stale_buf)
 			memcpy(xfs_buf_offset(bp,
 				(uint)bit << XFS_BLI_SHIFT),	/* dest */
 				item->ri_buf[i].i_addr,		/* source */
@@ -2089,7 +2114,7 @@ xlog_recover_do_dquot_buffer(
 	if (log->l_quotaoffs_flag & type)
 		return;
 
-	xlog_recover_do_reg_buffer(item, bp, buf_f);
+	xlog_recover_do_reg_buffer(mp, item, bp, buf_f);
 }
 
 /*
@@ -2190,7 +2215,7 @@ xlog_recover_do_buffer_trans(
 		  (XFS_BLI_UDQUOT_BUF|XFS_BLI_PDQUOT_BUF|XFS_BLI_GDQUOT_BUF)) {
 		xlog_recover_do_dquot_buffer(mp, log, item, bp, buf_f);
 	} else {
-		xlog_recover_do_reg_buffer(item, bp, buf_f);
+		xlog_recover_do_reg_buffer(mp, item, bp, buf_f);
 	}
 	if (error)
 		return XFS_ERROR(error);
--- fs/xfs/xfs_trans_buf.c_1.126	2007-09-04 13:38:27.000000000 +1000
+++ fs/xfs/xfs_trans_buf.c	2007-09-05 17:37:31.000000000 +1000
@@ -966,6 +966,7 @@ xfs_trans_inode_alloc_buf(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 
 	bip->bli_flags |= XFS_BLI_INODE_ALLOC_BUF;
+	bip->bli_format.blf_flags |= XFS_BLI_INODE_NEW_BUF;
 }
 
 
