public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: xfs@oss.sgi.com
Subject: [PATCH 12/12] xfs: Ensure inode allocation buffers are fully replayed
Date: Fri,  7 May 2010 15:41:00 +1000	[thread overview]
Message-ID: <1273210860-23414-13-git-send-email-david@fromorbit.com> (raw)
In-Reply-To: <1273210860-23414-1-git-send-email-david@fromorbit.com>

From: Dave Chinner <dchinner@redhat.com>

With delayed logging, we can get inode allocation buffers in the
same transaction inode unlink buffers. We don't currently mark inode
allocation buffers in the log, so inode unlink buffers take
precedence over allocation buffers.

The result is that when they are combined into the same checkpoint,
only the unlinked inode chain fields are replayed, resulting in
uninitialised inode buffers being detected when the next inode
modification is replayed.

To fix this, we need to ensure that we do not set the inode buffer
flag in the buffer log item format flags if the inode allocation has
not already hit the log. To avoid requiring a change to log
recovery, we really need to make this a modification that relies
only on in-memory sate.

We can do this by checking during buffer log formatting (while the
CIL cannot be flushed) if we are still in the same sequence when we
commit the unlink transaction as the inode allocation transaction.
If we are, then we do not add the inode buffer flag to the buffer
log format item flags. This means the entire buffer will be
replayed, not just the unlinked fields. We do this while
CIL flusheѕ are locked out to ensure that we don't race with the
sequence numbers changing and hence fail to put the inode buffer
flag in the buffer format flags when we really need to.

Also, move an assert in the buffer release path outside the hash
spinlock so that if the assert is hit the system continues to run in
a debuggable state.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_buf.c |    2 +-
 fs/xfs/xfs_buf_item.c      |   14 ++++++++++++
 fs/xfs/xfs_buf_item.h      |    4 ++-
 fs/xfs/xfs_log.h           |    1 +
 fs/xfs/xfs_log_cil.c       |   48 ++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_trans.c         |   16 +++++++++++--
 fs/xfs/xfs_trans.h         |    1 +
 fs/xfs/xfs_trans_buf.c     |   20 +++++++++---------
 8 files changed, 89 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 82678bf..e085eca 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -800,9 +800,9 @@ xfs_buf_rele(
 		} else if (bp->b_flags & XBF_FS_MANAGED) {
 			spin_unlock(&hash->bh_lock);
 		} else {
-			ASSERT(!(bp->b_flags & (XBF_DELWRI|_XBF_DELWRI_Q)));
 			list_del_init(&bp->b_hash_list);
 			spin_unlock(&hash->bh_lock);
+			ASSERT(!(bp->b_flags & (XBF_DELWRI|_XBF_DELWRI_Q)));
 			xfs_buf_free(bp);
 		}
 	}
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index bcbb661..02a8098 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -254,6 +254,20 @@ xfs_buf_item_format(
 	vecp++;
 	nvecs = 1;
 
+	/*
+	 * If it is an inode buffer, transfer the in-memory state to the
+	 * format flags and clear the in-memory state. We do not transfer
+	 * this state if the inode buffer allocation has not yet been committed
+	 * to the log as setting the XFS_BLI_INODE_BUF flag will prevent
+	 * correct replay of the inode allocation.
+	 */
+	if (bip->bli_flags & XFS_BLI_INODE_BUF) {
+		if (!((bip->bli_flags & XFS_BLI_INODE_ALLOC_BUF) &&
+		      xfs_log_item_in_current_chkpt(&bip->bli_item)))
+			bip->bli_format.blf_flags |= XFS_BLF_INODE_BUF;
+		bip->bli_flags &= ~XFS_BLI_INODE_BUF;
+	}
+
 	if (bip->bli_flags & XFS_BLI_STALE) {
 		/*
 		 * The buffer is stale, so all we need to log
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index 8cbb82b..f20bb47 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -69,6 +69,7 @@ typedef struct xfs_buf_log_format {
 #define	XFS_BLI_LOGGED		0x08
 #define	XFS_BLI_INODE_ALLOC_BUF	0x10
 #define XFS_BLI_STALE_INODE	0x20
+#define	XFS_BLI_INODE_BUF	0x40
 
 #define XFS_BLI_FLAGS \
 	{ XFS_BLI_HOLD,		"HOLD" }, \
@@ -76,7 +77,8 @@ typedef struct xfs_buf_log_format {
 	{ XFS_BLI_STALE,	"STALE" }, \
 	{ XFS_BLI_LOGGED,	"LOGGED" }, \
 	{ XFS_BLI_INODE_ALLOC_BUF, "INODE_ALLOC" }, \
-	{ XFS_BLI_STALE_INODE,	"STALE_INODE" }
+	{ XFS_BLI_STALE_INODE,	"STALE_INODE" }, \
+	{ XFS_BLI_INODE_BUF,	"INODE_BUF" }
 
 
 #ifdef __KERNEL__
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 4a0c574..04c78e6 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -198,6 +198,7 @@ xlog_tid_t xfs_log_get_trans_ident(struct xfs_trans *tp);
 int	xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp,
 				struct xfs_log_vec *log_vector,
 				xfs_lsn_t *commit_lsn, int flags);
+bool	xfs_log_item_in_current_chkpt(struct xfs_log_item *lip);
 
 #endif
 
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 806cf6b..f6733eb 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -201,6 +201,15 @@ xlog_cil_insert(
 	ctx->nvecs += diff_iovecs;
 
 	/*
+	 * If this is the first time the item is being committed to the CIL,
+	 * store the sequence number on the log item so we can tell
+	 * in future commits whether this is the first checkpoint the item is
+	 * being committed into.
+	 */
+	if (!item->li_seq)
+		item->li_seq = ctx->sequence;
+
+	/*
 	 * Now transfer enough transaction reservation to the context ticket
 	 * for the checkpoint. The context ticket is special - the unit
 	 * reservation has to grow as well as the current reservation as we
@@ -328,6 +337,10 @@ xlog_cil_free_logvec(
  * For more specific information about the order of operations in
  * xfs_log_commit_cil() please refer to the comments in
  * xfs_trans_commit_iclog().
+ *
+ * Called with the context lock already held in read mode to lock out
+ * background commit, returns without it held once background commits are
+ * allowed again.
  */
 int
 xfs_log_commit_cil(
@@ -346,11 +359,10 @@ xfs_log_commit_cil(
 
 	if (XLOG_FORCED_SHUTDOWN(log)) {
 		xlog_cil_free_logvec(log_vector);
+		up_read(&log->l_cilp->xc_ctx_lock);
 		return XFS_ERROR(EIO);
 	}
 
-	/* lock out background commit */
-	down_read(&log->l_cilp->xc_ctx_lock);
 	xlog_cil_format_items(log, log_vector, tp->t_ticket, commit_lsn);
 
 	/* check we didn't blow the reservation */
@@ -687,3 +699,35 @@ restart:
 	return commit_lsn;
 }
 
+/*
+ * Check if the current log item was first committed in this sequence.
+ * We can't rely on just the log item being in the CIL, we have to check
+ * the recorded commit sequence number.
+ *
+ * Note: for this to be used in a non-racy manner, it has to be called with
+ * CIL flushing locked out. As a result, it should only be used during the
+ * transaction commit process when deciding what to format into the item.
+ */
+bool
+xfs_log_item_in_current_chkpt(
+	struct xfs_log_item *lip)
+{
+	struct xfs_cil_ctx *ctx;
+
+	if (!(lip->li_mountp->m_flags & XFS_MOUNT_DELAYLOG))
+		return false;
+	if (list_empty(&lip->li_cil))
+		return false;
+
+	ctx = lip->li_mountp->m_log->l_cilp->xc_ctx;
+
+	/*
+	 * li_seq is written on the first commit of a log item to record the
+	 * first checkpoint it is written to. Hence if it is different to the
+	 * current sequence, we're in a new checkpoint.
+	 */
+	if (XFS_LSN_CMP(lip->li_seq, ctx->sequence) != 0)
+		return false;
+	return true;
+}
+
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 9bdb492..3e88c3f 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -45,6 +45,7 @@
 #include "xfs_trans_space.h"
 #include "xfs_inode_item.h"
 #include "xfs_trace.h"
+#include "xfs_log_priv.h"
 
 kmem_zone_t	*xfs_trans_zone;
 
@@ -1261,10 +1262,19 @@ xfs_trans_commit_cil(
 		return ENOMEM;
 
 	/*
-	 * Fill in the log_vector and pin the logged items, and
-	 * then write the transaction to the log. We have to lock
-	 * out CIL flushes from this point as we are going to pin
+	 * Now we need to fill in the log_vector and pin the logged items, and
+	 * then write the transaction to the log.
+	 *
+	 * Important: We have to lock out CIL flushes from this point as
+	 * transferring state from the in memory log items to the log item
+	 * headers during formatting may require atomicity against log writes
+	 * to ensure that state is transferred to the log without racing
+	 * against flushes.
+	 *
+	 * xfs_log_commit_cil() will release the lock as part of the commit
+	 * process.
 	 */
+	down_read(&mp->m_log->l_cilp->xc_ctx_lock);
 	xfs_trans_fill_log_vecs(tp, log_vector);
 
 	return xfs_log_commit_cil(mp, tp, log_vector, commit_lsn, flags);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index b1ea20c..8c69e78 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -835,6 +835,7 @@ typedef struct xfs_log_item {
 	/* delayed logging */
 	struct list_head		li_cil;		/* CIL pointers */
 	struct xfs_log_vec		*li_lv;		/* active log vector */
+	xfs_lsn_t			li_seq;		/* CIL commit seq */
 } xfs_log_item_t;
 
 #define	XFS_LI_IN_AIL	0x1
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 3390c3e..63d81a2 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -792,7 +792,7 @@ xfs_trans_binval(
 	XFS_BUF_UNDELAYWRITE(bp);
 	XFS_BUF_STALE(bp);
 	bip->bli_flags |= XFS_BLI_STALE;
-	bip->bli_flags &= ~(XFS_BLI_LOGGED | XFS_BLI_DIRTY);
+	bip->bli_flags &= ~(XFS_BLI_INODE_BUF | XFS_BLI_LOGGED | XFS_BLI_DIRTY);
 	bip->bli_format.blf_flags &= ~XFS_BLF_INODE_BUF;
 	bip->bli_format.blf_flags |= XFS_BLF_CANCEL;
 	memset((char *)(bip->bli_format.blf_data_map), 0,
@@ -802,16 +802,16 @@ xfs_trans_binval(
 }
 
 /*
- * This call is used to indicate that the buffer contains on-disk
- * inodes which must be handled specially during recovery.  They
- * require special handling because only the di_next_unlinked from
- * the inodes in the buffer should be recovered.  The rest of the
- * data in the buffer is logged via the inodes themselves.
+ * This call is used to indicate that the buffer contains on-disk inodes which
+ * must be handled specially during recovery.  They require special handling
+ * because only the di_next_unlinked from the inodes in the buffer should be
+ * recovered.  The rest of the data in the buffer is logged via the inodes
+ * themselves.
  *
- * All we do is set the XFS_BLI_INODE_BUF flag in the buffer's log
- * format structure so that we'll know what to do at recovery time.
+ * All we do is set the XFS_BLI_INODE_BUF flag in the items flags so it can be
+ * transferred to the buffer's log format structure so that we'll know what to
+ * do at recovery time.
  */
-/* ARGSUSED */
 void
 xfs_trans_inode_buf(
 	xfs_trans_t	*tp,
@@ -826,7 +826,7 @@ xfs_trans_inode_buf(
 	bip = XFS_BUF_FSPRIVATE(bp, xfs_buf_log_item_t *);
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 
-	bip->bli_format.blf_flags |= XFS_BLF_INODE_BUF;
+	bip->bli_flags |= XFS_BLI_INODE_BUF;
 }
 
 /*
-- 
1.5.6.5

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  parent reply	other threads:[~2010-05-07  5:39 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-05-07  5:40 [PATCH 0/12] xfs: delayed logging V5 Dave Chinner
2010-05-07  5:40 ` [PATCH 01/12] xfs: Don't reuse the same transaction ID for duplicated transactions Dave Chinner
2010-05-07  5:40 ` [PATCH 02/12] xfs: allow log ticket allocation to take allocation flags Dave Chinner
2010-05-07  5:40 ` [PATCH 03/12] xfs: modify buffer item reference counting Dave Chinner
2010-05-07  5:40 ` [PATCH 04/12] xfs: Clean up XFS_BLI_* flag namespace Dave Chinner
2010-05-07  5:40 ` [PATCH 05/12] xfs: clean up log ticket overrun debug output Dave Chinner
2010-05-07  5:40 ` [PATCH 06/12] xfs: make the log ticket ID available outside the log infrastructure Dave Chinner
2010-05-07 11:41   ` Christoph Hellwig
2010-05-07  5:40 ` [PATCH 07/12] xfs: Improve scalability of busy extent tracking Dave Chinner
2010-05-08 17:15   ` Christoph Hellwig
2010-05-07  5:40 ` [PATCH 08/12] xfs: Delayed logging design documentation Dave Chinner
2010-05-07  5:40 ` [PATCH 09/12] xfs: Introduce delayed logging core code Dave Chinner
2010-05-10 11:44   ` Christoph Hellwig
2010-05-10 12:16     ` Dave Chinner
2010-05-17  5:51       ` Dave Chinner
2010-05-17  7:34         ` Christoph Hellwig
2010-05-07  5:40 ` [PATCH 10/12] xfs: forced unmounts need to push the CIL Dave Chinner
2010-05-10 11:44   ` Christoph Hellwig
2010-05-07  5:40 ` [PATCH 11/12] xfs: enable background pushing of " Dave Chinner
2010-05-10 11:45   ` Christoph Hellwig
2010-05-07  5:41 ` Dave Chinner [this message]
2010-05-10 17:43   ` [PATCH 12/12] xfs: Ensure inode allocation buffers are fully replayed Christoph Hellwig
  -- strict thread matches above, loose matches on Subject: below --
2010-05-17 23:24 [PATCH 0/12] xfs: delayed logging V6 Dave Chinner
2010-05-17 23:24 ` [PATCH 12/12] xfs: Ensure inode allocation buffers are fully replayed Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1273210860-23414-13-git-send-email-david@fromorbit.com \
    --to=david@fromorbit.com \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox