public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: xfs@oss.sgi.com
Cc: bpm@sgi.com
Subject: [PATCH 1/2] xfs: fix log recovery transaction item reordering
Date: Wed, 29 May 2013 06:36:12 +1000	[thread overview]
Message-ID: <1369773373-26697-2-git-send-email-david@fromorbit.com> (raw)
In-Reply-To: <1369773373-26697-1-git-send-email-david@fromorbit.com>

From: Dave Chinner <dchinner@redhat.com>

There are several constraints that inode allocation and unlink
logging impose on log recovery. These all stem from the fact that
inode alloc/unlink are logged in buffers, but all other inode
changes are logged in inode items. Hence there are ordering
constraints that recovery must follow to ensure the correct result
occurs.

As it turns out, this ordering has been working mostly by chance
than good management. The existing code moves all buffers except
cancelled buffers to the head of the list, and everything else to
the tail of the list. The problem with this is that is interleaves
inode items with the buffer cancellation items, and hence whether
the inode item in an cancelled buffer gets replayed is essentially
left to chance.

Further, this ordering causes problems for log recovery when inode
CRCs are enabled. It typically replays the inode unlink buffer long before
it replays the inode core changes, and so the CRC recorded in an
unlink buffer is going to be invalid and hence any attempt to
validate the inode in the buffer is going to fail. Hence we really
need to enforce the ordering that the inode alloc/unlink code has
expected log recovery to have since inode chunk de-allocation was
introduced back in 2003...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_recover.c |   65 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 58 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index d6204d1..83088d9 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1599,10 +1599,43 @@ xlog_recover_add_to_trans(
 }
 
 /*
- * Sort the log items in the transaction. Cancelled buffers need
- * to be put first so they are processed before any items that might
- * modify the buffers. If they are cancelled, then the modifications
- * don't need to be replayed.
+ * Sort the log items in the transaction.
+ *
+ * The ordering constraints are defined by the inode allocation and unlink
+ * behaviour. The rules are:
+ *
+ *	1. Every item is only logged once in a given transaction. Hence it
+ *	   represents the last logged state of the item. Hence ordering is
+ *	   dependent on the order in which operations need to be performed so
+ *	   required initial conditions are always met.
+ *
+ *	2. Cancelled buffers are recorded in pass 1 in a separate table and
+ *	   there's nothing to replay from them so we can simply cull them
+ *	   from the transaction. However, we can't do that until after we've
+ *	   replayed all the other items because they may be dependent on the
+ *	   cancelled buffer and replaying the cancelled buffer can remove it
+ *	   form the cancelled buffer table. Hence they have tobe done last.
+ *
+ *	3. Inode allocation buffers must be replayed before inode items that
+ *	   read the buffer and replay changes into it.
+ *
+ *	4. Inode unlink buffers must be replayed after inode items are replayed.
+ *	   This ensures that inodes are completely flushed to the inode buffer
+ *	   in a "free" state before we remove the unlinked inode list pointer.
+ *
+ * Hence the ordering needs to be inode allocation buffers first, inode items
+ * second, inode unlink buffers third and cancelled buffers last.
+ *
+ * But there's a problem with that - we can't tell an inode allocation buffer
+ * apart from a regular buffer, so we can't separate them. We can, however,
+ * tell an inode unlink buffer from the others, and so we can separate them out
+ * from all the other buffers and move them to last.
+ *
+ * Hence, 4 lists, in order from head to tail:
+ * 	- buffer_list for all buffers except cancelled/inode unlink buffers
+ * 	- item_list for all non-buffer items
+ * 	- inode_buffer_list for inode unlink buffers
+ * 	- cancel_list for the cancelled buffers
  */
 STATIC int
 xlog_recover_reorder_trans(
@@ -1612,6 +1645,10 @@ xlog_recover_reorder_trans(
 {
 	xlog_recover_item_t	*item, *n;
 	LIST_HEAD(sort_list);
+	LIST_HEAD(cancel_list);
+	LIST_HEAD(buffer_list);
+	LIST_HEAD(inode_buffer_list);
+	LIST_HEAD(inode_list);
 
 	list_splice_init(&trans->r_itemq, &sort_list);
 	list_for_each_entry_safe(item, n, &sort_list, ri_list) {
@@ -1619,12 +1656,18 @@ xlog_recover_reorder_trans(
 
 		switch (ITEM_TYPE(item)) {
 		case XFS_LI_BUF:
-			if (!(buf_f->blf_flags & XFS_BLF_CANCEL)) {
+			if (buf_f->blf_flags & XFS_BLF_CANCEL) {
 				trace_xfs_log_recover_item_reorder_head(log,
 							trans, item, pass);
-				list_move(&item->ri_list, &trans->r_itemq);
+				list_move(&item->ri_list, &cancel_list);
 				break;
 			}
+			if (buf_f->blf_flags & XFS_BLF_INODE_BUF) {
+				list_move(&item->ri_list, &inode_buffer_list);
+				break;
+			}
+			list_move_tail(&item->ri_list, &buffer_list);
+			break;
 		case XFS_LI_INODE:
 		case XFS_LI_DQUOT:
 		case XFS_LI_QUOTAOFF:
@@ -1632,7 +1675,7 @@ xlog_recover_reorder_trans(
 		case XFS_LI_EFI:
 			trace_xfs_log_recover_item_reorder_tail(log,
 							trans, item, pass);
-			list_move_tail(&item->ri_list, &trans->r_itemq);
+			list_move_tail(&item->ri_list, &inode_list);
 			break;
 		default:
 			xfs_warn(log->l_mp,
@@ -1643,6 +1686,14 @@ xlog_recover_reorder_trans(
 		}
 	}
 	ASSERT(list_empty(&sort_list));
+	if (!list_empty(&buffer_list))
+		list_splice(&buffer_list, &trans->r_itemq);
+	if (!list_empty(&inode_list))
+		list_splice_tail(&inode_list, &trans->r_itemq);
+	if (!list_empty(&inode_buffer_list))
+		list_splice_tail(&inode_buffer_list, &trans->r_itemq);
+	if (!list_empty(&cancel_list))
+		list_splice_tail(&cancel_list, &trans->r_itemq);
 	return 0;
 }
 
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  reply	other threads:[~2013-05-28 20:36 UTC|newest]

Thread overview: 54+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-05-27  6:38 [PATH 0/9] xfs: fixes for 3.10-rc4 Dave Chinner
2013-05-27  6:38 ` [PATCH 1/9] xfs: don't emit v5 superblock warnings on write Dave Chinner
2013-05-29 16:39   ` Brian Foster
2013-05-30 17:49     ` Ben Myers
2013-06-11  6:05       ` Dave Chinner
2013-06-11 21:29         ` Ben Myers
2013-05-27  6:38 ` [PATCH 2/9] xfs: fix incorrect remote symlink block count Dave Chinner
2013-05-29 16:39   ` Brian Foster
2013-05-30  0:46     ` Dave Chinner
2013-05-30 17:49     ` Ben Myers
2013-05-27  6:38 ` [PATCH 3/9] xfs: increase number of ACL entries for V5 superblocks Dave Chinner
2013-05-29 16:40   ` Brian Foster
2013-05-27  6:38 ` [PATCH 4/9] xfs: rework dquot CRCs Dave Chinner
2013-05-29 18:58   ` Brian Foster
2013-05-30  1:00     ` Dave Chinner
2013-05-30 12:02       ` Brian Foster
2013-06-03  4:12         ` Dave Chinner
2013-05-27  6:38 ` [PATCH 5/9] xfs: fix split buffer vector log recovery support Dave Chinner
2013-05-29 19:21   ` Mark Tinguely
2013-05-30 17:49     ` Ben Myers
2013-05-27  6:38 ` [PATCH 6/9] xfs: disable swap extents ioctl on CRC enabled filesystems Dave Chinner
2013-05-28 21:49   ` Ben Myers
2013-05-30  1:07     ` Dave Chinner
2013-05-29 21:06   ` Brian Foster
2013-05-30 17:56     ` Ben Myers
2013-05-27  6:38 ` [PATCH 7/9] xfs: kill suid/sgid through the truncate path Dave Chinner
2013-05-30 14:17   ` Brian Foster
2013-05-30 15:52     ` Ben Myers
2013-05-30 16:02       ` Brian Foster
2013-05-30 17:07         ` Ben Myers
2013-05-27  6:38 ` [PATCH 8/9] xfs: add fsgeom flag for v5 superblock support Dave Chinner
2013-05-29 15:10   ` Eric Sandeen
2013-05-29 21:43     ` Ben Myers
2013-05-29 21:47       ` Ben Myers
2013-05-30  1:28       ` Dave Chinner
2013-05-30  1:11     ` Dave Chinner
2013-05-30 14:17   ` Brian Foster
2013-05-30 17:57     ` Ben Myers
2013-05-27  6:38 ` [PATCH 9/9] xfs: inode unlinked list needs to recalculate the inode CRC Dave Chinner
2013-05-28 11:51   ` Dave Chinner
2013-05-28 20:36   ` [PATCH 9a,9b v2, replacements] xfs: unlinked list crcs Dave Chinner
2013-05-28 20:36     ` Dave Chinner [this message]
2013-05-28 20:36     ` [PATCH 2/2] xfs: inode unlinked list needs to recalculate the inode CRC Dave Chinner
2013-05-30 14:17       ` Brian Foster
2013-05-30 20:27         ` Dave Chinner
2013-05-28  8:37 ` [PATCH 10/9] xfs: fix dir3 freespace block corruption Dave Chinner
2013-05-30 19:15   ` Ben Myers
2013-05-31 21:54     ` Ben Myers
2013-05-28 17:56 ` [PATH 0/9] xfs: fixes for 3.10-rc4 Ben Myers
2013-05-28 23:54   ` Dave Chinner
2013-05-29 19:01     ` Ben Myers
2013-05-29 19:27       ` Eric Sandeen
2013-05-29 19:45         ` Ben Myers
2013-05-28 21:27 ` [PATCH 11/9] xfs: fix remote attribute invalidation for a leaf Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1369773373-26697-2-git-send-email-david@fromorbit.com \
    --to=david@fromorbit.com \
    --cc=bpm@sgi.com \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox