[PATCH v2 0/3] xfs: fix inode use-after-free during log recovery

linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/3] xfs: fix inode use-after-free during log recovery
@ 2020-05-05  1:13 Darrick J. Wong
  2020-05-05  1:13 ` [PATCH 1/3] xfs: proper replay of deferred ops queued " Darrick J. Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Darrick J. Wong @ 2020-05-05  1:13 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

Hi all,

Fix a use-after-free during log recovery of deferred operations by
creating explicit freeze and thaw mechanisms for deferred ops that were
created while processing intent items that were recovered from the log.
While we're at it, fix all the bogosity around how we gather up log
intents during recovery and actually commit them to the filesystem.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fix-log-recovery

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=fix-log-recovery

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-05-05  1:13 [PATCH v2 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong
@ 2020-05-05  1:13 ` Darrick J. Wong
  2020-05-05  2:33   ` Dave Chinner
  2020-05-05  1:13 ` [PATCH 2/3] xfs: reduce log recovery transaction block reservations Darrick J. Wong
  2020-05-05  1:13 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong
  2 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2020-05-05  1:13 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

When we replay unfinished intent items that have been recovered from the
log, it's possible that the replay will cause the creation of more
deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
recovery should replay deferred ops in order"), later work items have an
implicit ordering dependency on earlier work items.  Therefore, recovery
must replay the items (both recovered and created) in the same order
that they would have been during normal operation.

For log recovery, we enforce this ordering by using an empty transaction
to collect deferred ops that get created in the process of recovering a
log intent item to prevent them from being committed before the rest of
the recovered intent items.  After we finish committing all the
recovered log items, we allocate a transaction with an enormous block
reservation, splice our huge list of created deferred ops into that
transaction, and commit it, thereby finishing all those ops.

This is /really/ hokey -- it's the one place in XFS where we allow
nested transactions; the splicing of the defer ops list is is inelegant
and has to be done twice per recovery function; and the broken way we
handle inode pointers and block reservations cause subtle use-after-free
and allocator problems that will be fixed by this patch and the two
patches after it.

Therefore, replace the hokey empty transaction with a structure designed
to capture each chain of deferred ops that are created as part of
recovering a single unfinished log intent.  Finally, refactor the loop
that replays those chains to do so using one transaction per chain.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_defer.c       |   57 ++++++++++++++++++
 fs/xfs/libxfs/xfs_defer.h       |   21 +++++++
 fs/xfs/libxfs/xfs_log_recover.h |    4 +
 fs/xfs/xfs_bmap_item.c          |   16 +----
 fs/xfs/xfs_extfree_item.c       |    7 +-
 fs/xfs/xfs_log_recover.c        |  122 +++++++++++++++++++++++++--------------
 fs/xfs/xfs_refcount_item.c      |   16 +----
 fs/xfs/xfs_rmap_item.c          |    7 +-
 fs/xfs/xfs_trans.h              |    4 +
 9 files changed, 178 insertions(+), 76 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 1172fbf072d8..bbbd0141d4a6 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -544,3 +544,60 @@ xfs_defer_move(
 
 	xfs_defer_reset(stp);
 }
+
+/*
+ * Freeze a chain of deferred ops that are attached to a transaction.  The
+ * entire deferred ops state is transferred to the freezer, and each dfops
+ * item will be prepared for freezing.
+ */
+int
+xfs_defer_freeze(
+	struct xfs_trans		*tp,
+	struct xfs_defer_freezer	**dffp)
+{
+	struct xfs_defer_freezer	*dff;
+
+	*dffp = NULL;
+	if (list_empty(&tp->t_dfops))
+		return 0;
+
+	dff = kmem_zalloc(sizeof(*dff), KM_NOFS);
+	if (!dff)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&dff->dff_list);
+	INIT_LIST_HEAD(&dff->dff_dfops);
+
+	/* Move all the dfops items to the freezer. */
+	list_splice_init(&tp->t_dfops, &dff->dff_dfops);
+	dff->dff_tpflags = tp->t_flags & XFS_TRANS_LOWMODE;
+	xfs_defer_reset(tp);
+
+	*dffp = dff;
+	return 0;
+}
+
+/* Thaw a chain of deferred ops that are attached to a transaction. */
+int
+xfs_defer_thaw(
+	struct xfs_defer_freezer	*dff,
+	struct xfs_trans		*tp)
+{
+	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
+
+	/* Add the dfops items to the transaction. */
+	list_splice_init(&dff->dff_dfops, &tp->t_dfops);
+	tp->t_flags |= dff->dff_tpflags;
+
+	return 0;
+}
+
+/* Release a deferred op freezer and all resources associated with it. */
+void
+xfs_defer_freeezer_finish(
+	struct xfs_mount		*mp,
+	struct xfs_defer_freezer	*dff)
+{
+	xfs_defer_cancel_list(mp, &dff->dff_dfops);
+	kmem_free(dff);
+}
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 3bf7c2c4d851..4789bf53dd96 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -8,6 +8,7 @@
 
 struct xfs_btree_cur;
 struct xfs_defer_op_type;
+struct xfs_defer_freezer;
 
 /*
  * Header for deferred operation list.
@@ -63,4 +64,24 @@ extern const struct xfs_defer_op_type xfs_rmap_update_defer_type;
 extern const struct xfs_defer_op_type xfs_extent_free_defer_type;
 extern const struct xfs_defer_op_type xfs_agfl_free_defer_type;
 
+/*
+ * Deferred operation freezer.  This structure enables a dfops user to detach
+ * the chain of deferred operations from a transaction so that they can be
+ * run later.
+ */
+struct xfs_defer_freezer {
+	/* List of other freezer heads. */
+	struct list_head	dff_list;
+
+	/* Deferred ops state saved from the transaction. */
+	struct list_head	dff_dfops;
+	unsigned int		dff_tpflags;
+};
+
+/* Functions to freeze a chain of deferred operations for later. */
+int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp);
+int xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp);
+void xfs_defer_freeezer_finish(struct xfs_mount *mp,
+		struct xfs_defer_freezer *dff);
+
 #endif /* __XFS_DEFER_H__ */
diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
index d8c0eae87179..6092a02b2d2a 100644
--- a/fs/xfs/libxfs/xfs_log_recover.h
+++ b/fs/xfs/libxfs/xfs_log_recover.h
@@ -6,6 +6,8 @@
 #ifndef	__XFS_LOG_RECOVER_H__
 #define __XFS_LOG_RECOVER_H__
 
+struct xfs_defer_freezer;
+
 /*
  * Each log item type (XFS_LI_*) gets its own xlog_recover_item_ops to
  * define how recovery should work for that type of log item.
@@ -130,5 +132,7 @@ void xlog_recover_release_intent(struct xlog *log, unsigned short intent_type,
 		uint64_t intent_id);
 void xlog_recover_insert_ail(struct xlog *log, struct xfs_log_item *lip,
 		xfs_lsn_t lsn);
+int xlog_recover_trans_commit(struct xfs_trans *tp,
+		struct xfs_defer_freezer **dffp);
 
 #endif	/* __XFS_LOG_RECOVER_H__ */
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 0793c317defb..c733bdeeeb9b 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -423,13 +423,13 @@ const struct xfs_defer_op_type xfs_bmap_update_defer_type = {
 STATIC int
 xfs_bui_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct xfs_defer_freezer	**dffp)
 {
 	struct xfs_bmbt_irec		irec;
 	struct xfs_bui_log_item		*buip = BUI_ITEM(lip);
 	struct xfs_trans		*tp;
 	struct xfs_inode		*ip = NULL;
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	struct xfs_map_extent		*bmap;
 	struct xfs_bud_log_item		*budp;
 	xfs_fsblock_t			startblock_fsb;
@@ -485,12 +485,7 @@ xfs_bui_item_recover(
 			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
 	if (error)
 		return error;
-	/*
-	 * Recovery stashes all deferred ops during intent processing and
-	 * finishes them on completion. Transfer current dfops state to this
-	 * transaction and transfer the result back before we return.
-	 */
-	xfs_defer_move(tp, parent_tp);
+
 	budp = xfs_trans_get_bud(tp, buip);
 
 	/* Grab the inode. */
@@ -534,15 +529,12 @@ xfs_bui_item_recover(
 		xfs_bmap_unmap_extent(tp, ip, &irec);
 	}
 
-	xfs_defer_move(parent_tp, tp);
-	error = xfs_trans_commit(tp);
+	error = xlog_recover_trans_commit(tp, dffp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	xfs_irele(ip);
-
 	return error;
 
 err_inode:
-	xfs_defer_move(parent_tp, tp);
 	xfs_trans_cancel(tp);
 	if (ip) {
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index b92678bede24..262acddd2e1b 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -583,10 +583,10 @@ const struct xfs_defer_op_type xfs_agfl_free_defer_type = {
 STATIC int
 xfs_efi_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct xfs_defer_freezer	**dffp)
 {
 	struct xfs_efi_log_item		*efip = EFI_ITEM(lip);
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	struct xfs_efd_log_item		*efdp;
 	struct xfs_trans		*tp;
 	struct xfs_extent		*extp;
@@ -631,8 +631,7 @@ xfs_efi_item_recover(
 
 	}
 
-	error = xfs_trans_commit(tp);
-	return error;
+	return xlog_recover_trans_commit(tp, dffp);
 
 abort_error:
 	xfs_trans_cancel(tp);
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index a9cc546535e0..908bfb284a9a 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1809,6 +1809,26 @@ xlog_recover_insert_ail(
 	xfs_trans_ail_update(log->l_ailp, lip, lsn);
 }
 
+/*
+ * Freeze any deferred ops and commit the transaction.  This is the last step
+ * needed to finish a log intent item that we recovered from the log.
+ */
+int
+xlog_recover_trans_commit(
+	struct xfs_trans		*tp,
+	struct xfs_defer_freezer	**dffp)
+{
+	int				error;
+
+	error = xfs_defer_freeze(tp, dffp);
+	if (error) {
+		xfs_trans_cancel(tp);
+		return error;
+	}
+
+	return xfs_trans_commit(tp);
+}
+
 /******************************************************************************
  *
  *		Log recover routines
@@ -2495,35 +2515,59 @@ xlog_recover_process_data(
 /* Take all the collected deferred ops and finish them in order. */
 static int
 xlog_finish_defer_ops(
-	struct xfs_trans	*parent_tp)
+	struct xfs_mount	*mp,
+	struct list_head	*dfops_freezers)
 {
-	struct xfs_mount	*mp = parent_tp->t_mountp;
+	struct xfs_defer_freezer *dff, *next;
 	struct xfs_trans	*tp;
 	int64_t			freeblks;
 	uint			resblks;
-	int			error;
+	int			error = 0;
 
-	/*
-	 * We're finishing the defer_ops that accumulated as a result of
-	 * recovering unfinished intent items during log recovery.  We
-	 * reserve an itruncate transaction because it is the largest
-	 * permanent transaction type.  Since we're the only user of the fs
-	 * right now, take 93% (15/16) of the available free blocks.  Use
-	 * weird math to avoid a 64-bit division.
-	 */
-	freeblks = percpu_counter_sum(&mp->m_fdblocks);
-	if (freeblks <= 0)
-		return -ENOSPC;
-	resblks = min_t(int64_t, UINT_MAX, freeblks);
-	resblks = (resblks * 15) >> 4;
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
-			0, XFS_TRANS_RESERVE, &tp);
-	if (error)
-		return error;
-	/* transfer all collected dfops to this transaction */
-	xfs_defer_move(tp, parent_tp);
+	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
+		/*
+		 * We're finishing the defer_ops that accumulated as a result
+		 * of recovering unfinished intent items during log recovery.
+		 * We reserve an itruncate transaction because it is the
+		 * largest permanent transaction type.  Since we're the only
+		 * user of the fs right now, take 93% (15/16) of the available
+		 * free blocks.  Use weird math to avoid a 64-bit division.
+		 */
+		freeblks = percpu_counter_sum(&mp->m_fdblocks);
+		if (freeblks <= 0) {
+			error = -ENOSPC;
+			break;
+		}
 
-	return xfs_trans_commit(tp);
+		resblks = min_t(int64_t, UINT_MAX, freeblks);
+		resblks = (resblks * 15) >> 4;
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
+				0, XFS_TRANS_RESERVE, &tp);
+		if (error)
+			break;
+
+		/* transfer all collected dfops to this transaction */
+		list_del_init(&dff->dff_list);
+		error = xfs_defer_thaw(dff, tp);
+		if (error) {
+			xfs_trans_cancel(tp);
+			xfs_defer_freeezer_finish(mp, dff);
+			break;
+		}
+
+		error = xfs_trans_commit(tp);
+		xfs_defer_freeezer_finish(mp, dff);
+		if (error)
+			break;
+	}
+
+	/* Kill any remaining freezers. */
+	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
+		list_del_init(&dff->dff_list);
+		xfs_defer_freeezer_finish(mp, dff);
+	}
+
+	return error;
 }
 
 /* Is this log item a deferred action intent? */
@@ -2553,8 +2597,9 @@ STATIC int
 xlog_recover_process_intents(
 	struct xlog		*log)
 {
-	struct xfs_trans	*parent_tp;
+	LIST_HEAD(dfops_freezers);
 	struct xfs_ail_cursor	cur;
+	struct xfs_defer_freezer *freezer = NULL;
 	struct xfs_log_item	*lip;
 	struct xfs_ail		*ailp;
 	int			error = 0;
@@ -2562,19 +2607,6 @@ xlog_recover_process_intents(
 	xfs_lsn_t		last_lsn;
 #endif
 
-	/*
-	 * The intent recovery handlers commit transactions to complete recovery
-	 * for individual intents, but any new deferred operations that are
-	 * queued during that process are held off until the very end. The
-	 * purpose of this transaction is to serve as a container for deferred
-	 * operations. Each intent recovery handler must transfer dfops here
-	 * before its local transaction commits, and we'll finish the entire
-	 * list below.
-	 */
-	error = xfs_trans_alloc_empty(log->l_mp, &parent_tp);
-	if (error)
-		return error;
-
 	ailp = log->l_ailp;
 	spin_lock(&ailp->ail_lock);
 	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
@@ -2603,27 +2635,31 @@ xlog_recover_process_intents(
 
 		/*
 		 * NOTE: If your intent processing routine can create more
-		 * deferred ops, you /must/ attach them to the transaction in
+		 * deferred ops, you /must/ attach them to the freezer in
 		 * this routine or else those subsequent intents will get
 		 * replayed in the wrong order!
 		 */
 		if (!test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags)) {
 			spin_unlock(&ailp->ail_lock);
-			error = lip->li_ops->iop_recover(lip, parent_tp);
+			error = lip->li_ops->iop_recover(lip, &freezer);
 			spin_lock(&ailp->ail_lock);
 		}
 		if (error)
 			goto out;
+		if (freezer) {
+			list_add_tail(&freezer->dff_list, &dfops_freezers);
+			freezer = NULL;
+		}
+
 		lip = xfs_trans_ail_cursor_next(ailp, &cur);
 	}
 out:
 	xfs_trans_ail_cursor_done(&cur);
 	spin_unlock(&ailp->ail_lock);
-	if (!error)
-		error = xlog_finish_defer_ops(parent_tp);
-	xfs_trans_cancel(parent_tp);
+	if (error)
+		return error;
 
-	return error;
+	return xlog_finish_defer_ops(log->l_mp, &dfops_freezers);
 }
 
 /*
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index e6d355a09bb3..dcdb975dbb86 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -423,7 +423,7 @@ const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
 STATIC int
 xfs_cui_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct xfs_defer_freezer	**dffp)
 {
 	struct xfs_bmbt_irec		irec;
 	struct xfs_cui_log_item		*cuip = CUI_ITEM(lip);
@@ -431,7 +431,7 @@ xfs_cui_item_recover(
 	struct xfs_cud_log_item		*cudp;
 	struct xfs_trans		*tp;
 	struct xfs_btree_cur		*rcur = NULL;
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	xfs_fsblock_t			startblock_fsb;
 	xfs_fsblock_t			new_fsb;
 	xfs_extlen_t			new_len;
@@ -492,12 +492,7 @@ xfs_cui_item_recover(
 			mp->m_refc_maxlevels * 2, 0, XFS_TRANS_RESERVE, &tp);
 	if (error)
 		return error;
-	/*
-	 * Recovery stashes all deferred ops during intent processing and
-	 * finishes them on completion. Transfer current dfops state to this
-	 * transaction and transfer the result back before we return.
-	 */
-	xfs_defer_move(tp, parent_tp);
+
 	cudp = xfs_trans_get_cud(tp, cuip);
 
 	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
@@ -554,13 +549,10 @@ xfs_cui_item_recover(
 	}
 
 	xfs_refcount_finish_one_cleanup(tp, rcur, error);
-	xfs_defer_move(parent_tp, tp);
-	error = xfs_trans_commit(tp);
-	return error;
+	return xlog_recover_trans_commit(tp, dffp);
 
 abort_error:
 	xfs_refcount_finish_one_cleanup(tp, rcur, error);
-	xfs_defer_move(parent_tp, tp);
 	xfs_trans_cancel(tp);
 	return error;
 }
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 4a5e2b1cf75a..c751487740df 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -466,14 +466,14 @@ const struct xfs_defer_op_type xfs_rmap_update_defer_type = {
 STATIC int
 xfs_rui_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct xfs_defer_freezer	**dffp)
 {
 	struct xfs_rui_log_item		*ruip = RUI_ITEM(lip);
 	struct xfs_map_extent		*rmap;
 	struct xfs_rud_log_item		*rudp;
 	struct xfs_trans		*tp;
 	struct xfs_btree_cur		*rcur = NULL;
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	xfs_fsblock_t			startblock_fsb;
 	enum xfs_rmap_intent_type	type;
 	xfs_exntst_t			state;
@@ -572,8 +572,7 @@ xfs_rui_item_recover(
 	}
 
 	xfs_rmap_finish_one_cleanup(tp, rcur, error);
-	error = xfs_trans_commit(tp);
-	return error;
+	return xlog_recover_trans_commit(tp, dffp);
 
 abort_error:
 	xfs_rmap_finish_one_cleanup(tp, rcur, error);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 8308bf6d7e40..48db4510b9d9 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -26,6 +26,7 @@ struct xfs_cui_log_item;
 struct xfs_cud_log_item;
 struct xfs_bui_log_item;
 struct xfs_bud_log_item;
+struct xfs_defer_freezer;
 
 struct xfs_log_item {
 	struct list_head		li_ail;		/* AIL pointers */
@@ -79,7 +80,8 @@ struct xfs_item_ops {
 	void (*iop_release)(struct xfs_log_item *);
 	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
 	void (*iop_error)(struct xfs_log_item *, xfs_buf_t *);
-	int (*iop_recover)(struct xfs_log_item *lip, struct xfs_trans *tp);
+	int (*iop_recover)(struct xfs_log_item *lip,
+			   struct xfs_defer_freezer **dffp);
 	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
 };
 


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-05-05  1:13 ` [PATCH 1/3] xfs: proper replay of deferred ops queued " Darrick J. Wong
@ 2020-05-05  2:33   ` Dave Chinner
  2020-05-05  3:06     ` Darrick J. Wong
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2020-05-05  2:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, May 04, 2020 at 06:13:39PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When we replay unfinished intent items that have been recovered from the
> log, it's possible that the replay will cause the creation of more
> deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
> recovery should replay deferred ops in order"), later work items have an
> implicit ordering dependency on earlier work items.  Therefore, recovery
> must replay the items (both recovered and created) in the same order
> that they would have been during normal operation.
> 
> For log recovery, we enforce this ordering by using an empty transaction
> to collect deferred ops that get created in the process of recovering a
> log intent item to prevent them from being committed before the rest of
> the recovered intent items.  After we finish committing all the
> recovered log items, we allocate a transaction with an enormous block
> reservation, splice our huge list of created deferred ops into that
> transaction, and commit it, thereby finishing all those ops.
> 
> This is /really/ hokey -- it's the one place in XFS where we allow
> nested transactions; the splicing of the defer ops list is is inelegant
> and has to be done twice per recovery function; and the broken way we
> handle inode pointers and block reservations cause subtle use-after-free
> and allocator problems that will be fixed by this patch and the two
> patches after it.
> 
> Therefore, replace the hokey empty transaction with a structure designed
> to capture each chain of deferred ops that are created as part of
> recovering a single unfinished log intent.  Finally, refactor the loop
> that replays those chains to do so using one transaction per chain.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

FWIW, I don't like the "freezer" based naming here. It's too easily
confused with freezing and thawing the filesystem....

I know, "delayed deferred ops" isn't much better, but at least it
won't get confused with existing unrelated functionality.

I've barely looked at the code, so no real comments on that yet,
but I did notice this:

> @@ -2495,35 +2515,59 @@ xlog_recover_process_data(
>  /* Take all the collected deferred ops and finish them in order. */
>  static int
>  xlog_finish_defer_ops(
> -	struct xfs_trans	*parent_tp)
> +	struct xfs_mount	*mp,
> +	struct list_head	*dfops_freezers)
>  {
> -	struct xfs_mount	*mp = parent_tp->t_mountp;
> +	struct xfs_defer_freezer *dff, *next;
>  	struct xfs_trans	*tp;
>  	int64_t			freeblks;
>  	uint			resblks;
....
> +		resblks = min_t(int64_t, UINT_MAX, freeblks);
> +		resblks = (resblks * 15) >> 4;

Can overflow when freeblks > (UINT_MAX / 15).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-05-05  2:33   ` Dave Chinner
@ 2020-05-05  3:06     ` Darrick J. Wong
  2020-05-05  5:10       ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2020-05-05  3:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, May 05, 2020 at 12:33:05PM +1000, Dave Chinner wrote:
> On Mon, May 04, 2020 at 06:13:39PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > When we replay unfinished intent items that have been recovered from the
> > log, it's possible that the replay will cause the creation of more
> > deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
> > recovery should replay deferred ops in order"), later work items have an
> > implicit ordering dependency on earlier work items.  Therefore, recovery
> > must replay the items (both recovered and created) in the same order
> > that they would have been during normal operation.
> > 
> > For log recovery, we enforce this ordering by using an empty transaction
> > to collect deferred ops that get created in the process of recovering a
> > log intent item to prevent them from being committed before the rest of
> > the recovered intent items.  After we finish committing all the
> > recovered log items, we allocate a transaction with an enormous block
> > reservation, splice our huge list of created deferred ops into that
> > transaction, and commit it, thereby finishing all those ops.
> > 
> > This is /really/ hokey -- it's the one place in XFS where we allow
> > nested transactions; the splicing of the defer ops list is is inelegant
> > and has to be done twice per recovery function; and the broken way we
> > handle inode pointers and block reservations cause subtle use-after-free
> > and allocator problems that will be fixed by this patch and the two
> > patches after it.
> > 
> > Therefore, replace the hokey empty transaction with a structure designed
> > to capture each chain of deferred ops that are created as part of
> > recovering a single unfinished log intent.  Finally, refactor the loop
> > that replays those chains to do so using one transaction per chain.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> FWIW, I don't like the "freezer" based naming here. It's too easily
> confused with freezing and thawing the filesystem....
> 
> I know, "delayed deferred ops" isn't much better, but at least it
> won't get confused with existing unrelated functionality.

xfs_defer_{freeze,thaw} -> xfs_defer_{capture,relink} ?

> I've barely looked at the code, so no real comments on that yet,
> but I did notice this:
> 
> > @@ -2495,35 +2515,59 @@ xlog_recover_process_data(
> >  /* Take all the collected deferred ops and finish them in order. */
> >  static int
> >  xlog_finish_defer_ops(
> > -	struct xfs_trans	*parent_tp)
> > +	struct xfs_mount	*mp,
> > +	struct list_head	*dfops_freezers)
> >  {
> > -	struct xfs_mount	*mp = parent_tp->t_mountp;
> > +	struct xfs_defer_freezer *dff, *next;
> >  	struct xfs_trans	*tp;
> >  	int64_t			freeblks;
> >  	uint			resblks;
> ....
> > +		resblks = min_t(int64_t, UINT_MAX, freeblks);
> > +		resblks = (resblks * 15) >> 4;
> 
> Can overflow when freeblks > (UINT_MAX / 15).

D'oh.  Ugh, I hate this whole fugly hack.

TBH I've been thinking that perhaps the freezer function should be
capturing the unused transaction block reservation when we capture the
dfops chain from the transaction.

When we set up the second transaction, we then set t_blk_res to the
captured block reservation.  So long as the recovery function is smart
enough to set up sufficient reservation we should avoid hitting ENOSPC,
right?

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-05-05  3:06     ` Darrick J. Wong
@ 2020-05-05  5:10       ` Dave Chinner
  2020-05-05 15:08         ` Darrick J. Wong
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2020-05-05  5:10 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, May 04, 2020 at 08:06:51PM -0700, Darrick J. Wong wrote:
> On Tue, May 05, 2020 at 12:33:05PM +1000, Dave Chinner wrote:
> > On Mon, May 04, 2020 at 06:13:39PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > When we replay unfinished intent items that have been recovered from the
> > > log, it's possible that the replay will cause the creation of more
> > > deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
> > > recovery should replay deferred ops in order"), later work items have an
> > > implicit ordering dependency on earlier work items.  Therefore, recovery
> > > must replay the items (both recovered and created) in the same order
> > > that they would have been during normal operation.
> > > 
> > > For log recovery, we enforce this ordering by using an empty transaction
> > > to collect deferred ops that get created in the process of recovering a
> > > log intent item to prevent them from being committed before the rest of
> > > the recovered intent items.  After we finish committing all the
> > > recovered log items, we allocate a transaction with an enormous block
> > > reservation, splice our huge list of created deferred ops into that
> > > transaction, and commit it, thereby finishing all those ops.
> > > 
> > > This is /really/ hokey -- it's the one place in XFS where we allow
> > > nested transactions; the splicing of the defer ops list is is inelegant
> > > and has to be done twice per recovery function; and the broken way we
> > > handle inode pointers and block reservations cause subtle use-after-free
> > > and allocator problems that will be fixed by this patch and the two
> > > patches after it.
> > > 
> > > Therefore, replace the hokey empty transaction with a structure designed
> > > to capture each chain of deferred ops that are created as part of
> > > recovering a single unfinished log intent.  Finally, refactor the loop
> > > that replays those chains to do so using one transaction per chain.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > FWIW, I don't like the "freezer" based naming here. It's too easily
> > confused with freezing and thawing the filesystem....
> > 
> > I know, "delayed deferred ops" isn't much better, but at least it
> > won't get confused with existing unrelated functionality.
> 
> xfs_defer_{freeze,thaw} -> xfs_defer_{capture,relink} ?

Yeah, capture seems appropriate, maybe relink -> continue? i.e.
capture the remaining defer_ops to be run, then continue running
them?

/me shrugs and thinks "naming is hard"....

> > I've barely looked at the code, so no real comments on that yet,
> > but I did notice this:
> > 
> > > @@ -2495,35 +2515,59 @@ xlog_recover_process_data(
> > >  /* Take all the collected deferred ops and finish them in order. */
> > >  static int
> > >  xlog_finish_defer_ops(
> > > -	struct xfs_trans	*parent_tp)
> > > +	struct xfs_mount	*mp,
> > > +	struct list_head	*dfops_freezers)
> > >  {
> > > -	struct xfs_mount	*mp = parent_tp->t_mountp;
> > > +	struct xfs_defer_freezer *dff, *next;
> > >  	struct xfs_trans	*tp;
> > >  	int64_t			freeblks;
> > >  	uint			resblks;
> > ....
> > > +		resblks = min_t(int64_t, UINT_MAX, freeblks);
> > > +		resblks = (resblks * 15) >> 4;
> > 
> > Can overflow when freeblks > (UINT_MAX / 15).
> 
> D'oh.  Ugh, I hate this whole fugly hack.
> 
> TBH I've been thinking that perhaps the freezer function should be
> capturing the unused transaction block reservation when we capture the
> dfops chain from the transaction.

Exactly what problem is this hack supposed to avoid? having the
filesystem ENOSPC before all the deferops have been completed?

if so, can that even happen? Because the fact that the intents are
in the log means that when they were started there was enough space
in the fs for them to run, so ENOSPC should not be an issue, right?

> When we set up the second transaction, we then set t_blk_res to the
> captured block reservation.  So long as the recovery function is smart
> enough to set up sufficient reservation we should avoid hitting ENOSPC,
> right?

I'm not sure ENOSPC is really a problem for recovery of deferred ops
given the above...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-05-05  5:10       ` Dave Chinner
@ 2020-05-05 15:08         ` Darrick J. Wong
  0 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2020-05-05 15:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, May 05, 2020 at 03:10:29PM +1000, Dave Chinner wrote:
> On Mon, May 04, 2020 at 08:06:51PM -0700, Darrick J. Wong wrote:
> > On Tue, May 05, 2020 at 12:33:05PM +1000, Dave Chinner wrote:
> > > On Mon, May 04, 2020 at 06:13:39PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > When we replay unfinished intent items that have been recovered from the
> > > > log, it's possible that the replay will cause the creation of more
> > > > deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
> > > > recovery should replay deferred ops in order"), later work items have an
> > > > implicit ordering dependency on earlier work items.  Therefore, recovery
> > > > must replay the items (both recovered and created) in the same order
> > > > that they would have been during normal operation.
> > > > 
> > > > For log recovery, we enforce this ordering by using an empty transaction
> > > > to collect deferred ops that get created in the process of recovering a
> > > > log intent item to prevent them from being committed before the rest of
> > > > the recovered intent items.  After we finish committing all the
> > > > recovered log items, we allocate a transaction with an enormous block
> > > > reservation, splice our huge list of created deferred ops into that
> > > > transaction, and commit it, thereby finishing all those ops.
> > > > 
> > > > This is /really/ hokey -- it's the one place in XFS where we allow
> > > > nested transactions; the splicing of the defer ops list is is inelegant
> > > > and has to be done twice per recovery function; and the broken way we
> > > > handle inode pointers and block reservations cause subtle use-after-free
> > > > and allocator problems that will be fixed by this patch and the two
> > > > patches after it.
> > > > 
> > > > Therefore, replace the hokey empty transaction with a structure designed
> > > > to capture each chain of deferred ops that are created as part of
> > > > recovering a single unfinished log intent.  Finally, refactor the loop
> > > > that replays those chains to do so using one transaction per chain.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > FWIW, I don't like the "freezer" based naming here. It's too easily
> > > confused with freezing and thawing the filesystem....
> > > 
> > > I know, "delayed deferred ops" isn't much better, but at least it
> > > won't get confused with existing unrelated functionality.
> > 
> > xfs_defer_{freeze,thaw} -> xfs_defer_{capture,relink} ?
> 
> Yeah, capture seems appropriate, maybe relink -> continue? i.e.
> capture the remaining defer_ops to be run, then continue running
> them?

I think I like capture/continue.

> /me shrugs and thinks "naming is hard"....

:)

> > > I've barely looked at the code, so no real comments on that yet,
> > > but I did notice this:
> > > 
> > > > @@ -2495,35 +2515,59 @@ xlog_recover_process_data(
> > > >  /* Take all the collected deferred ops and finish them in order. */
> > > >  static int
> > > >  xlog_finish_defer_ops(
> > > > -	struct xfs_trans	*parent_tp)
> > > > +	struct xfs_mount	*mp,
> > > > +	struct list_head	*dfops_freezers)
> > > >  {
> > > > -	struct xfs_mount	*mp = parent_tp->t_mountp;
> > > > +	struct xfs_defer_freezer *dff, *next;
> > > >  	struct xfs_trans	*tp;
> > > >  	int64_t			freeblks;
> > > >  	uint			resblks;
> > > ....
> > > > +		resblks = min_t(int64_t, UINT_MAX, freeblks);
> > > > +		resblks = (resblks * 15) >> 4;
> > > 
> > > Can overflow when freeblks > (UINT_MAX / 15).
> > 
> > D'oh.  Ugh, I hate this whole fugly hack.
> > 
> > TBH I've been thinking that perhaps the freezer function should be
> > capturing the unused transaction block reservation when we capture the
> > dfops chain from the transaction.
> 
> Exactly what problem is this hack supposed to avoid? having the
> filesystem ENOSPC before all the deferops have been completed?
>
> if so, can that even happen? Because the fact that the intents are
> in the log means that when they were started there was enough space
> in the fs for them to run, so ENOSPC should not be an issue, right?

We're probably not going to run out of space, seeing as we had enough
space to run the first time.  However, there's various parts of the
filesystem that either behave differently or ENOSPC early if the
transaction has no block reservation, so we need to avoid them.

The outcome of the recovery work ought to be as close as possible to
what would have happened if the fs hadn't gone down.

> > When we set up the second transaction, we then set t_blk_res to the
> > captured block reservation.  So long as the recovery function is smart
> > enough to set up sufficient reservation we should avoid hitting ENOSPC,
> > right?
> 
> I'm not sure ENOSPC is really a problem for recovery of deferred ops
> given the above...

<nod>

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 2/3] xfs: reduce log recovery transaction block reservations
  2020-05-05  1:13 [PATCH v2 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong
  2020-05-05  1:13 ` [PATCH 1/3] xfs: proper replay of deferred ops queued " Darrick J. Wong
@ 2020-05-05  1:13 ` Darrick J. Wong
  2020-05-05  1:13 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong
  2 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2020-05-05  1:13 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

On filesystems that support them, bmap intent log items can be used to
change mappings in inode data or attr forks.  However, if the bmbt must
expand, the enormous block reservations that we make for finishing
chains of deferred log items become a liability because the bmbt block
allocator sets minleft to the transaction reservation and there probably
aren't any AGs in the filesystem that have that much free space.

Whereas previously we would reserve 93% of the free blocks in the
filesystem, now we only want to reserve 7/8ths of the free space in the
least full AG, and no more than half of the usable blocks in an AG.  In
theory we shouldn't run out of space because (prior to the unclean
shutdown) all of the in-progress transactions successfully reserved the
worst case number of disk blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_defer.c |    6 +---
 fs/xfs/libxfs/xfs_defer.h |    2 +
 fs/xfs/xfs_log_recover.c  |   70 +++++++++++++++++++++++++++++++++------------
 3 files changed, 54 insertions(+), 24 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index bbbd0141d4a6..ea4d28851bbd 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -577,8 +577,8 @@ xfs_defer_freeze(
 	return 0;
 }
 
-/* Thaw a chain of deferred ops that are attached to a transaction. */
-int
+/* Thaw a chain of frozen deferred ops by attaching them to the transaction. */
+void
 xfs_defer_thaw(
 	struct xfs_defer_freezer	*dff,
 	struct xfs_trans		*tp)
@@ -588,8 +588,6 @@ xfs_defer_thaw(
 	/* Add the dfops items to the transaction. */
 	list_splice_init(&dff->dff_dfops, &tp->t_dfops);
 	tp->t_flags |= dff->dff_tpflags;
-
-	return 0;
 }
 
 /* Release a deferred op freezer and all resources associated with it. */
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 4789bf53dd96..7ae05e10d750 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -80,7 +80,7 @@ struct xfs_defer_freezer {
 
 /* Functions to freeze a chain of deferred operations for later. */
 int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp);
-int xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp);
+void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp);
 void xfs_defer_freeezer_finish(struct xfs_mount *mp,
 		struct xfs_defer_freezer *dff);
 
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 908bfb284a9a..50b9d70e502f 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2512,6 +2512,44 @@ xlog_recover_process_data(
 	return 0;
 }
 
+/*
+ * Estimate a block reservation for a log recovery transaction.  Since we run
+ * separate transactions for each chain of deferred ops that get created as a
+ * result of recovering unfinished log intent items, we must be careful not to
+ * reserve so many blocks that block allocations fail because we can't satisfy
+ * the minleft requirements (e.g. for bmbt blocks).
+ */
+static int
+xlog_estimate_recovery_resblks(
+	struct xfs_mount	*mp,
+	unsigned int		*resblks)
+{
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+	unsigned int		free = 0;
+	int			error;
+
+	/* Don't use more than 7/8th of the free space in the least full AG. */
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+		unsigned int	ag_free;
+
+		error = xfs_alloc_pagf_init(mp, NULL, agno, 0);
+		if (error)
+			return error;
+		pag = xfs_perag_get(mp, agno);
+		ag_free = pag->pagf_freeblks + pag->pagf_flcount;
+		free = max(free, (ag_free * 7) / 8);
+		xfs_perag_put(pag);
+	}
+
+	/* Don't try to reserve more than half the usable AG blocks. */
+	*resblks = min(free, xfs_alloc_ag_max_usable(mp) / 2);
+	if (*resblks == 0)
+		return -ENOSPC;
+
+	return 0;
+}
+
 /* Take all the collected deferred ops and finish them in order. */
 static int
 xlog_finish_defer_ops(
@@ -2520,27 +2558,25 @@ xlog_finish_defer_ops(
 {
 	struct xfs_defer_freezer *dff, *next;
 	struct xfs_trans	*tp;
-	int64_t			freeblks;
-	uint			resblks;
+	unsigned int		max_resblks;
 	int			error = 0;
 
+	error = xlog_estimate_recovery_resblks(mp, &max_resblks);
+	if (error)
+		goto out;
+
 	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
+		unsigned int	resblks;
+
+		resblks = min_t(int64_t, percpu_counter_sum(&mp->m_fdblocks),
+				max_resblks);
+
 		/*
 		 * We're finishing the defer_ops that accumulated as a result
 		 * of recovering unfinished intent items during log recovery.
 		 * We reserve an itruncate transaction because it is the
-		 * largest permanent transaction type.  Since we're the only
-		 * user of the fs right now, take 93% (15/16) of the available
-		 * free blocks.  Use weird math to avoid a 64-bit division.
+		 * largest permanent transaction type.
 		 */
-		freeblks = percpu_counter_sum(&mp->m_fdblocks);
-		if (freeblks <= 0) {
-			error = -ENOSPC;
-			break;
-		}
-
-		resblks = min_t(int64_t, UINT_MAX, freeblks);
-		resblks = (resblks * 15) >> 4;
 		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
 				0, XFS_TRANS_RESERVE, &tp);
 		if (error)
@@ -2548,12 +2584,7 @@ xlog_finish_defer_ops(
 
 		/* transfer all collected dfops to this transaction */
 		list_del_init(&dff->dff_list);
-		error = xfs_defer_thaw(dff, tp);
-		if (error) {
-			xfs_trans_cancel(tp);
-			xfs_defer_freeezer_finish(mp, dff);
-			break;
-		}
+		xfs_defer_thaw(dff, tp);
 
 		error = xfs_trans_commit(tp);
 		xfs_defer_freeezer_finish(mp, dff);
@@ -2561,6 +2592,7 @@ xlog_finish_defer_ops(
 			break;
 	}
 
+out:
 	/* Kill any remaining freezers. */
 	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
 		list_del_init(&dff->dff_list);


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover
  2020-05-05  1:13 [PATCH v2 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong
  2020-05-05  1:13 ` [PATCH 1/3] xfs: proper replay of deferred ops queued " Darrick J. Wong
  2020-05-05  1:13 ` [PATCH 2/3] xfs: reduce log recovery transaction block reservations Darrick J. Wong
@ 2020-05-05  1:13 ` Darrick J. Wong
  2020-05-05 14:11   ` Brian Foster
  2 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2020-05-05  1:13 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

In xfs_bui_item_recover, there exists a use-after-free bug with regards
to the inode that is involved in the bmap replay operation.  If the
mapping operation does not complete, we call xfs_bmap_unmap_extent to
create a deferred op to finish the unmapping work, and we retain a
pointer to the incore inode.

Unfortunately, the very next thing we do is commit the transaction and
drop the inode.  If reclaim tears down the inode before we try to finish
the defer ops, we dereference garbage and blow up.  Therefore, create a
way to join inodes to the defer ops freezer so that we can maintain the
xfs_inode reference until we're done with the inode.

Note: This imposes the requirement that there be enough memory to keep
every incore inode in memory throughout recovery.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_defer.c |   50 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_defer.h |   10 +++++++++
 fs/xfs/xfs_bmap_item.c    |    7 ++++--
 fs/xfs/xfs_icache.c       |   19 +++++++++++++++++
 4 files changed, 83 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index ea4d28851bbd..72933fdafcb2 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -16,6 +16,7 @@
 #include "xfs_inode.h"
 #include "xfs_inode_item.h"
 #include "xfs_trace.h"
+#include "xfs_icache.h"
 
 /*
  * Deferred Operations in XFS
@@ -583,8 +584,19 @@ xfs_defer_thaw(
 	struct xfs_defer_freezer	*dff,
 	struct xfs_trans		*tp)
 {
+	int				i;
+
 	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
 
+	/* Re-acquire the inode locks. */
+	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
+		if (!dff->dff_inodes[i])
+			break;
+
+		dff->dff_ilocks[i] = XFS_ILOCK_EXCL;
+		xfs_ilock(dff->dff_inodes[i], dff->dff_ilocks[i]);
+	}
+
 	/* Add the dfops items to the transaction. */
 	list_splice_init(&dff->dff_dfops, &tp->t_dfops);
 	tp->t_flags |= dff->dff_tpflags;
@@ -597,5 +609,43 @@ xfs_defer_freeezer_finish(
 	struct xfs_defer_freezer	*dff)
 {
 	xfs_defer_cancel_list(mp, &dff->dff_dfops);
+	xfs_defer_freezer_irele(dff);
 	kmem_free(dff);
 }
+
+/*
+ * Attach an inode to this deferred ops freezer.  Callers must hold ILOCK_EXCL,
+ * which will be dropped and reacquired when we're ready to thaw the frozen
+ * deferred ops.
+ */
+int
+xfs_defer_freezer_ijoin(
+	struct xfs_defer_freezer	*dff,
+	struct xfs_inode		*ip)
+{
+	unsigned int			i;
+
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+
+	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
+		if (dff->dff_inodes[i] == ip)
+			goto out;
+		if (dff->dff_inodes[i] == NULL)
+			break;
+	}
+
+	if (i == XFS_DEFER_FREEZER_INODES) {
+		ASSERT(0);
+		return -EFSCORRUPTED;
+	}
+
+	/*
+	 * Attach this inode to the freezer and drop its ILOCK because we
+	 * assume the caller will need to allocate a transaction.
+	 */
+	dff->dff_inodes[i] = ip;
+	dff->dff_ilocks[i] = 0;
+out:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return 0;
+}
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 7ae05e10d750..0052a0313283 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -76,6 +76,11 @@ struct xfs_defer_freezer {
 	/* Deferred ops state saved from the transaction. */
 	struct list_head	dff_dfops;
 	unsigned int		dff_tpflags;
+
+	/* Inodes to hold when we want to finish the deferred work items. */
+#define XFS_DEFER_FREEZER_INODES	2
+	unsigned int		dff_ilocks[XFS_DEFER_FREEZER_INODES];
+	struct xfs_inode	*dff_inodes[XFS_DEFER_FREEZER_INODES];
 };
 
 /* Functions to freeze a chain of deferred operations for later. */
@@ -83,5 +88,10 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp);
 void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp);
 void xfs_defer_freeezer_finish(struct xfs_mount *mp,
 		struct xfs_defer_freezer *dff);
+int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff,
+		struct xfs_inode *ip);
+
+/* These functions must be provided by the xfs implementation. */
+void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff);
 
 #endif /* __XFS_DEFER_H__ */
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index c733bdeeeb9b..bbce191d8fcd 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -530,12 +530,13 @@ xfs_bui_item_recover(
 	}
 
 	error = xlog_recover_trans_commit(tp, dffp);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	xfs_irele(ip);
-	return error;
+	if (error)
+		goto err_rele;
+	return xfs_defer_freezer_ijoin(*dffp, ip);
 
 err_inode:
 	xfs_trans_cancel(tp);
+err_rele:
 	if (ip) {
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
 		xfs_irele(ip);
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 17a0b86fe701..b96ddf5ff334 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -12,6 +12,7 @@
 #include "xfs_sb.h"
 #include "xfs_mount.h"
 #include "xfs_inode.h"
+#include "xfs_defer.h"
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
 #include "xfs_inode_item.h"
@@ -1847,3 +1848,21 @@ xfs_start_block_reaping(
 	xfs_queue_eofblocks(mp);
 	xfs_queue_cowblocks(mp);
 }
+
+/* Release all the inode resources attached to this freezer. */
+void
+xfs_defer_freezer_irele(
+	struct xfs_defer_freezer	*dff)
+{
+	unsigned int			i;
+
+	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
+		if (!dff->dff_inodes[i])
+			break;
+
+		if (dff->dff_ilocks[i])
+			xfs_iunlock(dff->dff_inodes[i], dff->dff_ilocks[i]);
+		xfs_irele(dff->dff_inodes[i]);
+		dff->dff_inodes[i] = NULL;
+	}
+}


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover
  2020-05-05  1:13 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong
@ 2020-05-05 14:11   ` Brian Foster
  2020-05-06  0:34     ` Darrick J. Wong
  0 siblings, 1 reply; 19+ messages in thread
From: Brian Foster @ 2020-05-05 14:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, May 04, 2020 at 06:13:53PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> In xfs_bui_item_recover, there exists a use-after-free bug with regards
> to the inode that is involved in the bmap replay operation.  If the
> mapping operation does not complete, we call xfs_bmap_unmap_extent to
> create a deferred op to finish the unmapping work, and we retain a
> pointer to the incore inode.
> 
> Unfortunately, the very next thing we do is commit the transaction and
> drop the inode.  If reclaim tears down the inode before we try to finish
> the defer ops, we dereference garbage and blow up.  Therefore, create a
> way to join inodes to the defer ops freezer so that we can maintain the
> xfs_inode reference until we're done with the inode.
> 
> Note: This imposes the requirement that there be enough memory to keep
> every incore inode in memory throughout recovery.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Maybe I'm missing something, but I thought the discussion on the
previous version[1] landed on an approach where the intent would hold a
reference to the inode. Wouldn't that break the dependency on the dfops
freeze/thaw mechanism?

Brian

[1] https://lore.kernel.org/linux-xfs/20200429235818.GX6742@magnolia/

>  fs/xfs/libxfs/xfs_defer.c |   50 +++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_defer.h |   10 +++++++++
>  fs/xfs/xfs_bmap_item.c    |    7 ++++--
>  fs/xfs/xfs_icache.c       |   19 +++++++++++++++++
>  4 files changed, 83 insertions(+), 3 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> index ea4d28851bbd..72933fdafcb2 100644
> --- a/fs/xfs/libxfs/xfs_defer.c
> +++ b/fs/xfs/libxfs/xfs_defer.c
> @@ -16,6 +16,7 @@
>  #include "xfs_inode.h"
>  #include "xfs_inode_item.h"
>  #include "xfs_trace.h"
> +#include "xfs_icache.h"
>  
>  /*
>   * Deferred Operations in XFS
> @@ -583,8 +584,19 @@ xfs_defer_thaw(
>  	struct xfs_defer_freezer	*dff,
>  	struct xfs_trans		*tp)
>  {
> +	int				i;
> +
>  	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
>  
> +	/* Re-acquire the inode locks. */
> +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> +		if (!dff->dff_inodes[i])
> +			break;
> +
> +		dff->dff_ilocks[i] = XFS_ILOCK_EXCL;
> +		xfs_ilock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> +	}
> +
>  	/* Add the dfops items to the transaction. */
>  	list_splice_init(&dff->dff_dfops, &tp->t_dfops);
>  	tp->t_flags |= dff->dff_tpflags;
> @@ -597,5 +609,43 @@ xfs_defer_freeezer_finish(
>  	struct xfs_defer_freezer	*dff)
>  {
>  	xfs_defer_cancel_list(mp, &dff->dff_dfops);
> +	xfs_defer_freezer_irele(dff);
>  	kmem_free(dff);
>  }
> +
> +/*
> + * Attach an inode to this deferred ops freezer.  Callers must hold ILOCK_EXCL,
> + * which will be dropped and reacquired when we're ready to thaw the frozen
> + * deferred ops.
> + */
> +int
> +xfs_defer_freezer_ijoin(
> +	struct xfs_defer_freezer	*dff,
> +	struct xfs_inode		*ip)
> +{
> +	unsigned int			i;
> +
> +	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> +
> +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> +		if (dff->dff_inodes[i] == ip)
> +			goto out;
> +		if (dff->dff_inodes[i] == NULL)
> +			break;
> +	}
> +
> +	if (i == XFS_DEFER_FREEZER_INODES) {
> +		ASSERT(0);
> +		return -EFSCORRUPTED;
> +	}
> +
> +	/*
> +	 * Attach this inode to the freezer and drop its ILOCK because we
> +	 * assume the caller will need to allocate a transaction.
> +	 */
> +	dff->dff_inodes[i] = ip;
> +	dff->dff_ilocks[i] = 0;
> +out:
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	return 0;
> +}
> diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> index 7ae05e10d750..0052a0313283 100644
> --- a/fs/xfs/libxfs/xfs_defer.h
> +++ b/fs/xfs/libxfs/xfs_defer.h
> @@ -76,6 +76,11 @@ struct xfs_defer_freezer {
>  	/* Deferred ops state saved from the transaction. */
>  	struct list_head	dff_dfops;
>  	unsigned int		dff_tpflags;
> +
> +	/* Inodes to hold when we want to finish the deferred work items. */
> +#define XFS_DEFER_FREEZER_INODES	2
> +	unsigned int		dff_ilocks[XFS_DEFER_FREEZER_INODES];
> +	struct xfs_inode	*dff_inodes[XFS_DEFER_FREEZER_INODES];
>  };
>  
>  /* Functions to freeze a chain of deferred operations for later. */
> @@ -83,5 +88,10 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp);
>  void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp);
>  void xfs_defer_freeezer_finish(struct xfs_mount *mp,
>  		struct xfs_defer_freezer *dff);
> +int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff,
> +		struct xfs_inode *ip);
> +
> +/* These functions must be provided by the xfs implementation. */
> +void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff);
>  
>  #endif /* __XFS_DEFER_H__ */
> diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> index c733bdeeeb9b..bbce191d8fcd 100644
> --- a/fs/xfs/xfs_bmap_item.c
> +++ b/fs/xfs/xfs_bmap_item.c
> @@ -530,12 +530,13 @@ xfs_bui_item_recover(
>  	}
>  
>  	error = xlog_recover_trans_commit(tp, dffp);
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -	xfs_irele(ip);
> -	return error;
> +	if (error)
> +		goto err_rele;
> +	return xfs_defer_freezer_ijoin(*dffp, ip);
>  
>  err_inode:
>  	xfs_trans_cancel(tp);
> +err_rele:
>  	if (ip) {
>  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  		xfs_irele(ip);
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 17a0b86fe701..b96ddf5ff334 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -12,6 +12,7 @@
>  #include "xfs_sb.h"
>  #include "xfs_mount.h"
>  #include "xfs_inode.h"
> +#include "xfs_defer.h"
>  #include "xfs_trans.h"
>  #include "xfs_trans_priv.h"
>  #include "xfs_inode_item.h"
> @@ -1847,3 +1848,21 @@ xfs_start_block_reaping(
>  	xfs_queue_eofblocks(mp);
>  	xfs_queue_cowblocks(mp);
>  }
> +
> +/* Release all the inode resources attached to this freezer. */
> +void
> +xfs_defer_freezer_irele(
> +	struct xfs_defer_freezer	*dff)
> +{
> +	unsigned int			i;
> +
> +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> +		if (!dff->dff_inodes[i])
> +			break;
> +
> +		if (dff->dff_ilocks[i])
> +			xfs_iunlock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> +		xfs_irele(dff->dff_inodes[i]);
> +		dff->dff_inodes[i] = NULL;
> +	}
> +}
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover
  2020-05-05 14:11   ` Brian Foster
@ 2020-05-06  0:34     ` Darrick J. Wong
  2020-05-06 13:56       ` Brian Foster
  0 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2020-05-06  0:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, May 05, 2020 at 10:11:38AM -0400, Brian Foster wrote:
> On Mon, May 04, 2020 at 06:13:53PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > In xfs_bui_item_recover, there exists a use-after-free bug with regards
> > to the inode that is involved in the bmap replay operation.  If the
> > mapping operation does not complete, we call xfs_bmap_unmap_extent to
> > create a deferred op to finish the unmapping work, and we retain a
> > pointer to the incore inode.
> > 
> > Unfortunately, the very next thing we do is commit the transaction and
> > drop the inode.  If reclaim tears down the inode before we try to finish
> > the defer ops, we dereference garbage and blow up.  Therefore, create a
> > way to join inodes to the defer ops freezer so that we can maintain the
> > xfs_inode reference until we're done with the inode.
> > 
> > Note: This imposes the requirement that there be enough memory to keep
> > every incore inode in memory throughout recovery.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> Maybe I'm missing something, but I thought the discussion on the
> previous version[1] landed on an approach where the intent would hold a
> reference to the inode. Wouldn't that break the dependency on the dfops
> freeze/thaw mechanism?

We did, but as I started pondering how to make this work in practice,
I soon realized that it's not as simple as "don't irele the inode":

(Part 1)

It would be easy enough to add a flag to struct xfs_bmap_intent that
means "when you're done, unlock and irele the inode".  You'd have to add
this capability to any log item that deals with inodes, but only
deferred bmap items need this.

Log recovery looks like this:

----------------
Transaction 0
BUI1 unmap file1, offset1, startblock1, length == 200000
----------------
<end>

What happens if the first bunmapi call doesn't actually unmap
everything?  We'll queue yet another bmap intent item (call it BUI2) to
finish the work.  This means that we can't just drop the inode when
we're done processing BUI1; we have to propagate the "unlock and irele"
flag to BUI2.  Ok, so now there's this chaining behavior that wasn't
there before.

Then I thought about it some more, and wondered what would happen if the
log contained two unfinished bmap items for the same inode?  If this
happened, you'd deadlock because you've not released the ILOCK for the
first unfinished bmap item when you want to acquire the ILOCK for the
second unfinished bmap item.

You'd have to drop the ILOCK (but retain the inode reference) between
the two bmap items.  I think this is currently impossible because there
aren't any transactions that queue multiple bmap intents per
transaction, nobody else holds the ilock, and the existing rmap
implementation of the swapext ioctl doesn't allow swapping a file with
itself.  However...

(Part 2)

The second thing I thought about is, what if there's a lot of work after
transaction 0?  Say the recovery stream looks like:

----------------
Transaction 0
BUI1 unmap file1, offset1, startblock1, length == 200000
----------------
<10000 more intents applied to other things>
----------------
<end>

This is a problem, because we must drop the ILOCK after completing the
BUI1 work so that we can process the 10000 more intents before we can
get to BUI2.  We cannot let the tail of the log get pinned on the locked
inode, so we must release the ILOCK after we're done with BUI1 and
re-acquire it when we're ready to deal with BUI2.

This means I need /two/ flags in struct xfs_bmap_intent -- one to say
that we need to irele the inode at the end of processing, and another to
say that the bmap item needs to grab the ILOCK.  If there's going to be
a BUI2 as a result of recovering BUI1, the first flag must be propagated
to BUI2, but the second flag must /not/ be propagated.

A potential counterargument is that we crammed all 10000 intents into
the log without pinning the log, so maybe we can hold the ILOCK.
However....

(Part 3)

Next, I considered how the swapext log item in the atomic file update
patchset would interact with log recovery.  A couple of notes about the
swapext log items:

1. They are careful not to create the kinds of bunmap operations that
would cause the creation of /more/ BUIs.

2. Every time we log one extent's worth of unmap/remap operations (using
deferred bmap log items, naturally) we log an done item for the original
swapext intent item and immediately log another swapext intent for the
work remaining.

I came up with the following sequence of transactions to recover:

----------------
Transaction 0
SXI0 file1, offset1, file2, offset2, length == 10
----------------
Transaction 1
BUI1 unmap file1, offset1, startblock1, length == 2
BUI2 unmap file2, offset2, startblock2, length == 2
BUI3 map file1, offset1, startblock2, length == 2
BUI4 map file2, offset2, startblock1, length == 2
SXD done (SXI0)
SXI5 file1, offset1 + 2, file2, offset2 + 2, length == 8
---------------
<end of log>

Recovery in this case will end up with BUI[1-4] and SXI5 that still need
processing.  Each of the 5 intent items gets its own recovery
transaction and dfops freezer chain.  BUI1, BUI3, and SXI5 will each
require ilock'd access to file1, which means that each of them will have
to have the ability to drop the ILOCK but retain the reference.  The
same argument applies to file2 and BUI2, BUI4, and SXI5.  Note also that
in our brave new world, file1 and file2 can be the same.

For the swapext log item type this inode management is particularly
acute because we're certain to have new SXIs created as a result of
recovering an SXI.

(End)

In conclusion, we need a mechanism to drop both the inode lock and the
inode reference.  Each log item type that works with inodes will have to
set aside 2 flags somewhere in the incore extent structure, and provide
more or less the same code to manage that state.  Right now that's just
the bmap items, but then the swapext items and the deferred xattr items
will have that same requirement when they get added.

If we do that, the inode reference and ilock management diverges even
further from regular operations.  Regular callers are expected to iget
and ilock the inode and maintain that all the way to the end of
xfs_trans_commit.  Log recovery, on the other hand, will add a bunch of
state handling to some of the log items so that we can drop the ILOCK
after the first item, and then drop the inode reference after finishing
the last possible user of that inode in the revived dfops chain.

On the other hand, keeping the inode management in the defer freezer
code means that I keep all that complexity and state management out of
the ->finish_item code.  When we go to finish the new incore intents
that get created during recovery, the inode reference and ILOCK rules
are the same as anywhere else -- the caller is required to grab the
inode, the transaction, and the ilock (in that order).

To the extent that recovering log intent items is /not/ the same as
running a normal transaction, all that weirdness is concentrated in
xfs_log_recover.c and a single function in xfs_defer.c.  The desired
behavior is the same across all the log intent item types, so I
decided in favor of keeping the centralised behavior and (trying to)
contain the wacky.  We don't need all that right away, but we will.

Note that I did make it so that we retain the reference always, so
there's no longer a need for irele/igrab cycles, which makes capturing
the dfops chain less error prone.

--D

> Brian
> 
> [1] https://lore.kernel.org/linux-xfs/20200429235818.GX6742@magnolia/
> 
> >  fs/xfs/libxfs/xfs_defer.c |   50 +++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_defer.h |   10 +++++++++
> >  fs/xfs/xfs_bmap_item.c    |    7 ++++--
> >  fs/xfs/xfs_icache.c       |   19 +++++++++++++++++
> >  4 files changed, 83 insertions(+), 3 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > index ea4d28851bbd..72933fdafcb2 100644
> > --- a/fs/xfs/libxfs/xfs_defer.c
> > +++ b/fs/xfs/libxfs/xfs_defer.c
> > @@ -16,6 +16,7 @@
> >  #include "xfs_inode.h"
> >  #include "xfs_inode_item.h"
> >  #include "xfs_trace.h"
> > +#include "xfs_icache.h"
> >  
> >  /*
> >   * Deferred Operations in XFS
> > @@ -583,8 +584,19 @@ xfs_defer_thaw(
> >  	struct xfs_defer_freezer	*dff,
> >  	struct xfs_trans		*tp)
> >  {
> > +	int				i;
> > +
> >  	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> >  
> > +	/* Re-acquire the inode locks. */
> > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > +		if (!dff->dff_inodes[i])
> > +			break;
> > +
> > +		dff->dff_ilocks[i] = XFS_ILOCK_EXCL;
> > +		xfs_ilock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> > +	}
> > +
> >  	/* Add the dfops items to the transaction. */
> >  	list_splice_init(&dff->dff_dfops, &tp->t_dfops);
> >  	tp->t_flags |= dff->dff_tpflags;
> > @@ -597,5 +609,43 @@ xfs_defer_freeezer_finish(
> >  	struct xfs_defer_freezer	*dff)
> >  {
> >  	xfs_defer_cancel_list(mp, &dff->dff_dfops);
> > +	xfs_defer_freezer_irele(dff);
> >  	kmem_free(dff);
> >  }
> > +
> > +/*
> > + * Attach an inode to this deferred ops freezer.  Callers must hold ILOCK_EXCL,
> > + * which will be dropped and reacquired when we're ready to thaw the frozen
> > + * deferred ops.
> > + */
> > +int
> > +xfs_defer_freezer_ijoin(
> > +	struct xfs_defer_freezer	*dff,
> > +	struct xfs_inode		*ip)
> > +{
> > +	unsigned int			i;
> > +
> > +	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> > +
> > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > +		if (dff->dff_inodes[i] == ip)
> > +			goto out;
> > +		if (dff->dff_inodes[i] == NULL)
> > +			break;
> > +	}
> > +
> > +	if (i == XFS_DEFER_FREEZER_INODES) {
> > +		ASSERT(0);
> > +		return -EFSCORRUPTED;
> > +	}
> > +
> > +	/*
> > +	 * Attach this inode to the freezer and drop its ILOCK because we
> > +	 * assume the caller will need to allocate a transaction.
> > +	 */
> > +	dff->dff_inodes[i] = ip;
> > +	dff->dff_ilocks[i] = 0;
> > +out:
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > +	return 0;
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> > index 7ae05e10d750..0052a0313283 100644
> > --- a/fs/xfs/libxfs/xfs_defer.h
> > +++ b/fs/xfs/libxfs/xfs_defer.h
> > @@ -76,6 +76,11 @@ struct xfs_defer_freezer {
> >  	/* Deferred ops state saved from the transaction. */
> >  	struct list_head	dff_dfops;
> >  	unsigned int		dff_tpflags;
> > +
> > +	/* Inodes to hold when we want to finish the deferred work items. */
> > +#define XFS_DEFER_FREEZER_INODES	2
> > +	unsigned int		dff_ilocks[XFS_DEFER_FREEZER_INODES];
> > +	struct xfs_inode	*dff_inodes[XFS_DEFER_FREEZER_INODES];
> >  };
> >  
> >  /* Functions to freeze a chain of deferred operations for later. */
> > @@ -83,5 +88,10 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp);
> >  void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp);
> >  void xfs_defer_freeezer_finish(struct xfs_mount *mp,
> >  		struct xfs_defer_freezer *dff);
> > +int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff,
> > +		struct xfs_inode *ip);
> > +
> > +/* These functions must be provided by the xfs implementation. */
> > +void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff);
> >  
> >  #endif /* __XFS_DEFER_H__ */
> > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> > index c733bdeeeb9b..bbce191d8fcd 100644
> > --- a/fs/xfs/xfs_bmap_item.c
> > +++ b/fs/xfs/xfs_bmap_item.c
> > @@ -530,12 +530,13 @@ xfs_bui_item_recover(
> >  	}
> >  
> >  	error = xlog_recover_trans_commit(tp, dffp);
> > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > -	xfs_irele(ip);
> > -	return error;
> > +	if (error)
> > +		goto err_rele;
> > +	return xfs_defer_freezer_ijoin(*dffp, ip);
> >  
> >  err_inode:
> >  	xfs_trans_cancel(tp);
> > +err_rele:
> >  	if (ip) {
> >  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >  		xfs_irele(ip);
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 17a0b86fe701..b96ddf5ff334 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -12,6 +12,7 @@
> >  #include "xfs_sb.h"
> >  #include "xfs_mount.h"
> >  #include "xfs_inode.h"
> > +#include "xfs_defer.h"
> >  #include "xfs_trans.h"
> >  #include "xfs_trans_priv.h"
> >  #include "xfs_inode_item.h"
> > @@ -1847,3 +1848,21 @@ xfs_start_block_reaping(
> >  	xfs_queue_eofblocks(mp);
> >  	xfs_queue_cowblocks(mp);
> >  }
> > +
> > +/* Release all the inode resources attached to this freezer. */
> > +void
> > +xfs_defer_freezer_irele(
> > +	struct xfs_defer_freezer	*dff)
> > +{
> > +	unsigned int			i;
> > +
> > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > +		if (!dff->dff_inodes[i])
> > +			break;
> > +
> > +		if (dff->dff_ilocks[i])
> > +			xfs_iunlock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> > +		xfs_irele(dff->dff_inodes[i]);
> > +		dff->dff_inodes[i] = NULL;
> > +	}
> > +}
> > 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover
  2020-05-06  0:34     ` Darrick J. Wong
@ 2020-05-06 13:56       ` Brian Foster
  2020-05-06 17:01         ` Darrick J. Wong
  0 siblings, 1 reply; 19+ messages in thread
From: Brian Foster @ 2020-05-06 13:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, May 05, 2020 at 05:34:23PM -0700, Darrick J. Wong wrote:
> On Tue, May 05, 2020 at 10:11:38AM -0400, Brian Foster wrote:
> > On Mon, May 04, 2020 at 06:13:53PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > In xfs_bui_item_recover, there exists a use-after-free bug with regards
> > > to the inode that is involved in the bmap replay operation.  If the
> > > mapping operation does not complete, we call xfs_bmap_unmap_extent to
> > > create a deferred op to finish the unmapping work, and we retain a
> > > pointer to the incore inode.
> > > 
> > > Unfortunately, the very next thing we do is commit the transaction and
> > > drop the inode.  If reclaim tears down the inode before we try to finish
> > > the defer ops, we dereference garbage and blow up.  Therefore, create a
> > > way to join inodes to the defer ops freezer so that we can maintain the
> > > xfs_inode reference until we're done with the inode.
> > > 
> > > Note: This imposes the requirement that there be enough memory to keep
> > > every incore inode in memory throughout recovery.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > 
> > Maybe I'm missing something, but I thought the discussion on the
> > previous version[1] landed on an approach where the intent would hold a
> > reference to the inode. Wouldn't that break the dependency on the dfops
> > freeze/thaw mechanism?
> 
> We did, but as I started pondering how to make this work in practice,
> I soon realized that it's not as simple as "don't irele the inode":
> 
> (Part 1)
> 
> It would be easy enough to add a flag to struct xfs_bmap_intent that
> means "when you're done, unlock and irele the inode".  You'd have to add
> this capability to any log item that deals with inodes, but only
> deferred bmap items need this.
> 
> Log recovery looks like this:
> 
> ----------------
> Transaction 0
> BUI1 unmap file1, offset1, startblock1, length == 200000
> ----------------
> <end>
> 
> What happens if the first bunmapi call doesn't actually unmap
> everything?  We'll queue yet another bmap intent item (call it BUI2) to
> finish the work.  This means that we can't just drop the inode when
> we're done processing BUI1; we have to propagate the "unlock and irele"
> flag to BUI2.  Ok, so now there's this chaining behavior that wasn't
> there before.
> 
> Then I thought about it some more, and wondered what would happen if the
> log contained two unfinished bmap items for the same inode?  If this
> happened, you'd deadlock because you've not released the ILOCK for the
> first unfinished bmap item when you want to acquire the ILOCK for the
> second unfinished bmap item.
> 

Hm, Ok.. so IIUC the lock is implicitly held across these map/unmap
dfops at runtime, but that's not necessarily the case during recovery
due to the dfops chaining behavior. To me, this raises the question of
why we need to reorder/split these particular intents during recovery if
the original operation would have completed under a single inode lock
cycle.

The comments in xfs_log_recover.c don't really explain it. From going
back to commit 509955823cc9c ("xfs: log recovery should replay deferred
ops in order"), it looks like the issue is that ordering gets screwed up
for operations that commit multiple intents per-transaction and
completion of those intents creates new child intents. At runtime, we'd
process the intents in the current transaction first and then get to the
child intents in the order they were created. If we didn't have the
recovery time dfops shuffling implemented in this patch, recovery would
process the first intent in the log, then all of the child intents of
that one before getting to the next intent in the log that presumably
would have been second at runtime. Am I following that correctly?

I think the rest makes sense from the perspective of not wanting to
plumb the conditional locking complexity through the dfops/intent
processing for every intent type that eventually needs it, as opposed to
containing in the log recovery code (via dfops magic). I need to think
about all that some more, but I'd like to make sure I understand why we
have this recovery requirement in the first place.

Brian

> You'd have to drop the ILOCK (but retain the inode reference) between
> the two bmap items.  I think this is currently impossible because there
> aren't any transactions that queue multiple bmap intents per
> transaction, nobody else holds the ilock, and the existing rmap
> implementation of the swapext ioctl doesn't allow swapping a file with
> itself.  However...
> 
> (Part 2)
> 
> The second thing I thought about is, what if there's a lot of work after
> transaction 0?  Say the recovery stream looks like:
> 
> ----------------
> Transaction 0
> BUI1 unmap file1, offset1, startblock1, length == 200000
> ----------------
> <10000 more intents applied to other things>
> ----------------
> <end>
> 
> This is a problem, because we must drop the ILOCK after completing the
> BUI1 work so that we can process the 10000 more intents before we can
> get to BUI2.  We cannot let the tail of the log get pinned on the locked
> inode, so we must release the ILOCK after we're done with BUI1 and
> re-acquire it when we're ready to deal with BUI2.
> 
> This means I need /two/ flags in struct xfs_bmap_intent -- one to say
> that we need to irele the inode at the end of processing, and another to
> say that the bmap item needs to grab the ILOCK.  If there's going to be
> a BUI2 as a result of recovering BUI1, the first flag must be propagated
> to BUI2, but the second flag must /not/ be propagated.
> 
> A potential counterargument is that we crammed all 10000 intents into
> the log without pinning the log, so maybe we can hold the ILOCK.
> However....
> 
> (Part 3)
> 
> Next, I considered how the swapext log item in the atomic file update
> patchset would interact with log recovery.  A couple of notes about the
> swapext log items:
> 
> 1. They are careful not to create the kinds of bunmap operations that
> would cause the creation of /more/ BUIs.
> 
> 2. Every time we log one extent's worth of unmap/remap operations (using
> deferred bmap log items, naturally) we log an done item for the original
> swapext intent item and immediately log another swapext intent for the
> work remaining.
> 
> I came up with the following sequence of transactions to recover:
> 
> ----------------
> Transaction 0
> SXI0 file1, offset1, file2, offset2, length == 10
> ----------------
> Transaction 1
> BUI1 unmap file1, offset1, startblock1, length == 2
> BUI2 unmap file2, offset2, startblock2, length == 2
> BUI3 map file1, offset1, startblock2, length == 2
> BUI4 map file2, offset2, startblock1, length == 2
> SXD done (SXI0)
> SXI5 file1, offset1 + 2, file2, offset2 + 2, length == 8
> ---------------
> <end of log>
> 
> Recovery in this case will end up with BUI[1-4] and SXI5 that still need
> processing.  Each of the 5 intent items gets its own recovery
> transaction and dfops freezer chain.  BUI1, BUI3, and SXI5 will each
> require ilock'd access to file1, which means that each of them will have
> to have the ability to drop the ILOCK but retain the reference.  The
> same argument applies to file2 and BUI2, BUI4, and SXI5.  Note also that
> in our brave new world, file1 and file2 can be the same.
> 
> For the swapext log item type this inode management is particularly
> acute because we're certain to have new SXIs created as a result of
> recovering an SXI.
> 
> (End)
> 
> In conclusion, we need a mechanism to drop both the inode lock and the
> inode reference.  Each log item type that works with inodes will have to
> set aside 2 flags somewhere in the incore extent structure, and provide
> more or less the same code to manage that state.  Right now that's just
> the bmap items, but then the swapext items and the deferred xattr items
> will have that same requirement when they get added.
> 
> If we do that, the inode reference and ilock management diverges even
> further from regular operations.  Regular callers are expected to iget
> and ilock the inode and maintain that all the way to the end of
> xfs_trans_commit.  Log recovery, on the other hand, will add a bunch of
> state handling to some of the log items so that we can drop the ILOCK
> after the first item, and then drop the inode reference after finishing
> the last possible user of that inode in the revived dfops chain.
> 
> On the other hand, keeping the inode management in the defer freezer
> code means that I keep all that complexity and state management out of
> the ->finish_item code.  When we go to finish the new incore intents
> that get created during recovery, the inode reference and ILOCK rules
> are the same as anywhere else -- the caller is required to grab the
> inode, the transaction, and the ilock (in that order).
> 
> To the extent that recovering log intent items is /not/ the same as
> running a normal transaction, all that weirdness is concentrated in
> xfs_log_recover.c and a single function in xfs_defer.c.  The desired
> behavior is the same across all the log intent item types, so I
> decided in favor of keeping the centralised behavior and (trying to)
> contain the wacky.  We don't need all that right away, but we will.
> 
> Note that I did make it so that we retain the reference always, so
> there's no longer a need for irele/igrab cycles, which makes capturing
> the dfops chain less error prone.
> 
> --D
> 
> > Brian
> > 
> > [1] https://lore.kernel.org/linux-xfs/20200429235818.GX6742@magnolia/
> > 
> > >  fs/xfs/libxfs/xfs_defer.c |   50 +++++++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/libxfs/xfs_defer.h |   10 +++++++++
> > >  fs/xfs/xfs_bmap_item.c    |    7 ++++--
> > >  fs/xfs/xfs_icache.c       |   19 +++++++++++++++++
> > >  4 files changed, 83 insertions(+), 3 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > > index ea4d28851bbd..72933fdafcb2 100644
> > > --- a/fs/xfs/libxfs/xfs_defer.c
> > > +++ b/fs/xfs/libxfs/xfs_defer.c
> > > @@ -16,6 +16,7 @@
> > >  #include "xfs_inode.h"
> > >  #include "xfs_inode_item.h"
> > >  #include "xfs_trace.h"
> > > +#include "xfs_icache.h"
> > >  
> > >  /*
> > >   * Deferred Operations in XFS
> > > @@ -583,8 +584,19 @@ xfs_defer_thaw(
> > >  	struct xfs_defer_freezer	*dff,
> > >  	struct xfs_trans		*tp)
> > >  {
> > > +	int				i;
> > > +
> > >  	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> > >  
> > > +	/* Re-acquire the inode locks. */
> > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > +		if (!dff->dff_inodes[i])
> > > +			break;
> > > +
> > > +		dff->dff_ilocks[i] = XFS_ILOCK_EXCL;
> > > +		xfs_ilock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> > > +	}
> > > +
> > >  	/* Add the dfops items to the transaction. */
> > >  	list_splice_init(&dff->dff_dfops, &tp->t_dfops);
> > >  	tp->t_flags |= dff->dff_tpflags;
> > > @@ -597,5 +609,43 @@ xfs_defer_freeezer_finish(
> > >  	struct xfs_defer_freezer	*dff)
> > >  {
> > >  	xfs_defer_cancel_list(mp, &dff->dff_dfops);
> > > +	xfs_defer_freezer_irele(dff);
> > >  	kmem_free(dff);
> > >  }
> > > +
> > > +/*
> > > + * Attach an inode to this deferred ops freezer.  Callers must hold ILOCK_EXCL,
> > > + * which will be dropped and reacquired when we're ready to thaw the frozen
> > > + * deferred ops.
> > > + */
> > > +int
> > > +xfs_defer_freezer_ijoin(
> > > +	struct xfs_defer_freezer	*dff,
> > > +	struct xfs_inode		*ip)
> > > +{
> > > +	unsigned int			i;
> > > +
> > > +	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> > > +
> > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > +		if (dff->dff_inodes[i] == ip)
> > > +			goto out;
> > > +		if (dff->dff_inodes[i] == NULL)
> > > +			break;
> > > +	}
> > > +
> > > +	if (i == XFS_DEFER_FREEZER_INODES) {
> > > +		ASSERT(0);
> > > +		return -EFSCORRUPTED;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Attach this inode to the freezer and drop its ILOCK because we
> > > +	 * assume the caller will need to allocate a transaction.
> > > +	 */
> > > +	dff->dff_inodes[i] = ip;
> > > +	dff->dff_ilocks[i] = 0;
> > > +out:
> > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > +	return 0;
> > > +}
> > > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> > > index 7ae05e10d750..0052a0313283 100644
> > > --- a/fs/xfs/libxfs/xfs_defer.h
> > > +++ b/fs/xfs/libxfs/xfs_defer.h
> > > @@ -76,6 +76,11 @@ struct xfs_defer_freezer {
> > >  	/* Deferred ops state saved from the transaction. */
> > >  	struct list_head	dff_dfops;
> > >  	unsigned int		dff_tpflags;
> > > +
> > > +	/* Inodes to hold when we want to finish the deferred work items. */
> > > +#define XFS_DEFER_FREEZER_INODES	2
> > > +	unsigned int		dff_ilocks[XFS_DEFER_FREEZER_INODES];
> > > +	struct xfs_inode	*dff_inodes[XFS_DEFER_FREEZER_INODES];
> > >  };
> > >  
> > >  /* Functions to freeze a chain of deferred operations for later. */
> > > @@ -83,5 +88,10 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp);
> > >  void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp);
> > >  void xfs_defer_freeezer_finish(struct xfs_mount *mp,
> > >  		struct xfs_defer_freezer *dff);
> > > +int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff,
> > > +		struct xfs_inode *ip);
> > > +
> > > +/* These functions must be provided by the xfs implementation. */
> > > +void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff);
> > >  
> > >  #endif /* __XFS_DEFER_H__ */
> > > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> > > index c733bdeeeb9b..bbce191d8fcd 100644
> > > --- a/fs/xfs/xfs_bmap_item.c
> > > +++ b/fs/xfs/xfs_bmap_item.c
> > > @@ -530,12 +530,13 @@ xfs_bui_item_recover(
> > >  	}
> > >  
> > >  	error = xlog_recover_trans_commit(tp, dffp);
> > > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > -	xfs_irele(ip);
> > > -	return error;
> > > +	if (error)
> > > +		goto err_rele;
> > > +	return xfs_defer_freezer_ijoin(*dffp, ip);
> > >  
> > >  err_inode:
> > >  	xfs_trans_cancel(tp);
> > > +err_rele:
> > >  	if (ip) {
> > >  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > >  		xfs_irele(ip);
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index 17a0b86fe701..b96ddf5ff334 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -12,6 +12,7 @@
> > >  #include "xfs_sb.h"
> > >  #include "xfs_mount.h"
> > >  #include "xfs_inode.h"
> > > +#include "xfs_defer.h"
> > >  #include "xfs_trans.h"
> > >  #include "xfs_trans_priv.h"
> > >  #include "xfs_inode_item.h"
> > > @@ -1847,3 +1848,21 @@ xfs_start_block_reaping(
> > >  	xfs_queue_eofblocks(mp);
> > >  	xfs_queue_cowblocks(mp);
> > >  }
> > > +
> > > +/* Release all the inode resources attached to this freezer. */
> > > +void
> > > +xfs_defer_freezer_irele(
> > > +	struct xfs_defer_freezer	*dff)
> > > +{
> > > +	unsigned int			i;
> > > +
> > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > +		if (!dff->dff_inodes[i])
> > > +			break;
> > > +
> > > +		if (dff->dff_ilocks[i])
> > > +			xfs_iunlock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> > > +		xfs_irele(dff->dff_inodes[i]);
> > > +		dff->dff_inodes[i] = NULL;
> > > +	}
> > > +}
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover
  2020-05-06 13:56       ` Brian Foster
@ 2020-05-06 17:01         ` Darrick J. Wong
  2020-05-07  9:53           ` Brian Foster
  0 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2020-05-06 17:01 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, May 06, 2020 at 09:56:56AM -0400, Brian Foster wrote:
> On Tue, May 05, 2020 at 05:34:23PM -0700, Darrick J. Wong wrote:
> > On Tue, May 05, 2020 at 10:11:38AM -0400, Brian Foster wrote:
> > > On Mon, May 04, 2020 at 06:13:53PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > In xfs_bui_item_recover, there exists a use-after-free bug with regards
> > > > to the inode that is involved in the bmap replay operation.  If the
> > > > mapping operation does not complete, we call xfs_bmap_unmap_extent to
> > > > create a deferred op to finish the unmapping work, and we retain a
> > > > pointer to the incore inode.
> > > > 
> > > > Unfortunately, the very next thing we do is commit the transaction and
> > > > drop the inode.  If reclaim tears down the inode before we try to finish
> > > > the defer ops, we dereference garbage and blow up.  Therefore, create a
> > > > way to join inodes to the defer ops freezer so that we can maintain the
> > > > xfs_inode reference until we're done with the inode.
> > > > 
> > > > Note: This imposes the requirement that there be enough memory to keep
> > > > every incore inode in memory throughout recovery.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > 
> > > Maybe I'm missing something, but I thought the discussion on the
> > > previous version[1] landed on an approach where the intent would hold a
> > > reference to the inode. Wouldn't that break the dependency on the dfops
> > > freeze/thaw mechanism?
> > 
> > We did, but as I started pondering how to make this work in practice,
> > I soon realized that it's not as simple as "don't irele the inode":
> > 
> > (Part 1)
> > 
> > It would be easy enough to add a flag to struct xfs_bmap_intent that
> > means "when you're done, unlock and irele the inode".  You'd have to add
> > this capability to any log item that deals with inodes, but only
> > deferred bmap items need this.
> > 
> > Log recovery looks like this:
> > 
> > ----------------
> > Transaction 0
> > BUI1 unmap file1, offset1, startblock1, length == 200000
> > ----------------
> > <end>
> > 
> > What happens if the first bunmapi call doesn't actually unmap
> > everything?  We'll queue yet another bmap intent item (call it BUI2) to
> > finish the work.  This means that we can't just drop the inode when
> > we're done processing BUI1; we have to propagate the "unlock and irele"
> > flag to BUI2.  Ok, so now there's this chaining behavior that wasn't
> > there before.
> > 
> > Then I thought about it some more, and wondered what would happen if the
> > log contained two unfinished bmap items for the same inode?  If this
> > happened, you'd deadlock because you've not released the ILOCK for the
> > first unfinished bmap item when you want to acquire the ILOCK for the
> > second unfinished bmap item.
> > 
> 
> Hm, Ok.. so IIUC the lock is implicitly held across these map/unmap
> dfops at runtime, but that's not necessarily the case during recovery
> due to the dfops chaining behavior. To me, this raises the question of
> why we need to reorder/split these particular intents during recovery if
> the original operation would have completed under a single inode lock
> cycle.

Log recovery /did/ function that way (no reordering) until
509955823cc9c, when it was discovered that we don't have a good way to
reconstruct the dfops chain from the list of recovered log intent items.

> The comments in xfs_log_recover.c don't really explain it. From going
> back to commit 509955823cc9c ("xfs: log recovery should replay deferred
> ops in order"), it looks like the issue is that ordering gets screwed up
> for operations that commit multiple intents per-transaction and
> completion of those intents creates new child intents. At runtime, we'd
> process the intents in the current transaction first and then get to the
> child intents in the order they were created. If we didn't have the
> recovery time dfops shuffling implemented in this patch, recovery would
> process the first intent in the log, then all of the child intents of
> that one before getting to the next intent in the log that presumably
> would have been second at runtime. Am I following that correctly?

Correct.  If we could figure out how to rebuild the dfops chain (and
thus the ordering requirements), we would be able to eliminate this
freezer stuff entirely.

> I think the rest makes sense from the perspective of not wanting to
> plumb the conditional locking complexity through the dfops/intent
> processing for every intent type that eventually needs it, as opposed to
> containing in the log recovery code (via dfops magic). I need to think
> about all that some more, but I'd like to make sure I understand why we
> have this recovery requirement in the first place.

<nod>  It took me a few hours to figure out why that patch was correct
the first time I saw it.  It took me a few hours again(!) when I was
trying to write this series. :/

--D

> Brian
> 
> > You'd have to drop the ILOCK (but retain the inode reference) between
> > the two bmap items.  I think this is currently impossible because there
> > aren't any transactions that queue multiple bmap intents per
> > transaction, nobody else holds the ilock, and the existing rmap
> > implementation of the swapext ioctl doesn't allow swapping a file with
> > itself.  However...
> > 
> > (Part 2)
> > 
> > The second thing I thought about is, what if there's a lot of work after
> > transaction 0?  Say the recovery stream looks like:
> > 
> > ----------------
> > Transaction 0
> > BUI1 unmap file1, offset1, startblock1, length == 200000
> > ----------------
> > <10000 more intents applied to other things>
> > ----------------
> > <end>
> > 
> > This is a problem, because we must drop the ILOCK after completing the
> > BUI1 work so that we can process the 10000 more intents before we can
> > get to BUI2.  We cannot let the tail of the log get pinned on the locked
> > inode, so we must release the ILOCK after we're done with BUI1 and
> > re-acquire it when we're ready to deal with BUI2.
> > 
> > This means I need /two/ flags in struct xfs_bmap_intent -- one to say
> > that we need to irele the inode at the end of processing, and another to
> > say that the bmap item needs to grab the ILOCK.  If there's going to be
> > a BUI2 as a result of recovering BUI1, the first flag must be propagated
> > to BUI2, but the second flag must /not/ be propagated.
> > 
> > A potential counterargument is that we crammed all 10000 intents into
> > the log without pinning the log, so maybe we can hold the ILOCK.
> > However....
> > 
> > (Part 3)
> > 
> > Next, I considered how the swapext log item in the atomic file update
> > patchset would interact with log recovery.  A couple of notes about the
> > swapext log items:
> > 
> > 1. They are careful not to create the kinds of bunmap operations that
> > would cause the creation of /more/ BUIs.
> > 
> > 2. Every time we log one extent's worth of unmap/remap operations (using
> > deferred bmap log items, naturally) we log an done item for the original
> > swapext intent item and immediately log another swapext intent for the
> > work remaining.
> > 
> > I came up with the following sequence of transactions to recover:
> > 
> > ----------------
> > Transaction 0
> > SXI0 file1, offset1, file2, offset2, length == 10
> > ----------------
> > Transaction 1
> > BUI1 unmap file1, offset1, startblock1, length == 2
> > BUI2 unmap file2, offset2, startblock2, length == 2
> > BUI3 map file1, offset1, startblock2, length == 2
> > BUI4 map file2, offset2, startblock1, length == 2
> > SXD done (SXI0)
> > SXI5 file1, offset1 + 2, file2, offset2 + 2, length == 8
> > ---------------
> > <end of log>
> > 
> > Recovery in this case will end up with BUI[1-4] and SXI5 that still need
> > processing.  Each of the 5 intent items gets its own recovery
> > transaction and dfops freezer chain.  BUI1, BUI3, and SXI5 will each
> > require ilock'd access to file1, which means that each of them will have
> > to have the ability to drop the ILOCK but retain the reference.  The
> > same argument applies to file2 and BUI2, BUI4, and SXI5.  Note also that
> > in our brave new world, file1 and file2 can be the same.
> > 
> > For the swapext log item type this inode management is particularly
> > acute because we're certain to have new SXIs created as a result of
> > recovering an SXI.
> > 
> > (End)
> > 
> > In conclusion, we need a mechanism to drop both the inode lock and the
> > inode reference.  Each log item type that works with inodes will have to
> > set aside 2 flags somewhere in the incore extent structure, and provide
> > more or less the same code to manage that state.  Right now that's just
> > the bmap items, but then the swapext items and the deferred xattr items
> > will have that same requirement when they get added.
> > 
> > If we do that, the inode reference and ilock management diverges even
> > further from regular operations.  Regular callers are expected to iget
> > and ilock the inode and maintain that all the way to the end of
> > xfs_trans_commit.  Log recovery, on the other hand, will add a bunch of
> > state handling to some of the log items so that we can drop the ILOCK
> > after the first item, and then drop the inode reference after finishing
> > the last possible user of that inode in the revived dfops chain.
> > 
> > On the other hand, keeping the inode management in the defer freezer
> > code means that I keep all that complexity and state management out of
> > the ->finish_item code.  When we go to finish the new incore intents
> > that get created during recovery, the inode reference and ILOCK rules
> > are the same as anywhere else -- the caller is required to grab the
> > inode, the transaction, and the ilock (in that order).
> > 
> > To the extent that recovering log intent items is /not/ the same as
> > running a normal transaction, all that weirdness is concentrated in
> > xfs_log_recover.c and a single function in xfs_defer.c.  The desired
> > behavior is the same across all the log intent item types, so I
> > decided in favor of keeping the centralised behavior and (trying to)
> > contain the wacky.  We don't need all that right away, but we will.
> > 
> > Note that I did make it so that we retain the reference always, so
> > there's no longer a need for irele/igrab cycles, which makes capturing
> > the dfops chain less error prone.
> > 
> > --D
> > 
> > > Brian
> > > 
> > > [1] https://lore.kernel.org/linux-xfs/20200429235818.GX6742@magnolia/
> > > 
> > > >  fs/xfs/libxfs/xfs_defer.c |   50 +++++++++++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/libxfs/xfs_defer.h |   10 +++++++++
> > > >  fs/xfs/xfs_bmap_item.c    |    7 ++++--
> > > >  fs/xfs/xfs_icache.c       |   19 +++++++++++++++++
> > > >  4 files changed, 83 insertions(+), 3 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > > > index ea4d28851bbd..72933fdafcb2 100644
> > > > --- a/fs/xfs/libxfs/xfs_defer.c
> > > > +++ b/fs/xfs/libxfs/xfs_defer.c
> > > > @@ -16,6 +16,7 @@
> > > >  #include "xfs_inode.h"
> > > >  #include "xfs_inode_item.h"
> > > >  #include "xfs_trace.h"
> > > > +#include "xfs_icache.h"
> > > >  
> > > >  /*
> > > >   * Deferred Operations in XFS
> > > > @@ -583,8 +584,19 @@ xfs_defer_thaw(
> > > >  	struct xfs_defer_freezer	*dff,
> > > >  	struct xfs_trans		*tp)
> > > >  {
> > > > +	int				i;
> > > > +
> > > >  	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> > > >  
> > > > +	/* Re-acquire the inode locks. */
> > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > +		if (!dff->dff_inodes[i])
> > > > +			break;
> > > > +
> > > > +		dff->dff_ilocks[i] = XFS_ILOCK_EXCL;
> > > > +		xfs_ilock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> > > > +	}
> > > > +
> > > >  	/* Add the dfops items to the transaction. */
> > > >  	list_splice_init(&dff->dff_dfops, &tp->t_dfops);
> > > >  	tp->t_flags |= dff->dff_tpflags;
> > > > @@ -597,5 +609,43 @@ xfs_defer_freeezer_finish(
> > > >  	struct xfs_defer_freezer	*dff)
> > > >  {
> > > >  	xfs_defer_cancel_list(mp, &dff->dff_dfops);
> > > > +	xfs_defer_freezer_irele(dff);
> > > >  	kmem_free(dff);
> > > >  }
> > > > +
> > > > +/*
> > > > + * Attach an inode to this deferred ops freezer.  Callers must hold ILOCK_EXCL,
> > > > + * which will be dropped and reacquired when we're ready to thaw the frozen
> > > > + * deferred ops.
> > > > + */
> > > > +int
> > > > +xfs_defer_freezer_ijoin(
> > > > +	struct xfs_defer_freezer	*dff,
> > > > +	struct xfs_inode		*ip)
> > > > +{
> > > > +	unsigned int			i;
> > > > +
> > > > +	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> > > > +
> > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > +		if (dff->dff_inodes[i] == ip)
> > > > +			goto out;
> > > > +		if (dff->dff_inodes[i] == NULL)
> > > > +			break;
> > > > +	}
> > > > +
> > > > +	if (i == XFS_DEFER_FREEZER_INODES) {
> > > > +		ASSERT(0);
> > > > +		return -EFSCORRUPTED;
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * Attach this inode to the freezer and drop its ILOCK because we
> > > > +	 * assume the caller will need to allocate a transaction.
> > > > +	 */
> > > > +	dff->dff_inodes[i] = ip;
> > > > +	dff->dff_ilocks[i] = 0;
> > > > +out:
> > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > +	return 0;
> > > > +}
> > > > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> > > > index 7ae05e10d750..0052a0313283 100644
> > > > --- a/fs/xfs/libxfs/xfs_defer.h
> > > > +++ b/fs/xfs/libxfs/xfs_defer.h
> > > > @@ -76,6 +76,11 @@ struct xfs_defer_freezer {
> > > >  	/* Deferred ops state saved from the transaction. */
> > > >  	struct list_head	dff_dfops;
> > > >  	unsigned int		dff_tpflags;
> > > > +
> > > > +	/* Inodes to hold when we want to finish the deferred work items. */
> > > > +#define XFS_DEFER_FREEZER_INODES	2
> > > > +	unsigned int		dff_ilocks[XFS_DEFER_FREEZER_INODES];
> > > > +	struct xfs_inode	*dff_inodes[XFS_DEFER_FREEZER_INODES];
> > > >  };
> > > >  
> > > >  /* Functions to freeze a chain of deferred operations for later. */
> > > > @@ -83,5 +88,10 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp);
> > > >  void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp);
> > > >  void xfs_defer_freeezer_finish(struct xfs_mount *mp,
> > > >  		struct xfs_defer_freezer *dff);
> > > > +int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff,
> > > > +		struct xfs_inode *ip);
> > > > +
> > > > +/* These functions must be provided by the xfs implementation. */
> > > > +void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff);
> > > >  
> > > >  #endif /* __XFS_DEFER_H__ */
> > > > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> > > > index c733bdeeeb9b..bbce191d8fcd 100644
> > > > --- a/fs/xfs/xfs_bmap_item.c
> > > > +++ b/fs/xfs/xfs_bmap_item.c
> > > > @@ -530,12 +530,13 @@ xfs_bui_item_recover(
> > > >  	}
> > > >  
> > > >  	error = xlog_recover_trans_commit(tp, dffp);
> > > > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > -	xfs_irele(ip);
> > > > -	return error;
> > > > +	if (error)
> > > > +		goto err_rele;
> > > > +	return xfs_defer_freezer_ijoin(*dffp, ip);
> > > >  
> > > >  err_inode:
> > > >  	xfs_trans_cancel(tp);
> > > > +err_rele:
> > > >  	if (ip) {
> > > >  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > >  		xfs_irele(ip);
> > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > index 17a0b86fe701..b96ddf5ff334 100644
> > > > --- a/fs/xfs/xfs_icache.c
> > > > +++ b/fs/xfs/xfs_icache.c
> > > > @@ -12,6 +12,7 @@
> > > >  #include "xfs_sb.h"
> > > >  #include "xfs_mount.h"
> > > >  #include "xfs_inode.h"
> > > > +#include "xfs_defer.h"
> > > >  #include "xfs_trans.h"
> > > >  #include "xfs_trans_priv.h"
> > > >  #include "xfs_inode_item.h"
> > > > @@ -1847,3 +1848,21 @@ xfs_start_block_reaping(
> > > >  	xfs_queue_eofblocks(mp);
> > > >  	xfs_queue_cowblocks(mp);
> > > >  }
> > > > +
> > > > +/* Release all the inode resources attached to this freezer. */
> > > > +void
> > > > +xfs_defer_freezer_irele(
> > > > +	struct xfs_defer_freezer	*dff)
> > > > +{
> > > > +	unsigned int			i;
> > > > +
> > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > +		if (!dff->dff_inodes[i])
> > > > +			break;
> > > > +
> > > > +		if (dff->dff_ilocks[i])
> > > > +			xfs_iunlock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> > > > +		xfs_irele(dff->dff_inodes[i]);
> > > > +		dff->dff_inodes[i] = NULL;
> > > > +	}
> > > > +}
> > > > 
> > > 
> > 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover
  2020-05-06 17:01         ` Darrick J. Wong
@ 2020-05-07  9:53           ` Brian Foster
  2020-05-07 15:09             ` Darrick J. Wong
  0 siblings, 1 reply; 19+ messages in thread
From: Brian Foster @ 2020-05-07  9:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, May 06, 2020 at 10:01:50AM -0700, Darrick J. Wong wrote:
> On Wed, May 06, 2020 at 09:56:56AM -0400, Brian Foster wrote:
> > On Tue, May 05, 2020 at 05:34:23PM -0700, Darrick J. Wong wrote:
> > > On Tue, May 05, 2020 at 10:11:38AM -0400, Brian Foster wrote:
> > > > On Mon, May 04, 2020 at 06:13:53PM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > 
> > > > > In xfs_bui_item_recover, there exists a use-after-free bug with regards
> > > > > to the inode that is involved in the bmap replay operation.  If the
> > > > > mapping operation does not complete, we call xfs_bmap_unmap_extent to
> > > > > create a deferred op to finish the unmapping work, and we retain a
> > > > > pointer to the incore inode.
> > > > > 
> > > > > Unfortunately, the very next thing we do is commit the transaction and
> > > > > drop the inode.  If reclaim tears down the inode before we try to finish
> > > > > the defer ops, we dereference garbage and blow up.  Therefore, create a
> > > > > way to join inodes to the defer ops freezer so that we can maintain the
> > > > > xfs_inode reference until we're done with the inode.
> > > > > 
> > > > > Note: This imposes the requirement that there be enough memory to keep
> > > > > every incore inode in memory throughout recovery.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > > > 
> > > > Maybe I'm missing something, but I thought the discussion on the
> > > > previous version[1] landed on an approach where the intent would hold a
> > > > reference to the inode. Wouldn't that break the dependency on the dfops
> > > > freeze/thaw mechanism?
> > > 
> > > We did, but as I started pondering how to make this work in practice,
> > > I soon realized that it's not as simple as "don't irele the inode":
> > > 
> > > (Part 1)
> > > 
> > > It would be easy enough to add a flag to struct xfs_bmap_intent that
> > > means "when you're done, unlock and irele the inode".  You'd have to add
> > > this capability to any log item that deals with inodes, but only
> > > deferred bmap items need this.
> > > 
> > > Log recovery looks like this:
> > > 
> > > ----------------
> > > Transaction 0
> > > BUI1 unmap file1, offset1, startblock1, length == 200000
> > > ----------------
> > > <end>
> > > 
> > > What happens if the first bunmapi call doesn't actually unmap
> > > everything?  We'll queue yet another bmap intent item (call it BUI2) to
> > > finish the work.  This means that we can't just drop the inode when
> > > we're done processing BUI1; we have to propagate the "unlock and irele"
> > > flag to BUI2.  Ok, so now there's this chaining behavior that wasn't
> > > there before.
> > > 
> > > Then I thought about it some more, and wondered what would happen if the
> > > log contained two unfinished bmap items for the same inode?  If this
> > > happened, you'd deadlock because you've not released the ILOCK for the
> > > first unfinished bmap item when you want to acquire the ILOCK for the
> > > second unfinished bmap item.
> > > 
> > 
> > Hm, Ok.. so IIUC the lock is implicitly held across these map/unmap
> > dfops at runtime, but that's not necessarily the case during recovery
> > due to the dfops chaining behavior. To me, this raises the question of
> > why we need to reorder/split these particular intents during recovery if
> > the original operation would have completed under a single inode lock
> > cycle.
> 
> Log recovery /did/ function that way (no reordering) until
> 509955823cc9c, when it was discovered that we don't have a good way to
> reconstruct the dfops chain from the list of recovered log intent items.
> 
> > The comments in xfs_log_recover.c don't really explain it. From going
> > back to commit 509955823cc9c ("xfs: log recovery should replay deferred
> > ops in order"), it looks like the issue is that ordering gets screwed up
> > for operations that commit multiple intents per-transaction and
> > completion of those intents creates new child intents. At runtime, we'd
> > process the intents in the current transaction first and then get to the
> > child intents in the order they were created. If we didn't have the
> > recovery time dfops shuffling implemented in this patch, recovery would
> > process the first intent in the log, then all of the child intents of
> > that one before getting to the next intent in the log that presumably
> > would have been second at runtime. Am I following that correctly?
> 
> Correct.  If we could figure out how to rebuild the dfops chain (and
> thus the ordering requirements), we would be able to eliminate this
> freezer stuff entirely.
> 

Ok, thanks. I was wondering the same, but it's not clear how to do that
with just the intent information in the log and big checkpoint
transactions. Perhaps we could if we had a transactional id that was
common across intents in a single transaction, or otherwise unique
intent ids that would allow an intent to refer to the next in a chain at
the time the intent was logged (such that recovery could learn to
construct and process chains instead of just individual intents). I'm
curious if we're going to want/need to solve that issue somehow or
another more directly eventually as we grow more intents and perhaps
come up with new and more complex intent-based operations (with more
complex recovery scenarios).

> > I think the rest makes sense from the perspective of not wanting to
> > plumb the conditional locking complexity through the dfops/intent
> > processing for every intent type that eventually needs it, as opposed to
> > containing in the log recovery code (via dfops magic). I need to think
> > about all that some more, but I'd like to make sure I understand why we
> > have this recovery requirement in the first place.
> 
> <nod>  It took me a few hours to figure out why that patch was correct
> the first time I saw it.  It took me a few hours again(!) when I was
> trying to write this series. :/
> 

If I follow correctly, the rmap swapext sequence makes for a good
example because the bmap completions generate the rmap intents. A
simplified (i.e. single extent remap) runtime transactional flow for
that looks something like the following:

t1: unmap intent, map intent
t2: unmap done, map intent, runmap intent
t3: map done, runmap intent, rmap intent
t4: runmap done, rmap intent
...

So at runtime these operations complete in the obvious linear order.
Intents drop off the front as done items are logged and new intents are
tacked on the end. If we crash between t2 and t3, however, we see t2 in
the log and would end up completing the map intent and the subsequently
generated rmap intent (that pops up in t3 at runtime) before we process
the runmap intent that's still in the log (and was originally in the
dfops queue for t2). I _think_ that's essentially the scenario described
in 509955823cc9c, but it's not clear to me if it's the same runtime
example..

Brian

> --D
> 
> > Brian
> > 
> > > You'd have to drop the ILOCK (but retain the inode reference) between
> > > the two bmap items.  I think this is currently impossible because there
> > > aren't any transactions that queue multiple bmap intents per
> > > transaction, nobody else holds the ilock, and the existing rmap
> > > implementation of the swapext ioctl doesn't allow swapping a file with
> > > itself.  However...
> > > 
> > > (Part 2)
> > > 
> > > The second thing I thought about is, what if there's a lot of work after
> > > transaction 0?  Say the recovery stream looks like:
> > > 
> > > ----------------
> > > Transaction 0
> > > BUI1 unmap file1, offset1, startblock1, length == 200000
> > > ----------------
> > > <10000 more intents applied to other things>
> > > ----------------
> > > <end>
> > > 
> > > This is a problem, because we must drop the ILOCK after completing the
> > > BUI1 work so that we can process the 10000 more intents before we can
> > > get to BUI2.  We cannot let the tail of the log get pinned on the locked
> > > inode, so we must release the ILOCK after we're done with BUI1 and
> > > re-acquire it when we're ready to deal with BUI2.
> > > 
> > > This means I need /two/ flags in struct xfs_bmap_intent -- one to say
> > > that we need to irele the inode at the end of processing, and another to
> > > say that the bmap item needs to grab the ILOCK.  If there's going to be
> > > a BUI2 as a result of recovering BUI1, the first flag must be propagated
> > > to BUI2, but the second flag must /not/ be propagated.
> > > 
> > > A potential counterargument is that we crammed all 10000 intents into
> > > the log without pinning the log, so maybe we can hold the ILOCK.
> > > However....
> > > 
> > > (Part 3)
> > > 
> > > Next, I considered how the swapext log item in the atomic file update
> > > patchset would interact with log recovery.  A couple of notes about the
> > > swapext log items:
> > > 
> > > 1. They are careful not to create the kinds of bunmap operations that
> > > would cause the creation of /more/ BUIs.
> > > 
> > > 2. Every time we log one extent's worth of unmap/remap operations (using
> > > deferred bmap log items, naturally) we log an done item for the original
> > > swapext intent item and immediately log another swapext intent for the
> > > work remaining.
> > > 
> > > I came up with the following sequence of transactions to recover:
> > > 
> > > ----------------
> > > Transaction 0
> > > SXI0 file1, offset1, file2, offset2, length == 10
> > > ----------------
> > > Transaction 1
> > > BUI1 unmap file1, offset1, startblock1, length == 2
> > > BUI2 unmap file2, offset2, startblock2, length == 2
> > > BUI3 map file1, offset1, startblock2, length == 2
> > > BUI4 map file2, offset2, startblock1, length == 2
> > > SXD done (SXI0)
> > > SXI5 file1, offset1 + 2, file2, offset2 + 2, length == 8
> > > ---------------
> > > <end of log>
> > > 
> > > Recovery in this case will end up with BUI[1-4] and SXI5 that still need
> > > processing.  Each of the 5 intent items gets its own recovery
> > > transaction and dfops freezer chain.  BUI1, BUI3, and SXI5 will each
> > > require ilock'd access to file1, which means that each of them will have
> > > to have the ability to drop the ILOCK but retain the reference.  The
> > > same argument applies to file2 and BUI2, BUI4, and SXI5.  Note also that
> > > in our brave new world, file1 and file2 can be the same.
> > > 
> > > For the swapext log item type this inode management is particularly
> > > acute because we're certain to have new SXIs created as a result of
> > > recovering an SXI.
> > > 
> > > (End)
> > > 
> > > In conclusion, we need a mechanism to drop both the inode lock and the
> > > inode reference.  Each log item type that works with inodes will have to
> > > set aside 2 flags somewhere in the incore extent structure, and provide
> > > more or less the same code to manage that state.  Right now that's just
> > > the bmap items, but then the swapext items and the deferred xattr items
> > > will have that same requirement when they get added.
> > > 
> > > If we do that, the inode reference and ilock management diverges even
> > > further from regular operations.  Regular callers are expected to iget
> > > and ilock the inode and maintain that all the way to the end of
> > > xfs_trans_commit.  Log recovery, on the other hand, will add a bunch of
> > > state handling to some of the log items so that we can drop the ILOCK
> > > after the first item, and then drop the inode reference after finishing
> > > the last possible user of that inode in the revived dfops chain.
> > > 
> > > On the other hand, keeping the inode management in the defer freezer
> > > code means that I keep all that complexity and state management out of
> > > the ->finish_item code.  When we go to finish the new incore intents
> > > that get created during recovery, the inode reference and ILOCK rules
> > > are the same as anywhere else -- the caller is required to grab the
> > > inode, the transaction, and the ilock (in that order).
> > > 
> > > To the extent that recovering log intent items is /not/ the same as
> > > running a normal transaction, all that weirdness is concentrated in
> > > xfs_log_recover.c and a single function in xfs_defer.c.  The desired
> > > behavior is the same across all the log intent item types, so I
> > > decided in favor of keeping the centralised behavior and (trying to)
> > > contain the wacky.  We don't need all that right away, but we will.
> > > 
> > > Note that I did make it so that we retain the reference always, so
> > > there's no longer a need for irele/igrab cycles, which makes capturing
> > > the dfops chain less error prone.
> > > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > [1] https://lore.kernel.org/linux-xfs/20200429235818.GX6742@magnolia/
> > > > 
> > > > >  fs/xfs/libxfs/xfs_defer.c |   50 +++++++++++++++++++++++++++++++++++++++++++++
> > > > >  fs/xfs/libxfs/xfs_defer.h |   10 +++++++++
> > > > >  fs/xfs/xfs_bmap_item.c    |    7 ++++--
> > > > >  fs/xfs/xfs_icache.c       |   19 +++++++++++++++++
> > > > >  4 files changed, 83 insertions(+), 3 deletions(-)
> > > > > 
> > > > > 
> > > > > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > > > > index ea4d28851bbd..72933fdafcb2 100644
> > > > > --- a/fs/xfs/libxfs/xfs_defer.c
> > > > > +++ b/fs/xfs/libxfs/xfs_defer.c
> > > > > @@ -16,6 +16,7 @@
> > > > >  #include "xfs_inode.h"
> > > > >  #include "xfs_inode_item.h"
> > > > >  #include "xfs_trace.h"
> > > > > +#include "xfs_icache.h"
> > > > >  
> > > > >  /*
> > > > >   * Deferred Operations in XFS
> > > > > @@ -583,8 +584,19 @@ xfs_defer_thaw(
> > > > >  	struct xfs_defer_freezer	*dff,
> > > > >  	struct xfs_trans		*tp)
> > > > >  {
> > > > > +	int				i;
> > > > > +
> > > > >  	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> > > > >  
> > > > > +	/* Re-acquire the inode locks. */
> > > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > > +		if (!dff->dff_inodes[i])
> > > > > +			break;
> > > > > +
> > > > > +		dff->dff_ilocks[i] = XFS_ILOCK_EXCL;
> > > > > +		xfs_ilock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> > > > > +	}
> > > > > +
> > > > >  	/* Add the dfops items to the transaction. */
> > > > >  	list_splice_init(&dff->dff_dfops, &tp->t_dfops);
> > > > >  	tp->t_flags |= dff->dff_tpflags;
> > > > > @@ -597,5 +609,43 @@ xfs_defer_freeezer_finish(
> > > > >  	struct xfs_defer_freezer	*dff)
> > > > >  {
> > > > >  	xfs_defer_cancel_list(mp, &dff->dff_dfops);
> > > > > +	xfs_defer_freezer_irele(dff);
> > > > >  	kmem_free(dff);
> > > > >  }
> > > > > +
> > > > > +/*
> > > > > + * Attach an inode to this deferred ops freezer.  Callers must hold ILOCK_EXCL,
> > > > > + * which will be dropped and reacquired when we're ready to thaw the frozen
> > > > > + * deferred ops.
> > > > > + */
> > > > > +int
> > > > > +xfs_defer_freezer_ijoin(
> > > > > +	struct xfs_defer_freezer	*dff,
> > > > > +	struct xfs_inode		*ip)
> > > > > +{
> > > > > +	unsigned int			i;
> > > > > +
> > > > > +	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> > > > > +
> > > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > > +		if (dff->dff_inodes[i] == ip)
> > > > > +			goto out;
> > > > > +		if (dff->dff_inodes[i] == NULL)
> > > > > +			break;
> > > > > +	}
> > > > > +
> > > > > +	if (i == XFS_DEFER_FREEZER_INODES) {
> > > > > +		ASSERT(0);
> > > > > +		return -EFSCORRUPTED;
> > > > > +	}
> > > > > +
> > > > > +	/*
> > > > > +	 * Attach this inode to the freezer and drop its ILOCK because we
> > > > > +	 * assume the caller will need to allocate a transaction.
> > > > > +	 */
> > > > > +	dff->dff_inodes[i] = ip;
> > > > > +	dff->dff_ilocks[i] = 0;
> > > > > +out:
> > > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > +	return 0;
> > > > > +}
> > > > > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> > > > > index 7ae05e10d750..0052a0313283 100644
> > > > > --- a/fs/xfs/libxfs/xfs_defer.h
> > > > > +++ b/fs/xfs/libxfs/xfs_defer.h
> > > > > @@ -76,6 +76,11 @@ struct xfs_defer_freezer {
> > > > >  	/* Deferred ops state saved from the transaction. */
> > > > >  	struct list_head	dff_dfops;
> > > > >  	unsigned int		dff_tpflags;
> > > > > +
> > > > > +	/* Inodes to hold when we want to finish the deferred work items. */
> > > > > +#define XFS_DEFER_FREEZER_INODES	2
> > > > > +	unsigned int		dff_ilocks[XFS_DEFER_FREEZER_INODES];
> > > > > +	struct xfs_inode	*dff_inodes[XFS_DEFER_FREEZER_INODES];
> > > > >  };
> > > > >  
> > > > >  /* Functions to freeze a chain of deferred operations for later. */
> > > > > @@ -83,5 +88,10 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp);
> > > > >  void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp);
> > > > >  void xfs_defer_freeezer_finish(struct xfs_mount *mp,
> > > > >  		struct xfs_defer_freezer *dff);
> > > > > +int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff,
> > > > > +		struct xfs_inode *ip);
> > > > > +
> > > > > +/* These functions must be provided by the xfs implementation. */
> > > > > +void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff);
> > > > >  
> > > > >  #endif /* __XFS_DEFER_H__ */
> > > > > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> > > > > index c733bdeeeb9b..bbce191d8fcd 100644
> > > > > --- a/fs/xfs/xfs_bmap_item.c
> > > > > +++ b/fs/xfs/xfs_bmap_item.c
> > > > > @@ -530,12 +530,13 @@ xfs_bui_item_recover(
> > > > >  	}
> > > > >  
> > > > >  	error = xlog_recover_trans_commit(tp, dffp);
> > > > > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > -	xfs_irele(ip);
> > > > > -	return error;
> > > > > +	if (error)
> > > > > +		goto err_rele;
> > > > > +	return xfs_defer_freezer_ijoin(*dffp, ip);
> > > > >  
> > > > >  err_inode:
> > > > >  	xfs_trans_cancel(tp);
> > > > > +err_rele:
> > > > >  	if (ip) {
> > > > >  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > >  		xfs_irele(ip);
> > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > > index 17a0b86fe701..b96ddf5ff334 100644
> > > > > --- a/fs/xfs/xfs_icache.c
> > > > > +++ b/fs/xfs/xfs_icache.c
> > > > > @@ -12,6 +12,7 @@
> > > > >  #include "xfs_sb.h"
> > > > >  #include "xfs_mount.h"
> > > > >  #include "xfs_inode.h"
> > > > > +#include "xfs_defer.h"
> > > > >  #include "xfs_trans.h"
> > > > >  #include "xfs_trans_priv.h"
> > > > >  #include "xfs_inode_item.h"
> > > > > @@ -1847,3 +1848,21 @@ xfs_start_block_reaping(
> > > > >  	xfs_queue_eofblocks(mp);
> > > > >  	xfs_queue_cowblocks(mp);
> > > > >  }
> > > > > +
> > > > > +/* Release all the inode resources attached to this freezer. */
> > > > > +void
> > > > > +xfs_defer_freezer_irele(
> > > > > +	struct xfs_defer_freezer	*dff)
> > > > > +{
> > > > > +	unsigned int			i;
> > > > > +
> > > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > > +		if (!dff->dff_inodes[i])
> > > > > +			break;
> > > > > +
> > > > > +		if (dff->dff_ilocks[i])
> > > > > +			xfs_iunlock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> > > > > +		xfs_irele(dff->dff_inodes[i]);
> > > > > +		dff->dff_inodes[i] = NULL;
> > > > > +	}
> > > > > +}
> > > > > 
> > > > 
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover
  2020-05-07  9:53           ` Brian Foster
@ 2020-05-07 15:09             ` Darrick J. Wong
  2020-05-07 16:58               ` Brian Foster
  0 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2020-05-07 15:09 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, May 07, 2020 at 05:53:06AM -0400, Brian Foster wrote:
> On Wed, May 06, 2020 at 10:01:50AM -0700, Darrick J. Wong wrote:
> > On Wed, May 06, 2020 at 09:56:56AM -0400, Brian Foster wrote:
> > > On Tue, May 05, 2020 at 05:34:23PM -0700, Darrick J. Wong wrote:
> > > > On Tue, May 05, 2020 at 10:11:38AM -0400, Brian Foster wrote:
> > > > > On Mon, May 04, 2020 at 06:13:53PM -0700, Darrick J. Wong wrote:
> > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > 
> > > > > > In xfs_bui_item_recover, there exists a use-after-free bug with regards
> > > > > > to the inode that is involved in the bmap replay operation.  If the
> > > > > > mapping operation does not complete, we call xfs_bmap_unmap_extent to
> > > > > > create a deferred op to finish the unmapping work, and we retain a
> > > > > > pointer to the incore inode.
> > > > > > 
> > > > > > Unfortunately, the very next thing we do is commit the transaction and
> > > > > > drop the inode.  If reclaim tears down the inode before we try to finish
> > > > > > the defer ops, we dereference garbage and blow up.  Therefore, create a
> > > > > > way to join inodes to the defer ops freezer so that we can maintain the
> > > > > > xfs_inode reference until we're done with the inode.
> > > > > > 
> > > > > > Note: This imposes the requirement that there be enough memory to keep
> > > > > > every incore inode in memory throughout recovery.
> > > > > > 
> > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > ---
> > > > > 
> > > > > Maybe I'm missing something, but I thought the discussion on the
> > > > > previous version[1] landed on an approach where the intent would hold a
> > > > > reference to the inode. Wouldn't that break the dependency on the dfops
> > > > > freeze/thaw mechanism?
> > > > 
> > > > We did, but as I started pondering how to make this work in practice,
> > > > I soon realized that it's not as simple as "don't irele the inode":
> > > > 
> > > > (Part 1)
> > > > 
> > > > It would be easy enough to add a flag to struct xfs_bmap_intent that
> > > > means "when you're done, unlock and irele the inode".  You'd have to add
> > > > this capability to any log item that deals with inodes, but only
> > > > deferred bmap items need this.
> > > > 
> > > > Log recovery looks like this:
> > > > 
> > > > ----------------
> > > > Transaction 0
> > > > BUI1 unmap file1, offset1, startblock1, length == 200000
> > > > ----------------
> > > > <end>
> > > > 
> > > > What happens if the first bunmapi call doesn't actually unmap
> > > > everything?  We'll queue yet another bmap intent item (call it BUI2) to
> > > > finish the work.  This means that we can't just drop the inode when
> > > > we're done processing BUI1; we have to propagate the "unlock and irele"
> > > > flag to BUI2.  Ok, so now there's this chaining behavior that wasn't
> > > > there before.
> > > > 
> > > > Then I thought about it some more, and wondered what would happen if the
> > > > log contained two unfinished bmap items for the same inode?  If this
> > > > happened, you'd deadlock because you've not released the ILOCK for the
> > > > first unfinished bmap item when you want to acquire the ILOCK for the
> > > > second unfinished bmap item.
> > > > 
> > > 
> > > Hm, Ok.. so IIUC the lock is implicitly held across these map/unmap
> > > dfops at runtime, but that's not necessarily the case during recovery
> > > due to the dfops chaining behavior. To me, this raises the question of
> > > why we need to reorder/split these particular intents during recovery if
> > > the original operation would have completed under a single inode lock
> > > cycle.
> > 
> > Log recovery /did/ function that way (no reordering) until
> > 509955823cc9c, when it was discovered that we don't have a good way to
> > reconstruct the dfops chain from the list of recovered log intent items.
> > 
> > > The comments in xfs_log_recover.c don't really explain it. From going
> > > back to commit 509955823cc9c ("xfs: log recovery should replay deferred
> > > ops in order"), it looks like the issue is that ordering gets screwed up
> > > for operations that commit multiple intents per-transaction and
> > > completion of those intents creates new child intents. At runtime, we'd
> > > process the intents in the current transaction first and then get to the
> > > child intents in the order they were created. If we didn't have the
> > > recovery time dfops shuffling implemented in this patch, recovery would
> > > process the first intent in the log, then all of the child intents of
> > > that one before getting to the next intent in the log that presumably
> > > would have been second at runtime. Am I following that correctly?
> > 
> > Correct.  If we could figure out how to rebuild the dfops chain (and
> > thus the ordering requirements), we would be able to eliminate this
> > freezer stuff entirely.
> > 
> 
> Ok, thanks. I was wondering the same, but it's not clear how to do that
> with just the intent information in the log and big checkpoint
> transactions. Perhaps we could if we had a transactional id that was
> common across intents in a single transaction, or otherwise unique
> intent ids that would allow an intent to refer to the next in a chain at
> the time the intent was logged (such that recovery could learn to
> construct and process chains instead of just individual intents). I'm
> curious if we're going to want/need to solve that issue somehow or
> another more directly eventually as we grow more intents and perhaps
> come up with new and more complex intent-based operations (with more
> complex recovery scenarios).

<nod>  Too bad that's a log format change. :(

> > > I think the rest makes sense from the perspective of not wanting to
> > > plumb the conditional locking complexity through the dfops/intent
> > > processing for every intent type that eventually needs it, as opposed to
> > > containing in the log recovery code (via dfops magic). I need to think
> > > about all that some more, but I'd like to make sure I understand why we
> > > have this recovery requirement in the first place.
> > 
> > <nod>  It took me a few hours to figure out why that patch was correct
> > the first time I saw it.  It took me a few hours again(!) when I was
> > trying to write this series. :/
> > 
> 
> If I follow correctly, the rmap swapext sequence makes for a good
> example because the bmap completions generate the rmap intents. A
> simplified (i.e. single extent remap) runtime transactional flow for
> that looks something like the following:
> 
> t1: unmap intent, map intent
> t2: unmap done, map intent, runmap intent
> t3: map done, runmap intent, rmap intent
> t4: runmap done, rmap intent

Work items that aren't touched in a particular transaction do not get
relogged, so this sequence is:

t1: unmap(0) intent, map(1) intent
t2: unmap(0) done, runmap(2) intent
t3: map(1) done, rmap(3) intent
t4: runmap(2) done
t5: rmap(3) done

(Maybe I need to take another look at the autorelogging patchset.)

> ...
> 
> So at runtime these operations complete in the obvious linear order.
> Intents drop off the front as done items are logged and new intents are
> tacked on the end. If we crash between t2 and t3, however, we see t2 in
> the log and would end up completing the map intent and the subsequently
> generated rmap intent (that pops up in t3 at runtime) before we process
> the runmap intent that's still in the log (and was originally in the
> dfops queue for t2). I _think_ that's essentially the scenario described
> in 509955823cc9c,

Yes...

> but it's not clear to me if it's the same runtime example..

...and you arrived at it with a different scenario. :)

--D

> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > > > You'd have to drop the ILOCK (but retain the inode reference) between
> > > > the two bmap items.  I think this is currently impossible because there
> > > > aren't any transactions that queue multiple bmap intents per
> > > > transaction, nobody else holds the ilock, and the existing rmap
> > > > implementation of the swapext ioctl doesn't allow swapping a file with
> > > > itself.  However...
> > > > 
> > > > (Part 2)
> > > > 
> > > > The second thing I thought about is, what if there's a lot of work after
> > > > transaction 0?  Say the recovery stream looks like:
> > > > 
> > > > ----------------
> > > > Transaction 0
> > > > BUI1 unmap file1, offset1, startblock1, length == 200000
> > > > ----------------
> > > > <10000 more intents applied to other things>
> > > > ----------------
> > > > <end>
> > > > 
> > > > This is a problem, because we must drop the ILOCK after completing the
> > > > BUI1 work so that we can process the 10000 more intents before we can
> > > > get to BUI2.  We cannot let the tail of the log get pinned on the locked
> > > > inode, so we must release the ILOCK after we're done with BUI1 and
> > > > re-acquire it when we're ready to deal with BUI2.
> > > > 
> > > > This means I need /two/ flags in struct xfs_bmap_intent -- one to say
> > > > that we need to irele the inode at the end of processing, and another to
> > > > say that the bmap item needs to grab the ILOCK.  If there's going to be
> > > > a BUI2 as a result of recovering BUI1, the first flag must be propagated
> > > > to BUI2, but the second flag must /not/ be propagated.
> > > > 
> > > > A potential counterargument is that we crammed all 10000 intents into
> > > > the log without pinning the log, so maybe we can hold the ILOCK.
> > > > However....
> > > > 
> > > > (Part 3)
> > > > 
> > > > Next, I considered how the swapext log item in the atomic file update
> > > > patchset would interact with log recovery.  A couple of notes about the
> > > > swapext log items:
> > > > 
> > > > 1. They are careful not to create the kinds of bunmap operations that
> > > > would cause the creation of /more/ BUIs.
> > > > 
> > > > 2. Every time we log one extent's worth of unmap/remap operations (using
> > > > deferred bmap log items, naturally) we log an done item for the original
> > > > swapext intent item and immediately log another swapext intent for the
> > > > work remaining.
> > > > 
> > > > I came up with the following sequence of transactions to recover:
> > > > 
> > > > ----------------
> > > > Transaction 0
> > > > SXI0 file1, offset1, file2, offset2, length == 10
> > > > ----------------
> > > > Transaction 1
> > > > BUI1 unmap file1, offset1, startblock1, length == 2
> > > > BUI2 unmap file2, offset2, startblock2, length == 2
> > > > BUI3 map file1, offset1, startblock2, length == 2
> > > > BUI4 map file2, offset2, startblock1, length == 2
> > > > SXD done (SXI0)
> > > > SXI5 file1, offset1 + 2, file2, offset2 + 2, length == 8
> > > > ---------------
> > > > <end of log>
> > > > 
> > > > Recovery in this case will end up with BUI[1-4] and SXI5 that still need
> > > > processing.  Each of the 5 intent items gets its own recovery
> > > > transaction and dfops freezer chain.  BUI1, BUI3, and SXI5 will each
> > > > require ilock'd access to file1, which means that each of them will have
> > > > to have the ability to drop the ILOCK but retain the reference.  The
> > > > same argument applies to file2 and BUI2, BUI4, and SXI5.  Note also that
> > > > in our brave new world, file1 and file2 can be the same.
> > > > 
> > > > For the swapext log item type this inode management is particularly
> > > > acute because we're certain to have new SXIs created as a result of
> > > > recovering an SXI.
> > > > 
> > > > (End)
> > > > 
> > > > In conclusion, we need a mechanism to drop both the inode lock and the
> > > > inode reference.  Each log item type that works with inodes will have to
> > > > set aside 2 flags somewhere in the incore extent structure, and provide
> > > > more or less the same code to manage that state.  Right now that's just
> > > > the bmap items, but then the swapext items and the deferred xattr items
> > > > will have that same requirement when they get added.
> > > > 
> > > > If we do that, the inode reference and ilock management diverges even
> > > > further from regular operations.  Regular callers are expected to iget
> > > > and ilock the inode and maintain that all the way to the end of
> > > > xfs_trans_commit.  Log recovery, on the other hand, will add a bunch of
> > > > state handling to some of the log items so that we can drop the ILOCK
> > > > after the first item, and then drop the inode reference after finishing
> > > > the last possible user of that inode in the revived dfops chain.
> > > > 
> > > > On the other hand, keeping the inode management in the defer freezer
> > > > code means that I keep all that complexity and state management out of
> > > > the ->finish_item code.  When we go to finish the new incore intents
> > > > that get created during recovery, the inode reference and ILOCK rules
> > > > are the same as anywhere else -- the caller is required to grab the
> > > > inode, the transaction, and the ilock (in that order).
> > > > 
> > > > To the extent that recovering log intent items is /not/ the same as
> > > > running a normal transaction, all that weirdness is concentrated in
> > > > xfs_log_recover.c and a single function in xfs_defer.c.  The desired
> > > > behavior is the same across all the log intent item types, so I
> > > > decided in favor of keeping the centralised behavior and (trying to)
> > > > contain the wacky.  We don't need all that right away, but we will.
> > > > 
> > > > Note that I did make it so that we retain the reference always, so
> > > > there's no longer a need for irele/igrab cycles, which makes capturing
> > > > the dfops chain less error prone.
> > > > 
> > > > --D
> > > > 
> > > > > Brian
> > > > > 
> > > > > [1] https://lore.kernel.org/linux-xfs/20200429235818.GX6742@magnolia/
> > > > > 
> > > > > >  fs/xfs/libxfs/xfs_defer.c |   50 +++++++++++++++++++++++++++++++++++++++++++++
> > > > > >  fs/xfs/libxfs/xfs_defer.h |   10 +++++++++
> > > > > >  fs/xfs/xfs_bmap_item.c    |    7 ++++--
> > > > > >  fs/xfs/xfs_icache.c       |   19 +++++++++++++++++
> > > > > >  4 files changed, 83 insertions(+), 3 deletions(-)
> > > > > > 
> > > > > > 
> > > > > > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > > > > > index ea4d28851bbd..72933fdafcb2 100644
> > > > > > --- a/fs/xfs/libxfs/xfs_defer.c
> > > > > > +++ b/fs/xfs/libxfs/xfs_defer.c
> > > > > > @@ -16,6 +16,7 @@
> > > > > >  #include "xfs_inode.h"
> > > > > >  #include "xfs_inode_item.h"
> > > > > >  #include "xfs_trace.h"
> > > > > > +#include "xfs_icache.h"
> > > > > >  
> > > > > >  /*
> > > > > >   * Deferred Operations in XFS
> > > > > > @@ -583,8 +584,19 @@ xfs_defer_thaw(
> > > > > >  	struct xfs_defer_freezer	*dff,
> > > > > >  	struct xfs_trans		*tp)
> > > > > >  {
> > > > > > +	int				i;
> > > > > > +
> > > > > >  	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> > > > > >  
> > > > > > +	/* Re-acquire the inode locks. */
> > > > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > > > +		if (!dff->dff_inodes[i])
> > > > > > +			break;
> > > > > > +
> > > > > > +		dff->dff_ilocks[i] = XFS_ILOCK_EXCL;
> > > > > > +		xfs_ilock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> > > > > > +	}
> > > > > > +
> > > > > >  	/* Add the dfops items to the transaction. */
> > > > > >  	list_splice_init(&dff->dff_dfops, &tp->t_dfops);
> > > > > >  	tp->t_flags |= dff->dff_tpflags;
> > > > > > @@ -597,5 +609,43 @@ xfs_defer_freeezer_finish(
> > > > > >  	struct xfs_defer_freezer	*dff)
> > > > > >  {
> > > > > >  	xfs_defer_cancel_list(mp, &dff->dff_dfops);
> > > > > > +	xfs_defer_freezer_irele(dff);
> > > > > >  	kmem_free(dff);
> > > > > >  }
> > > > > > +
> > > > > > +/*
> > > > > > + * Attach an inode to this deferred ops freezer.  Callers must hold ILOCK_EXCL,
> > > > > > + * which will be dropped and reacquired when we're ready to thaw the frozen
> > > > > > + * deferred ops.
> > > > > > + */
> > > > > > +int
> > > > > > +xfs_defer_freezer_ijoin(
> > > > > > +	struct xfs_defer_freezer	*dff,
> > > > > > +	struct xfs_inode		*ip)
> > > > > > +{
> > > > > > +	unsigned int			i;
> > > > > > +
> > > > > > +	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> > > > > > +
> > > > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > > > +		if (dff->dff_inodes[i] == ip)
> > > > > > +			goto out;
> > > > > > +		if (dff->dff_inodes[i] == NULL)
> > > > > > +			break;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (i == XFS_DEFER_FREEZER_INODES) {
> > > > > > +		ASSERT(0);
> > > > > > +		return -EFSCORRUPTED;
> > > > > > +	}
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Attach this inode to the freezer and drop its ILOCK because we
> > > > > > +	 * assume the caller will need to allocate a transaction.
> > > > > > +	 */
> > > > > > +	dff->dff_inodes[i] = ip;
> > > > > > +	dff->dff_ilocks[i] = 0;
> > > > > > +out:
> > > > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > > +	return 0;
> > > > > > +}
> > > > > > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> > > > > > index 7ae05e10d750..0052a0313283 100644
> > > > > > --- a/fs/xfs/libxfs/xfs_defer.h
> > > > > > +++ b/fs/xfs/libxfs/xfs_defer.h
> > > > > > @@ -76,6 +76,11 @@ struct xfs_defer_freezer {
> > > > > >  	/* Deferred ops state saved from the transaction. */
> > > > > >  	struct list_head	dff_dfops;
> > > > > >  	unsigned int		dff_tpflags;
> > > > > > +
> > > > > > +	/* Inodes to hold when we want to finish the deferred work items. */
> > > > > > +#define XFS_DEFER_FREEZER_INODES	2
> > > > > > +	unsigned int		dff_ilocks[XFS_DEFER_FREEZER_INODES];
> > > > > > +	struct xfs_inode	*dff_inodes[XFS_DEFER_FREEZER_INODES];
> > > > > >  };
> > > > > >  
> > > > > >  /* Functions to freeze a chain of deferred operations for later. */
> > > > > > @@ -83,5 +88,10 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp);
> > > > > >  void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp);
> > > > > >  void xfs_defer_freeezer_finish(struct xfs_mount *mp,
> > > > > >  		struct xfs_defer_freezer *dff);
> > > > > > +int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff,
> > > > > > +		struct xfs_inode *ip);
> > > > > > +
> > > > > > +/* These functions must be provided by the xfs implementation. */
> > > > > > +void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff);
> > > > > >  
> > > > > >  #endif /* __XFS_DEFER_H__ */
> > > > > > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> > > > > > index c733bdeeeb9b..bbce191d8fcd 100644
> > > > > > --- a/fs/xfs/xfs_bmap_item.c
> > > > > > +++ b/fs/xfs/xfs_bmap_item.c
> > > > > > @@ -530,12 +530,13 @@ xfs_bui_item_recover(
> > > > > >  	}
> > > > > >  
> > > > > >  	error = xlog_recover_trans_commit(tp, dffp);
> > > > > > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > > -	xfs_irele(ip);
> > > > > > -	return error;
> > > > > > +	if (error)
> > > > > > +		goto err_rele;
> > > > > > +	return xfs_defer_freezer_ijoin(*dffp, ip);
> > > > > >  
> > > > > >  err_inode:
> > > > > >  	xfs_trans_cancel(tp);
> > > > > > +err_rele:
> > > > > >  	if (ip) {
> > > > > >  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > >  		xfs_irele(ip);
> > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > > > index 17a0b86fe701..b96ddf5ff334 100644
> > > > > > --- a/fs/xfs/xfs_icache.c
> > > > > > +++ b/fs/xfs/xfs_icache.c
> > > > > > @@ -12,6 +12,7 @@
> > > > > >  #include "xfs_sb.h"
> > > > > >  #include "xfs_mount.h"
> > > > > >  #include "xfs_inode.h"
> > > > > > +#include "xfs_defer.h"
> > > > > >  #include "xfs_trans.h"
> > > > > >  #include "xfs_trans_priv.h"
> > > > > >  #include "xfs_inode_item.h"
> > > > > > @@ -1847,3 +1848,21 @@ xfs_start_block_reaping(
> > > > > >  	xfs_queue_eofblocks(mp);
> > > > > >  	xfs_queue_cowblocks(mp);
> > > > > >  }
> > > > > > +
> > > > > > +/* Release all the inode resources attached to this freezer. */
> > > > > > +void
> > > > > > +xfs_defer_freezer_irele(
> > > > > > +	struct xfs_defer_freezer	*dff)
> > > > > > +{
> > > > > > +	unsigned int			i;
> > > > > > +
> > > > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > > > +		if (!dff->dff_inodes[i])
> > > > > > +			break;
> > > > > > +
> > > > > > +		if (dff->dff_ilocks[i])
> > > > > > +			xfs_iunlock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> > > > > > +		xfs_irele(dff->dff_inodes[i]);
> > > > > > +		dff->dff_inodes[i] = NULL;
> > > > > > +	}
> > > > > > +}
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover
  2020-05-07 15:09             ` Darrick J. Wong
@ 2020-05-07 16:58               ` Brian Foster
  0 siblings, 0 replies; 19+ messages in thread
From: Brian Foster @ 2020-05-07 16:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, May 07, 2020 at 08:09:56AM -0700, Darrick J. Wong wrote:
> On Thu, May 07, 2020 at 05:53:06AM -0400, Brian Foster wrote:
> > On Wed, May 06, 2020 at 10:01:50AM -0700, Darrick J. Wong wrote:
> > > On Wed, May 06, 2020 at 09:56:56AM -0400, Brian Foster wrote:
> > > > On Tue, May 05, 2020 at 05:34:23PM -0700, Darrick J. Wong wrote:
> > > > > On Tue, May 05, 2020 at 10:11:38AM -0400, Brian Foster wrote:
> > > > > > On Mon, May 04, 2020 at 06:13:53PM -0700, Darrick J. Wong wrote:
> > > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > 
> > > > > > > In xfs_bui_item_recover, there exists a use-after-free bug with regards
> > > > > > > to the inode that is involved in the bmap replay operation.  If the
> > > > > > > mapping operation does not complete, we call xfs_bmap_unmap_extent to
> > > > > > > create a deferred op to finish the unmapping work, and we retain a
> > > > > > > pointer to the incore inode.
> > > > > > > 
> > > > > > > Unfortunately, the very next thing we do is commit the transaction and
> > > > > > > drop the inode.  If reclaim tears down the inode before we try to finish
> > > > > > > the defer ops, we dereference garbage and blow up.  Therefore, create a
> > > > > > > way to join inodes to the defer ops freezer so that we can maintain the
> > > > > > > xfs_inode reference until we're done with the inode.
> > > > > > > 
> > > > > > > Note: This imposes the requirement that there be enough memory to keep
> > > > > > > every incore inode in memory throughout recovery.
> > > > > > > 
> > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > ---
> > > > > > 
> > > > > > Maybe I'm missing something, but I thought the discussion on the
> > > > > > previous version[1] landed on an approach where the intent would hold a
> > > > > > reference to the inode. Wouldn't that break the dependency on the dfops
> > > > > > freeze/thaw mechanism?
> > > > > 
> > > > > We did, but as I started pondering how to make this work in practice,
> > > > > I soon realized that it's not as simple as "don't irele the inode":
> > > > > 
> > > > > (Part 1)
> > > > > 
> > > > > It would be easy enough to add a flag to struct xfs_bmap_intent that
> > > > > means "when you're done, unlock and irele the inode".  You'd have to add
> > > > > this capability to any log item that deals with inodes, but only
> > > > > deferred bmap items need this.
> > > > > 
> > > > > Log recovery looks like this:
> > > > > 
> > > > > ----------------
> > > > > Transaction 0
> > > > > BUI1 unmap file1, offset1, startblock1, length == 200000
> > > > > ----------------
> > > > > <end>
> > > > > 
> > > > > What happens if the first bunmapi call doesn't actually unmap
> > > > > everything?  We'll queue yet another bmap intent item (call it BUI2) to
> > > > > finish the work.  This means that we can't just drop the inode when
> > > > > we're done processing BUI1; we have to propagate the "unlock and irele"
> > > > > flag to BUI2.  Ok, so now there's this chaining behavior that wasn't
> > > > > there before.
> > > > > 
> > > > > Then I thought about it some more, and wondered what would happen if the
> > > > > log contained two unfinished bmap items for the same inode?  If this
> > > > > happened, you'd deadlock because you've not released the ILOCK for the
> > > > > first unfinished bmap item when you want to acquire the ILOCK for the
> > > > > second unfinished bmap item.
> > > > > 
> > > > 
> > > > Hm, Ok.. so IIUC the lock is implicitly held across these map/unmap
> > > > dfops at runtime, but that's not necessarily the case during recovery
> > > > due to the dfops chaining behavior. To me, this raises the question of
> > > > why we need to reorder/split these particular intents during recovery if
> > > > the original operation would have completed under a single inode lock
> > > > cycle.
> > > 
> > > Log recovery /did/ function that way (no reordering) until
> > > 509955823cc9c, when it was discovered that we don't have a good way to
> > > reconstruct the dfops chain from the list of recovered log intent items.
> > > 
> > > > The comments in xfs_log_recover.c don't really explain it. From going
> > > > back to commit 509955823cc9c ("xfs: log recovery should replay deferred
> > > > ops in order"), it looks like the issue is that ordering gets screwed up
> > > > for operations that commit multiple intents per-transaction and
> > > > completion of those intents creates new child intents. At runtime, we'd
> > > > process the intents in the current transaction first and then get to the
> > > > child intents in the order they were created. If we didn't have the
> > > > recovery time dfops shuffling implemented in this patch, recovery would
> > > > process the first intent in the log, then all of the child intents of
> > > > that one before getting to the next intent in the log that presumably
> > > > would have been second at runtime. Am I following that correctly?
> > > 
> > > Correct.  If we could figure out how to rebuild the dfops chain (and
> > > thus the ordering requirements), we would be able to eliminate this
> > > freezer stuff entirely.
> > > 
> > 
> > Ok, thanks. I was wondering the same, but it's not clear how to do that
> > with just the intent information in the log and big checkpoint
> > transactions. Perhaps we could if we had a transactional id that was
> > common across intents in a single transaction, or otherwise unique
> > intent ids that would allow an intent to refer to the next in a chain at
> > the time the intent was logged (such that recovery could learn to
> > construct and process chains instead of just individual intents). I'm
> > curious if we're going to want/need to solve that issue somehow or
> > another more directly eventually as we grow more intents and perhaps
> > come up with new and more complex intent-based operations (with more
> > complex recovery scenarios).
> 
> <nod>  Too bad that's a log format change. :(
> 

Yeah. It might be worth considering if we have an opportunity to break
format in the future, but not useful for the current code.

> > > > I think the rest makes sense from the perspective of not wanting to
> > > > plumb the conditional locking complexity through the dfops/intent
> > > > processing for every intent type that eventually needs it, as opposed to
> > > > containing in the log recovery code (via dfops magic). I need to think
> > > > about all that some more, but I'd like to make sure I understand why we
> > > > have this recovery requirement in the first place.
> > > 
> > > <nod>  It took me a few hours to figure out why that patch was correct
> > > the first time I saw it.  It took me a few hours again(!) when I was
> > > trying to write this series. :/
> > > 
> > 
> > If I follow correctly, the rmap swapext sequence makes for a good
> > example because the bmap completions generate the rmap intents. A
> > simplified (i.e. single extent remap) runtime transactional flow for
> > that looks something like the following:
> > 
> > t1: unmap intent, map intent
> > t2: unmap done, map intent, runmap intent
> > t3: map done, runmap intent, rmap intent
> > t4: runmap done, rmap intent
> 
> Work items that aren't touched in a particular transaction do not get
> relogged, so this sequence is:
> 
> t1: unmap(0) intent, map(1) intent
> t2: unmap(0) done, runmap(2) intent
> t3: map(1) done, rmap(3) intent
> t4: runmap(2) done
> t5: rmap(3) done
> 
> (Maybe I need to take another look at the autorelogging patchset.)
> 

Ah, I see. Thanks.

Brian

> > ...
> > 
> > So at runtime these operations complete in the obvious linear order.
> > Intents drop off the front as done items are logged and new intents are
> > tacked on the end. If we crash between t2 and t3, however, we see t2 in
> > the log and would end up completing the map intent and the subsequently
> > generated rmap intent (that pops up in t3 at runtime) before we process
> > the runmap intent that's still in the log (and was originally in the
> > dfops queue for t2). I _think_ that's essentially the scenario described
> > in 509955823cc9c,
> 
> Yes...
> 
> > but it's not clear to me if it's the same runtime example..
> 
> ...and you arrived at it with a different scenario. :)
> 
> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > > You'd have to drop the ILOCK (but retain the inode reference) between
> > > > > the two bmap items.  I think this is currently impossible because there
> > > > > aren't any transactions that queue multiple bmap intents per
> > > > > transaction, nobody else holds the ilock, and the existing rmap
> > > > > implementation of the swapext ioctl doesn't allow swapping a file with
> > > > > itself.  However...
> > > > > 
> > > > > (Part 2)
> > > > > 
> > > > > The second thing I thought about is, what if there's a lot of work after
> > > > > transaction 0?  Say the recovery stream looks like:
> > > > > 
> > > > > ----------------
> > > > > Transaction 0
> > > > > BUI1 unmap file1, offset1, startblock1, length == 200000
> > > > > ----------------
> > > > > <10000 more intents applied to other things>
> > > > > ----------------
> > > > > <end>
> > > > > 
> > > > > This is a problem, because we must drop the ILOCK after completing the
> > > > > BUI1 work so that we can process the 10000 more intents before we can
> > > > > get to BUI2.  We cannot let the tail of the log get pinned on the locked
> > > > > inode, so we must release the ILOCK after we're done with BUI1 and
> > > > > re-acquire it when we're ready to deal with BUI2.
> > > > > 
> > > > > This means I need /two/ flags in struct xfs_bmap_intent -- one to say
> > > > > that we need to irele the inode at the end of processing, and another to
> > > > > say that the bmap item needs to grab the ILOCK.  If there's going to be
> > > > > a BUI2 as a result of recovering BUI1, the first flag must be propagated
> > > > > to BUI2, but the second flag must /not/ be propagated.
> > > > > 
> > > > > A potential counterargument is that we crammed all 10000 intents into
> > > > > the log without pinning the log, so maybe we can hold the ILOCK.
> > > > > However....
> > > > > 
> > > > > (Part 3)
> > > > > 
> > > > > Next, I considered how the swapext log item in the atomic file update
> > > > > patchset would interact with log recovery.  A couple of notes about the
> > > > > swapext log items:
> > > > > 
> > > > > 1. They are careful not to create the kinds of bunmap operations that
> > > > > would cause the creation of /more/ BUIs.
> > > > > 
> > > > > 2. Every time we log one extent's worth of unmap/remap operations (using
> > > > > deferred bmap log items, naturally) we log an done item for the original
> > > > > swapext intent item and immediately log another swapext intent for the
> > > > > work remaining.
> > > > > 
> > > > > I came up with the following sequence of transactions to recover:
> > > > > 
> > > > > ----------------
> > > > > Transaction 0
> > > > > SXI0 file1, offset1, file2, offset2, length == 10
> > > > > ----------------
> > > > > Transaction 1
> > > > > BUI1 unmap file1, offset1, startblock1, length == 2
> > > > > BUI2 unmap file2, offset2, startblock2, length == 2
> > > > > BUI3 map file1, offset1, startblock2, length == 2
> > > > > BUI4 map file2, offset2, startblock1, length == 2
> > > > > SXD done (SXI0)
> > > > > SXI5 file1, offset1 + 2, file2, offset2 + 2, length == 8
> > > > > ---------------
> > > > > <end of log>
> > > > > 
> > > > > Recovery in this case will end up with BUI[1-4] and SXI5 that still need
> > > > > processing.  Each of the 5 intent items gets its own recovery
> > > > > transaction and dfops freezer chain.  BUI1, BUI3, and SXI5 will each
> > > > > require ilock'd access to file1, which means that each of them will have
> > > > > to have the ability to drop the ILOCK but retain the reference.  The
> > > > > same argument applies to file2 and BUI2, BUI4, and SXI5.  Note also that
> > > > > in our brave new world, file1 and file2 can be the same.
> > > > > 
> > > > > For the swapext log item type this inode management is particularly
> > > > > acute because we're certain to have new SXIs created as a result of
> > > > > recovering an SXI.
> > > > > 
> > > > > (End)
> > > > > 
> > > > > In conclusion, we need a mechanism to drop both the inode lock and the
> > > > > inode reference.  Each log item type that works with inodes will have to
> > > > > set aside 2 flags somewhere in the incore extent structure, and provide
> > > > > more or less the same code to manage that state.  Right now that's just
> > > > > the bmap items, but then the swapext items and the deferred xattr items
> > > > > will have that same requirement when they get added.
> > > > > 
> > > > > If we do that, the inode reference and ilock management diverges even
> > > > > further from regular operations.  Regular callers are expected to iget
> > > > > and ilock the inode and maintain that all the way to the end of
> > > > > xfs_trans_commit.  Log recovery, on the other hand, will add a bunch of
> > > > > state handling to some of the log items so that we can drop the ILOCK
> > > > > after the first item, and then drop the inode reference after finishing
> > > > > the last possible user of that inode in the revived dfops chain.
> > > > > 
> > > > > On the other hand, keeping the inode management in the defer freezer
> > > > > code means that I keep all that complexity and state management out of
> > > > > the ->finish_item code.  When we go to finish the new incore intents
> > > > > that get created during recovery, the inode reference and ILOCK rules
> > > > > are the same as anywhere else -- the caller is required to grab the
> > > > > inode, the transaction, and the ilock (in that order).
> > > > > 
> > > > > To the extent that recovering log intent items is /not/ the same as
> > > > > running a normal transaction, all that weirdness is concentrated in
> > > > > xfs_log_recover.c and a single function in xfs_defer.c.  The desired
> > > > > behavior is the same across all the log intent item types, so I
> > > > > decided in favor of keeping the centralised behavior and (trying to)
> > > > > contain the wacky.  We don't need all that right away, but we will.
> > > > > 
> > > > > Note that I did make it so that we retain the reference always, so
> > > > > there's no longer a need for irele/igrab cycles, which makes capturing
> > > > > the dfops chain less error prone.
> > > > > 
> > > > > --D
> > > > > 
> > > > > > Brian
> > > > > > 
> > > > > > [1] https://lore.kernel.org/linux-xfs/20200429235818.GX6742@magnolia/
> > > > > > 
> > > > > > >  fs/xfs/libxfs/xfs_defer.c |   50 +++++++++++++++++++++++++++++++++++++++++++++
> > > > > > >  fs/xfs/libxfs/xfs_defer.h |   10 +++++++++
> > > > > > >  fs/xfs/xfs_bmap_item.c    |    7 ++++--
> > > > > > >  fs/xfs/xfs_icache.c       |   19 +++++++++++++++++
> > > > > > >  4 files changed, 83 insertions(+), 3 deletions(-)
> > > > > > > 
> > > > > > > 
> > > > > > > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > > > > > > index ea4d28851bbd..72933fdafcb2 100644
> > > > > > > --- a/fs/xfs/libxfs/xfs_defer.c
> > > > > > > +++ b/fs/xfs/libxfs/xfs_defer.c
> > > > > > > @@ -16,6 +16,7 @@
> > > > > > >  #include "xfs_inode.h"
> > > > > > >  #include "xfs_inode_item.h"
> > > > > > >  #include "xfs_trace.h"
> > > > > > > +#include "xfs_icache.h"
> > > > > > >  
> > > > > > >  /*
> > > > > > >   * Deferred Operations in XFS
> > > > > > > @@ -583,8 +584,19 @@ xfs_defer_thaw(
> > > > > > >  	struct xfs_defer_freezer	*dff,
> > > > > > >  	struct xfs_trans		*tp)
> > > > > > >  {
> > > > > > > +	int				i;
> > > > > > > +
> > > > > > >  	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> > > > > > >  
> > > > > > > +	/* Re-acquire the inode locks. */
> > > > > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > > > > +		if (!dff->dff_inodes[i])
> > > > > > > +			break;
> > > > > > > +
> > > > > > > +		dff->dff_ilocks[i] = XFS_ILOCK_EXCL;
> > > > > > > +		xfs_ilock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> > > > > > > +	}
> > > > > > > +
> > > > > > >  	/* Add the dfops items to the transaction. */
> > > > > > >  	list_splice_init(&dff->dff_dfops, &tp->t_dfops);
> > > > > > >  	tp->t_flags |= dff->dff_tpflags;
> > > > > > > @@ -597,5 +609,43 @@ xfs_defer_freeezer_finish(
> > > > > > >  	struct xfs_defer_freezer	*dff)
> > > > > > >  {
> > > > > > >  	xfs_defer_cancel_list(mp, &dff->dff_dfops);
> > > > > > > +	xfs_defer_freezer_irele(dff);
> > > > > > >  	kmem_free(dff);
> > > > > > >  }
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * Attach an inode to this deferred ops freezer.  Callers must hold ILOCK_EXCL,
> > > > > > > + * which will be dropped and reacquired when we're ready to thaw the frozen
> > > > > > > + * deferred ops.
> > > > > > > + */
> > > > > > > +int
> > > > > > > +xfs_defer_freezer_ijoin(
> > > > > > > +	struct xfs_defer_freezer	*dff,
> > > > > > > +	struct xfs_inode		*ip)
> > > > > > > +{
> > > > > > > +	unsigned int			i;
> > > > > > > +
> > > > > > > +	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> > > > > > > +
> > > > > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > > > > +		if (dff->dff_inodes[i] == ip)
> > > > > > > +			goto out;
> > > > > > > +		if (dff->dff_inodes[i] == NULL)
> > > > > > > +			break;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	if (i == XFS_DEFER_FREEZER_INODES) {
> > > > > > > +		ASSERT(0);
> > > > > > > +		return -EFSCORRUPTED;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	/*
> > > > > > > +	 * Attach this inode to the freezer and drop its ILOCK because we
> > > > > > > +	 * assume the caller will need to allocate a transaction.
> > > > > > > +	 */
> > > > > > > +	dff->dff_inodes[i] = ip;
> > > > > > > +	dff->dff_ilocks[i] = 0;
> > > > > > > +out:
> > > > > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > > > +	return 0;
> > > > > > > +}
> > > > > > > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> > > > > > > index 7ae05e10d750..0052a0313283 100644
> > > > > > > --- a/fs/xfs/libxfs/xfs_defer.h
> > > > > > > +++ b/fs/xfs/libxfs/xfs_defer.h
> > > > > > > @@ -76,6 +76,11 @@ struct xfs_defer_freezer {
> > > > > > >  	/* Deferred ops state saved from the transaction. */
> > > > > > >  	struct list_head	dff_dfops;
> > > > > > >  	unsigned int		dff_tpflags;
> > > > > > > +
> > > > > > > +	/* Inodes to hold when we want to finish the deferred work items. */
> > > > > > > +#define XFS_DEFER_FREEZER_INODES	2
> > > > > > > +	unsigned int		dff_ilocks[XFS_DEFER_FREEZER_INODES];
> > > > > > > +	struct xfs_inode	*dff_inodes[XFS_DEFER_FREEZER_INODES];
> > > > > > >  };
> > > > > > >  
> > > > > > >  /* Functions to freeze a chain of deferred operations for later. */
> > > > > > > @@ -83,5 +88,10 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp);
> > > > > > >  void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp);
> > > > > > >  void xfs_defer_freeezer_finish(struct xfs_mount *mp,
> > > > > > >  		struct xfs_defer_freezer *dff);
> > > > > > > +int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff,
> > > > > > > +		struct xfs_inode *ip);
> > > > > > > +
> > > > > > > +/* These functions must be provided by the xfs implementation. */
> > > > > > > +void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff);
> > > > > > >  
> > > > > > >  #endif /* __XFS_DEFER_H__ */
> > > > > > > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> > > > > > > index c733bdeeeb9b..bbce191d8fcd 100644
> > > > > > > --- a/fs/xfs/xfs_bmap_item.c
> > > > > > > +++ b/fs/xfs/xfs_bmap_item.c
> > > > > > > @@ -530,12 +530,13 @@ xfs_bui_item_recover(
> > > > > > >  	}
> > > > > > >  
> > > > > > >  	error = xlog_recover_trans_commit(tp, dffp);
> > > > > > > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > > > -	xfs_irele(ip);
> > > > > > > -	return error;
> > > > > > > +	if (error)
> > > > > > > +		goto err_rele;
> > > > > > > +	return xfs_defer_freezer_ijoin(*dffp, ip);
> > > > > > >  
> > > > > > >  err_inode:
> > > > > > >  	xfs_trans_cancel(tp);
> > > > > > > +err_rele:
> > > > > > >  	if (ip) {
> > > > > > >  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > > >  		xfs_irele(ip);
> > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > > > > index 17a0b86fe701..b96ddf5ff334 100644
> > > > > > > --- a/fs/xfs/xfs_icache.c
> > > > > > > +++ b/fs/xfs/xfs_icache.c
> > > > > > > @@ -12,6 +12,7 @@
> > > > > > >  #include "xfs_sb.h"
> > > > > > >  #include "xfs_mount.h"
> > > > > > >  #include "xfs_inode.h"
> > > > > > > +#include "xfs_defer.h"
> > > > > > >  #include "xfs_trans.h"
> > > > > > >  #include "xfs_trans_priv.h"
> > > > > > >  #include "xfs_inode_item.h"
> > > > > > > @@ -1847,3 +1848,21 @@ xfs_start_block_reaping(
> > > > > > >  	xfs_queue_eofblocks(mp);
> > > > > > >  	xfs_queue_cowblocks(mp);
> > > > > > >  }
> > > > > > > +
> > > > > > > +/* Release all the inode resources attached to this freezer. */
> > > > > > > +void
> > > > > > > +xfs_defer_freezer_irele(
> > > > > > > +	struct xfs_defer_freezer	*dff)
> > > > > > > +{
> > > > > > > +	unsigned int			i;
> > > > > > > +
> > > > > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > > > > +		if (!dff->dff_inodes[i])
> > > > > > > +			break;
> > > > > > > +
> > > > > > > +		if (dff->dff_ilocks[i])
> > > > > > > +			xfs_iunlock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> > > > > > > +		xfs_irele(dff->dff_inodes[i]);
> > > > > > > +		dff->dff_inodes[i] = NULL;
> > > > > > > +	}
> > > > > > > +}
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 0/3] xfs: fix inode use-after-free during log recovery
@ 2020-04-22  2:08 Darrick J. Wong
  2020-04-22  2:08 ` [PATCH 2/3] xfs: reduce log recovery transaction block reservations Darrick J. Wong
  0 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2020-04-22  2:08 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

Hi all,

Fix a use-after-free during log recovery of deferred operations by
creating explicit freeze and thaw mechanisms for deferred ops that were
created while processing intent items that were recovered from the log.
While we're at it, fix all the bogosity around how we gather up log
intents during recovery and actually commit them to the filesystem.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fix-log-recovery

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=fix-log-recovery

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 2/3] xfs: reduce log recovery transaction block reservations
  2020-04-22  2:08 [PATCH 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong
@ 2020-04-22  2:08 ` Darrick J. Wong
  2020-04-24 14:04   ` Brian Foster
  0 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2020-04-22  2:08 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

On filesystems that support them, bmap intent log items can be used to
change mappings in inode data or attr forks.  However, if the bmbt must
expand, the enormous block reservations that we make for finishing
chains of deferred log items become a liability because the bmbt block
allocator sets minleft to the transaction reservation and there probably
aren't any AGs in the filesystem that have that much free space.

Whereas previously we would reserve 93% of the free blocks in the
filesystem, now we only want to reserve 7/8ths of the free space in the
least full AG, and no more than half of the usable blocks in an AG.  In
theory we shouldn't run out of space because (prior to the unclean
shutdown) all of the in-progress transactions successfully reserved the
worst case number of disk blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_log_recover.c |   55 ++++++++++++++++++++++++++++++++++++----------
 1 file changed, 43 insertions(+), 12 deletions(-)


diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index e9b3e901d009..a416b028b320 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2669,6 +2669,44 @@ xlog_recover_process_data(
 	return 0;
 }
 
+/*
+ * Estimate a block reservation for a log recovery transaction.  Since we run
+ * separate transactions for each chain of deferred ops that get created as a
+ * result of recovering unfinished log intent items, we must be careful not to
+ * reserve so many blocks that block allocations fail because we can't satisfy
+ * the minleft requirements (e.g. for bmbt blocks).
+ */
+static int
+xlog_estimate_recovery_resblks(
+	struct xfs_mount	*mp,
+	unsigned int		*resblks)
+{
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+	unsigned int		free = 0;
+	int			error;
+
+	/* Don't use more than 7/8th of the free space in the least full AG. */
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+		unsigned int	ag_free;
+
+		error = xfs_alloc_pagf_init(mp, NULL, agno, 0);
+		if (error)
+			return error;
+		pag = xfs_perag_get(mp, agno);
+		ag_free = pag->pagf_freeblks + pag->pagf_flcount;
+		free = max(free, (ag_free * 7) / 8);
+		xfs_perag_put(pag);
+	}
+
+	/* Don't try to reserve more than half the usable AG blocks. */
+	*resblks = min(free, xfs_alloc_ag_max_usable(mp) / 2);
+	if (*resblks == 0)
+		return -ENOSPC;
+
+	return 0;
+}
+
 /* Take all the collected deferred ops and finish them in order. */
 static int
 xlog_finish_defer_ops(
@@ -2677,27 +2715,20 @@ xlog_finish_defer_ops(
 {
 	struct xfs_defer_freezer *dff, *next;
 	struct xfs_trans	*tp;
-	int64_t			freeblks;
 	uint			resblks;
 	int			error = 0;
 
 	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
+		error = xlog_estimate_recovery_resblks(mp, &resblks);
+		if (error)
+			break;
+
 		/*
 		 * We're finishing the defer_ops that accumulated as a result
 		 * of recovering unfinished intent items during log recovery.
 		 * We reserve an itruncate transaction because it is the
-		 * largest permanent transaction type.  Since we're the only
-		 * user of the fs right now, take 93% (15/16) of the available
-		 * free blocks.  Use weird math to avoid a 64-bit division.
+		 * largest permanent transaction type.
 		 */
-		freeblks = percpu_counter_sum(&mp->m_fdblocks);
-		if (freeblks <= 0) {
-			error = -ENOSPC;
-			break;
-		}
-
-		resblks = min_t(int64_t, UINT_MAX, freeblks);
-		resblks = (resblks * 15) >> 4;
 		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
 				0, XFS_TRANS_RESERVE, &tp);
 		if (error)


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] xfs: reduce log recovery transaction block reservations
  2020-04-22  2:08 ` [PATCH 2/3] xfs: reduce log recovery transaction block reservations Darrick J. Wong
@ 2020-04-24 14:04   ` Brian Foster
  2020-04-28 22:22     ` Darrick J. Wong
  0 siblings, 1 reply; 19+ messages in thread
From: Brian Foster @ 2020-04-24 14:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Apr 21, 2020 at 07:08:20PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> On filesystems that support them, bmap intent log items can be used to
> change mappings in inode data or attr forks.  However, if the bmbt must
> expand, the enormous block reservations that we make for finishing
> chains of deferred log items become a liability because the bmbt block
> allocator sets minleft to the transaction reservation and there probably
> aren't any AGs in the filesystem that have that much free space.
> 
> Whereas previously we would reserve 93% of the free blocks in the
> filesystem, now we only want to reserve 7/8ths of the free space in the
> least full AG, and no more than half of the usable blocks in an AG.  In
> theory we shouldn't run out of space because (prior to the unclean
> shutdown) all of the in-progress transactions successfully reserved the
> worst case number of disk blocks.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_log_recover.c |   55 ++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 43 insertions(+), 12 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index e9b3e901d009..a416b028b320 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2669,6 +2669,44 @@ xlog_recover_process_data(
>  	return 0;
>  }
>  
> +/*
> + * Estimate a block reservation for a log recovery transaction.  Since we run
> + * separate transactions for each chain of deferred ops that get created as a
> + * result of recovering unfinished log intent items, we must be careful not to
> + * reserve so many blocks that block allocations fail because we can't satisfy
> + * the minleft requirements (e.g. for bmbt blocks).
> + */
> +static int
> +xlog_estimate_recovery_resblks(
> +	struct xfs_mount	*mp,
> +	unsigned int		*resblks)
> +{
> +	struct xfs_perag	*pag;
> +	xfs_agnumber_t		agno;
> +	unsigned int		free = 0;
> +	int			error;
> +
> +	/* Don't use more than 7/8th of the free space in the least full AG. */
> +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> +		unsigned int	ag_free;
> +
> +		error = xfs_alloc_pagf_init(mp, NULL, agno, 0);
> +		if (error)
> +			return error;
> +		pag = xfs_perag_get(mp, agno);
> +		ag_free = pag->pagf_freeblks + pag->pagf_flcount;
> +		free = max(free, (ag_free * 7) / 8);
> +		xfs_perag_put(pag);
> +	}
> +

Somewhat unfortunate that we have to iterate all AGs for each chain. I'm
wondering if that has any effect on a large recovery on fs' with an
inordinate AG count. Have you tested under those particular conditions?
I suppose it's possible the recovery is slow enough that this won't
matter...

Also, perhaps not caused by this patch but does this
outsized/manufactured reservation have the effect of artificially
steering allocations to a particular AG if one happens to be notably
larger than the rest?

Brian

> +	/* Don't try to reserve more than half the usable AG blocks. */
> +	*resblks = min(free, xfs_alloc_ag_max_usable(mp) / 2);
> +	if (*resblks == 0)
> +		return -ENOSPC;
> +
> +	return 0;
> +}
> +
>  /* Take all the collected deferred ops and finish them in order. */
>  static int
>  xlog_finish_defer_ops(
> @@ -2677,27 +2715,20 @@ xlog_finish_defer_ops(
>  {
>  	struct xfs_defer_freezer *dff, *next;
>  	struct xfs_trans	*tp;
> -	int64_t			freeblks;
>  	uint			resblks;
>  	int			error = 0;
>  
>  	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
> +		error = xlog_estimate_recovery_resblks(mp, &resblks);
> +		if (error)
> +			break;
> +
>  		/*
>  		 * We're finishing the defer_ops that accumulated as a result
>  		 * of recovering unfinished intent items during log recovery.
>  		 * We reserve an itruncate transaction because it is the
> -		 * largest permanent transaction type.  Since we're the only
> -		 * user of the fs right now, take 93% (15/16) of the available
> -		 * free blocks.  Use weird math to avoid a 64-bit division.
> +		 * largest permanent transaction type.
>  		 */
> -		freeblks = percpu_counter_sum(&mp->m_fdblocks);
> -		if (freeblks <= 0) {
> -			error = -ENOSPC;
> -			break;
> -		}
> -
> -		resblks = min_t(int64_t, UINT_MAX, freeblks);
> -		resblks = (resblks * 15) >> 4;
>  		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
>  				0, XFS_TRANS_RESERVE, &tp);
>  		if (error)
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] xfs: reduce log recovery transaction block reservations
  2020-04-24 14:04   ` Brian Foster
@ 2020-04-28 22:22     ` Darrick J. Wong
  2020-05-27 22:39       ` Darrick J. Wong
  0 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2020-04-28 22:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Apr 24, 2020 at 10:04:08AM -0400, Brian Foster wrote:
> On Tue, Apr 21, 2020 at 07:08:20PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > On filesystems that support them, bmap intent log items can be used to
> > change mappings in inode data or attr forks.  However, if the bmbt must
> > expand, the enormous block reservations that we make for finishing
> > chains of deferred log items become a liability because the bmbt block
> > allocator sets minleft to the transaction reservation and there probably
> > aren't any AGs in the filesystem that have that much free space.
> > 
> > Whereas previously we would reserve 93% of the free blocks in the
> > filesystem, now we only want to reserve 7/8ths of the free space in the
> > least full AG, and no more than half of the usable blocks in an AG.  In
> > theory we shouldn't run out of space because (prior to the unclean
> > shutdown) all of the in-progress transactions successfully reserved the
> > worst case number of disk blocks.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_log_recover.c |   55 ++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 43 insertions(+), 12 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index e9b3e901d009..a416b028b320 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -2669,6 +2669,44 @@ xlog_recover_process_data(
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Estimate a block reservation for a log recovery transaction.  Since we run
> > + * separate transactions for each chain of deferred ops that get created as a
> > + * result of recovering unfinished log intent items, we must be careful not to
> > + * reserve so many blocks that block allocations fail because we can't satisfy
> > + * the minleft requirements (e.g. for bmbt blocks).
> > + */
> > +static int
> > +xlog_estimate_recovery_resblks(
> > +	struct xfs_mount	*mp,
> > +	unsigned int		*resblks)
> > +{
> > +	struct xfs_perag	*pag;
> > +	xfs_agnumber_t		agno;
> > +	unsigned int		free = 0;
> > +	int			error;
> > +
> > +	/* Don't use more than 7/8th of the free space in the least full AG. */
> > +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> > +		unsigned int	ag_free;
> > +
> > +		error = xfs_alloc_pagf_init(mp, NULL, agno, 0);
> > +		if (error)
> > +			return error;
> > +		pag = xfs_perag_get(mp, agno);
> > +		ag_free = pag->pagf_freeblks + pag->pagf_flcount;
> > +		free = max(free, (ag_free * 7) / 8);
> > +		xfs_perag_put(pag);
> > +	}
> > +
> 
> Somewhat unfortunate that we have to iterate all AGs for each chain. I'm
> wondering if that has any effect on a large recovery on fs' with an
> inordinate AG count. Have you tested under those particular conditions?
> I suppose it's possible the recovery is slow enough that this won't
> matter...

I admit I haven't actually looked at that.  A more precise way to handle
this would be for each intent recovery function to store its own worst
case resblks estimation in the recovery freezer so that we'd be using
roughly the same space requirements as the pre-crash transaction, but
that's also a lot more complicated than this kludge.

> Also, perhaps not caused by this patch but does this
> outsized/manufactured reservation have the effect of artificially
> steering allocations to a particular AG if one happens to be notably
> larger than the rest?

It tends to steer allocations towards whichever AGs were less full at
the start of each transaction.  Were we to shift to scanning the AGs
once for the entire recovery cycle, I think I'd probably pick a smaller
ratio.

--D

> Brian
> 
> > +	/* Don't try to reserve more than half the usable AG blocks. */
> > +	*resblks = min(free, xfs_alloc_ag_max_usable(mp) / 2);
> > +	if (*resblks == 0)
> > +		return -ENOSPC;
> > +
> > +	return 0;
> > +}
> > +
> >  /* Take all the collected deferred ops and finish them in order. */
> >  static int
> >  xlog_finish_defer_ops(
> > @@ -2677,27 +2715,20 @@ xlog_finish_defer_ops(
> >  {
> >  	struct xfs_defer_freezer *dff, *next;
> >  	struct xfs_trans	*tp;
> > -	int64_t			freeblks;
> >  	uint			resblks;
> >  	int			error = 0;
> >  
> >  	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
> > +		error = xlog_estimate_recovery_resblks(mp, &resblks);
> > +		if (error)
> > +			break;
> > +
> >  		/*
> >  		 * We're finishing the defer_ops that accumulated as a result
> >  		 * of recovering unfinished intent items during log recovery.
> >  		 * We reserve an itruncate transaction because it is the
> > -		 * largest permanent transaction type.  Since we're the only
> > -		 * user of the fs right now, take 93% (15/16) of the available
> > -		 * free blocks.  Use weird math to avoid a 64-bit division.
> > +		 * largest permanent transaction type.
> >  		 */
> > -		freeblks = percpu_counter_sum(&mp->m_fdblocks);
> > -		if (freeblks <= 0) {
> > -			error = -ENOSPC;
> > -			break;
> > -		}
> > -
> > -		resblks = min_t(int64_t, UINT_MAX, freeblks);
> > -		resblks = (resblks * 15) >> 4;
> >  		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> >  				0, XFS_TRANS_RESERVE, &tp);
> >  		if (error)
> > 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] xfs: reduce log recovery transaction block reservations
  2020-04-28 22:22     ` Darrick J. Wong
@ 2020-05-27 22:39       ` Darrick J. Wong
  0 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2020-05-27 22:39 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Apr 28, 2020 at 03:22:47PM -0700, Darrick J. Wong wrote:
> On Fri, Apr 24, 2020 at 10:04:08AM -0400, Brian Foster wrote:
> > On Tue, Apr 21, 2020 at 07:08:20PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > On filesystems that support them, bmap intent log items can be used to
> > > change mappings in inode data or attr forks.  However, if the bmbt must
> > > expand, the enormous block reservations that we make for finishing
> > > chains of deferred log items become a liability because the bmbt block
> > > allocator sets minleft to the transaction reservation and there probably
> > > aren't any AGs in the filesystem that have that much free space.
> > > 
> > > Whereas previously we would reserve 93% of the free blocks in the
> > > filesystem, now we only want to reserve 7/8ths of the free space in the
> > > least full AG, and no more than half of the usable blocks in an AG.  In
> > > theory we shouldn't run out of space because (prior to the unclean
> > > shutdown) all of the in-progress transactions successfully reserved the
> > > worst case number of disk blocks.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/xfs_log_recover.c |   55 ++++++++++++++++++++++++++++++++++++----------
> > >  1 file changed, 43 insertions(+), 12 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > > index e9b3e901d009..a416b028b320 100644
> > > --- a/fs/xfs/xfs_log_recover.c
> > > +++ b/fs/xfs/xfs_log_recover.c
> > > @@ -2669,6 +2669,44 @@ xlog_recover_process_data(
> > >  	return 0;
> > >  }
> > >  
> > > +/*
> > > + * Estimate a block reservation for a log recovery transaction.  Since we run
> > > + * separate transactions for each chain of deferred ops that get created as a
> > > + * result of recovering unfinished log intent items, we must be careful not to
> > > + * reserve so many blocks that block allocations fail because we can't satisfy
> > > + * the minleft requirements (e.g. for bmbt blocks).
> > > + */
> > > +static int
> > > +xlog_estimate_recovery_resblks(
> > > +	struct xfs_mount	*mp,
> > > +	unsigned int		*resblks)
> > > +{
> > > +	struct xfs_perag	*pag;
> > > +	xfs_agnumber_t		agno;
> > > +	unsigned int		free = 0;
> > > +	int			error;
> > > +
> > > +	/* Don't use more than 7/8th of the free space in the least full AG. */
> > > +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> > > +		unsigned int	ag_free;
> > > +
> > > +		error = xfs_alloc_pagf_init(mp, NULL, agno, 0);
> > > +		if (error)
> > > +			return error;
> > > +		pag = xfs_perag_get(mp, agno);
> > > +		ag_free = pag->pagf_freeblks + pag->pagf_flcount;
> > > +		free = max(free, (ag_free * 7) / 8);
> > > +		xfs_perag_put(pag);
> > > +	}
> > > +
> > 
> > Somewhat unfortunate that we have to iterate all AGs for each chain. I'm
> > wondering if that has any effect on a large recovery on fs' with an
> > inordinate AG count. Have you tested under those particular conditions?
> > I suppose it's possible the recovery is slow enough that this won't
> > matter...
> 
> I admit I haven't actually looked at that.  A more precise way to handle
> this would be for each intent recovery function to store its own worst
> case resblks estimation in the recovery freezer so that we'd be using
> roughly the same space requirements as the pre-crash transaction, but
> that's also a lot more complicated than this kludge.

I've been testing a new patch that rips out all of this in favor of
stealing the unused block reservation from the transaction that the log
item recovery function creates instead of this hokey AG iteration mess.
The results look promising, and I think I'll do this instead.

Granted, this now means that log intent item recovery must be very
careful to reserve enough blocks for the entire chain of operations that
could be created.  This shouldn't be too heavy a lift since we don't
have that many intent types yet, and at least at first glance the
resblks we ask for now looked ok to me.

--D

> > Also, perhaps not caused by this patch but does this
> > outsized/manufactured reservation have the effect of artificially
> > steering allocations to a particular AG if one happens to be notably
> > larger than the rest?
> 
> It tends to steer allocations towards whichever AGs were less full at
> the start of each transaction.  Were we to shift to scanning the AGs
> once for the entire recovery cycle, I think I'd probably pick a smaller
> ratio.
> 
> --D
> 
> > Brian
> > 
> > > +	/* Don't try to reserve more than half the usable AG blocks. */
> > > +	*resblks = min(free, xfs_alloc_ag_max_usable(mp) / 2);
> > > +	if (*resblks == 0)
> > > +		return -ENOSPC;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > >  /* Take all the collected deferred ops and finish them in order. */
> > >  static int
> > >  xlog_finish_defer_ops(
> > > @@ -2677,27 +2715,20 @@ xlog_finish_defer_ops(
> > >  {
> > >  	struct xfs_defer_freezer *dff, *next;
> > >  	struct xfs_trans	*tp;
> > > -	int64_t			freeblks;
> > >  	uint			resblks;
> > >  	int			error = 0;
> > >  
> > >  	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
> > > +		error = xlog_estimate_recovery_resblks(mp, &resblks);
> > > +		if (error)
> > > +			break;
> > > +
> > >  		/*
> > >  		 * We're finishing the defer_ops that accumulated as a result
> > >  		 * of recovering unfinished intent items during log recovery.
> > >  		 * We reserve an itruncate transaction because it is the
> > > -		 * largest permanent transaction type.  Since we're the only
> > > -		 * user of the fs right now, take 93% (15/16) of the available
> > > -		 * free blocks.  Use weird math to avoid a 64-bit division.
> > > +		 * largest permanent transaction type.
> > >  		 */
> > > -		freeblks = percpu_counter_sum(&mp->m_fdblocks);
> > > -		if (freeblks <= 0) {
> > > -			error = -ENOSPC;
> > > -			break;
> > > -		}
> > > -
> > > -		resblks = min_t(int64_t, UINT_MAX, freeblks);
> > > -		resblks = (resblks * 15) >> 4;
> > >  		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> > >  				0, XFS_TRANS_RESERVE, &tp);
> > >  		if (error)
> > > 
> > 

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2020-05-27 22:39 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-05-05  1:13 [PATCH v2 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong
2020-05-05  1:13 ` [PATCH 1/3] xfs: proper replay of deferred ops queued " Darrick J. Wong
2020-05-05  2:33   ` Dave Chinner
2020-05-05  3:06     ` Darrick J. Wong
2020-05-05  5:10       ` Dave Chinner
2020-05-05 15:08         ` Darrick J. Wong
2020-05-05  1:13 ` [PATCH 2/3] xfs: reduce log recovery transaction block reservations Darrick J. Wong
2020-05-05  1:13 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong
2020-05-05 14:11   ` Brian Foster
2020-05-06  0:34     ` Darrick J. Wong
2020-05-06 13:56       ` Brian Foster
2020-05-06 17:01         ` Darrick J. Wong
2020-05-07  9:53           ` Brian Foster
2020-05-07 15:09             ` Darrick J. Wong
2020-05-07 16:58               ` Brian Foster
  -- strict thread matches above, loose matches on Subject: below --
2020-04-22  2:08 [PATCH 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong
2020-04-22  2:08 ` [PATCH 2/3] xfs: reduce log recovery transaction block reservations Darrick J. Wong
2020-04-24 14:04   ` Brian Foster
2020-04-28 22:22     ` Darrick J. Wong
2020-05-27 22:39       ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).