public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [PATCH 01/42] xfs: fix low space alloc deadlock
Date: Fri, 10 Feb 2023 09:17:44 +1100	[thread overview]
Message-ID: <20230209221825.3722244-2-david@fromorbit.com> (raw)
In-Reply-To: <20230209221825.3722244-1-david@fromorbit.com>

From: Dave Chinner <dchinner@redhat.com>

I've recently encountered an ABBA deadlock with g/476. The upcoming
changes seem to make this much easier to hit, but the underlying
problem is a pre-existing one.

Essentially, if we select an AG for allocation, then lock the AGF
and then fail to allocate for some reason (e.g. minimum length
requirements cannot be satisfied), then we drop out of the
allocation with the AGF still locked.

The caller then modifies the allocation constraints - usually
loosening them up - and tries again. This can result in trying to
access AGFs that are lower than the AGF we already have locked from
the failed attempt. e.g. the failed attempt skipped several AGs
before failing, so we have locks an AG higher than the start AG.
Retrying the allocation from the start AG then causes us to violate
AGF lock ordering and this can lead to deadlocks.

The deadlock exists even if allocation succeeds - we can do a
followup allocations in the same transaction for BMBT blocks that
aren't guaranteed to be in the same AG as the original, and can move
into higher AGs. Hence we really need to move the tp->t_firstblock
tracking down into xfs_alloc_vextent() where it can be set when we
exit with a locked AG.

xfs_alloc_vextent() can also check there if the requested
allocation falls within the allow range of AGs set by
tp->t_firstblock. If we can't allocate within the range set, we have
to fail the allocation. If we are allowed to to non-blocking AGF
locking, we can ignore the AG locking order limitations as we can
use try-locks for the first iteration over requested AG range.

This invalidates a set of post allocation asserts that check that
the allocation is always above tp->t_firstblock if it is set.
Because we can use try-locks to avoid the deadlock in some
circumstances, having a pre-existing locked AGF doesn't always
prevent allocation from lower order AGFs. Hence those ASSERTs need
to be removed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c | 69 ++++++++++++++++++++++++++++++++-------
 fs/xfs/libxfs/xfs_bmap.c  | 14 --------
 fs/xfs/xfs_trace.h        |  1 +
 3 files changed, 58 insertions(+), 26 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index f8ff81c3de76..ffe6345bfafc 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -3164,10 +3164,13 @@ xfs_alloc_vextent(
 	xfs_alloctype_t		type;	/* input allocation type */
 	int			bump_rotor = 0;
 	xfs_agnumber_t		rotorstep = xfs_rotorstep; /* inode32 agf stepper */
+	xfs_agnumber_t		minimum_agno = 0;
 
 	mp = args->mp;
 	type = args->otype = args->type;
 	args->agbno = NULLAGBLOCK;
+	if (args->tp->t_firstblock != NULLFSBLOCK)
+		minimum_agno = XFS_FSB_TO_AGNO(mp, args->tp->t_firstblock);
 	/*
 	 * Just fix this up, for the case where the last a.g. is shorter
 	 * (or there's only one a.g.) and the caller couldn't easily figure
@@ -3201,6 +3204,13 @@ xfs_alloc_vextent(
 		 */
 		args->agno = XFS_FSB_TO_AGNO(mp, args->fsbno);
 		args->pag = xfs_perag_get(mp, args->agno);
+
+		if (minimum_agno > args->agno) {
+			trace_xfs_alloc_vextent_skip_deadlock(args);
+			error = 0;
+			break;
+		}
+
 		error = xfs_alloc_fix_freelist(args, 0);
 		if (error) {
 			trace_xfs_alloc_vextent_nofix(args);
@@ -3232,6 +3242,8 @@ xfs_alloc_vextent(
 	case XFS_ALLOCTYPE_FIRST_AG:
 		/*
 		 * Rotate through the allocation groups looking for a winner.
+		 * If we are blocking, we must obey minimum_agno contraints for
+		 * avoiding ABBA deadlocks on AGF locking.
 		 */
 		if (type == XFS_ALLOCTYPE_FIRST_AG) {
 			/*
@@ -3239,7 +3251,7 @@ xfs_alloc_vextent(
 			 */
 			args->agno = XFS_FSB_TO_AGNO(mp, args->fsbno);
 			args->type = XFS_ALLOCTYPE_THIS_AG;
-			sagno = 0;
+			sagno = minimum_agno;
 			flags = 0;
 		} else {
 			/*
@@ -3248,6 +3260,7 @@ xfs_alloc_vextent(
 			args->agno = sagno = XFS_FSB_TO_AGNO(mp, args->fsbno);
 			flags = XFS_ALLOC_FLAG_TRYLOCK;
 		}
+
 		/*
 		 * Loop over allocation groups twice; first time with
 		 * trylock set, second time without.
@@ -3276,19 +3289,21 @@ xfs_alloc_vextent(
 			if (args->agno == sagno &&
 			    type == XFS_ALLOCTYPE_START_BNO)
 				args->type = XFS_ALLOCTYPE_THIS_AG;
+
 			/*
-			* For the first allocation, we can try any AG to get
-			* space.  However, if we already have allocated a
-			* block, we don't want to try AGs whose number is below
-			* sagno. Otherwise, we may end up with out-of-order
-			* locking of AGF, which might cause deadlock.
-			*/
+			 * If we are try-locking, we can't deadlock on AGF
+			 * locks, so we can wrap all the way back to the first
+			 * AG. Otherwise, wrap back to the start AG so we can't
+			 * deadlock, and let the end of scan handler decide what
+			 * to do next.
+			 */
 			if (++(args->agno) == mp->m_sb.sb_agcount) {
-				if (args->tp->t_firstblock != NULLFSBLOCK)
-					args->agno = sagno;
-				else
+				if (flags & XFS_ALLOC_FLAG_TRYLOCK)
 					args->agno = 0;
+				else
+					args->agno = sagno;
 			}
+
 			/*
 			 * Reached the starting a.g., must either be done
 			 * or switch to non-trylock mode.
@@ -3300,7 +3315,14 @@ xfs_alloc_vextent(
 					break;
 				}
 
+				/*
+				 * Blocking pass next, so we must obey minimum
+				 * agno constraints to avoid ABBA AGF deadlocks.
+				 */
 				flags = 0;
+				if (minimum_agno > sagno)
+					sagno = minimum_agno;
+
 				if (type == XFS_ALLOCTYPE_START_BNO) {
 					args->agbno = XFS_FSB_TO_AGBNO(mp,
 						args->fsbno);
@@ -3322,9 +3344,9 @@ xfs_alloc_vextent(
 		ASSERT(0);
 		/* NOTREACHED */
 	}
-	if (args->agbno == NULLAGBLOCK)
+	if (args->agbno == NULLAGBLOCK) {
 		args->fsbno = NULLFSBLOCK;
-	else {
+	} else {
 		args->fsbno = XFS_AGB_TO_FSB(mp, args->agno, args->agbno);
 #ifdef DEBUG
 		ASSERT(args->len >= args->minlen);
@@ -3335,6 +3357,29 @@ xfs_alloc_vextent(
 #endif
 
 	}
+
+	/*
+	 * We end up here with a locked AGF. If we failed, the caller is likely
+	 * going to try to allocate again with different parameters, and that
+	 * can widen the AGs that are searched for free space. If we have to do
+	 * BMBT block allocation, we have to do a new allocation.
+	 *
+	 * Hence leaving this function with the AGF locked opens up potential
+	 * ABBA AGF deadlocks because a future allocation attempt in this
+	 * transaction may attempt to lock a lower number AGF.
+	 *
+	 * We can't release the AGF until the transaction is commited, so at
+	 * this point we must update the "firstblock" tracker to point at this
+	 * AG if the tracker is empty or points to a lower AG. This allows the
+	 * next allocation attempt to be modified appropriately to avoid
+	 * deadlocks.
+	 */
+	if (args->agbp &&
+	    (args->tp->t_firstblock == NULLFSBLOCK ||
+	     args->pag->pag_agno > minimum_agno)) {
+		args->tp->t_firstblock = XFS_AGB_TO_FSB(mp,
+					args->pag->pag_agno, 0);
+	}
 	xfs_perag_put(args->pag);
 	return 0;
 error0:
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index c8c65387136c..de6d585c00f1 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3413,21 +3413,7 @@ xfs_bmap_process_allocated_extent(
 	xfs_fileoff_t		orig_offset,
 	xfs_extlen_t		orig_length)
 {
-	int			nullfb;
-
-	nullfb = ap->tp->t_firstblock == NULLFSBLOCK;
-
-	/*
-	 * check the allocation happened at the same or higher AG than
-	 * the first block that was allocated.
-	 */
-	ASSERT(nullfb ||
-		XFS_FSB_TO_AGNO(args->mp, ap->tp->t_firstblock) <=
-		XFS_FSB_TO_AGNO(args->mp, args->fsbno));
-
 	ap->blkno = args->fsbno;
-	if (nullfb)
-		ap->tp->t_firstblock = args->fsbno;
 	ap->length = args->len;
 	/*
 	 * If the extent size hint is active, we tried to round the
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 6b0e9ae7c513..adddaeb44a56 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1877,6 +1877,7 @@ DEFINE_ALLOC_EVENT(xfs_alloc_small_notenough);
 DEFINE_ALLOC_EVENT(xfs_alloc_small_done);
 DEFINE_ALLOC_EVENT(xfs_alloc_small_error);
 DEFINE_ALLOC_EVENT(xfs_alloc_vextent_badargs);
+DEFINE_ALLOC_EVENT(xfs_alloc_vextent_skip_deadlock);
 DEFINE_ALLOC_EVENT(xfs_alloc_vextent_nofix);
 DEFINE_ALLOC_EVENT(xfs_alloc_vextent_noagbp);
 DEFINE_ALLOC_EVENT(xfs_alloc_vextent_loopfailed);
-- 
2.39.0


  reply	other threads:[~2023-02-09 22:18 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-09 22:17 [PATCH v3 00/42] xfs: per-ag centric allocation alogrithms Dave Chinner
2023-02-09 22:17 ` Dave Chinner [this message]
2023-02-09 22:17 ` [PATCH 02/42] xfs: prefer free inodes at ENOSPC over chunk allocation Dave Chinner
2023-02-09 22:17 ` [PATCH 03/42] xfs: block reservation too large for minleft allocation Dave Chinner
2023-02-09 22:17 ` [PATCH 04/42] xfs: drop firstblock constraints from allocation setup Dave Chinner
2023-02-09 22:17 ` [PATCH 05/42] xfs: t_firstblock is tracking AGs not blocks Dave Chinner
2023-02-09 22:17 ` [PATCH 06/42] xfs: don't assert fail on transaction cancel with deferred ops Dave Chinner
2023-02-09 22:17 ` [PATCH 07/42] xfs: active perag reference counting Dave Chinner
2023-02-11  4:06   ` Darrick J. Wong
2023-02-09 22:17 ` [PATCH 08/42] xfs: rework the perag trace points to be perag centric Dave Chinner
2023-02-09 22:17 ` [PATCH 09/42] xfs: convert xfs_imap() to take a perag Dave Chinner
2023-02-09 22:17 ` [PATCH 10/42] xfs: use active perag references for inode allocation Dave Chinner
2023-02-09 22:17 ` [PATCH 11/42] xfs: inobt can use perags in many more places than it does Dave Chinner
2023-02-09 22:17 ` [PATCH 12/42] xfs: convert xfs_ialloc_next_ag() to an atomic Dave Chinner
2023-02-09 22:17 ` [PATCH 13/42] xfs: perags need atomic operational state Dave Chinner
2023-02-09 22:17 ` [PATCH 14/42] xfs: introduce xfs_for_each_perag_wrap() Dave Chinner
2023-02-09 22:17 ` [PATCH 15/42] xfs: rework xfs_alloc_vextent() Dave Chinner
2023-02-09 22:17 ` [PATCH 16/42] xfs: factor xfs_alloc_vextent_this_ag() for _iterate_ags() Dave Chinner
2023-02-09 22:18 ` [PATCH 17/42] xfs: combine __xfs_alloc_vextent_this_ag and xfs_alloc_ag_vextent Dave Chinner
2023-02-09 22:18 ` [PATCH 18/42] xfs: use xfs_alloc_vextent_this_ag() where appropriate Dave Chinner
2023-02-09 22:18 ` [PATCH 19/42] xfs: factor xfs_bmap_btalloc() Dave Chinner
2023-02-09 22:18 ` [PATCH 20/42] xfs: use xfs_alloc_vextent_first_ag() where appropriate Dave Chinner
2023-02-09 22:18 ` [PATCH 21/42] xfs: use xfs_alloc_vextent_start_bno() " Dave Chinner
2023-02-09 22:18 ` [PATCH 22/42] xfs: introduce xfs_alloc_vextent_near_bno() Dave Chinner
2023-02-09 22:18 ` [PATCH 23/42] xfs: introduce xfs_alloc_vextent_exact_bno() Dave Chinner
2023-02-09 22:18 ` [PATCH 24/42] xfs: introduce xfs_alloc_vextent_prepare() Dave Chinner
2023-02-09 22:18 ` [PATCH 25/42] xfs: move allocation accounting to xfs_alloc_vextent_set_fsbno() Dave Chinner
2023-02-09 22:18 ` [PATCH 26/42] xfs: fold xfs_alloc_ag_vextent() into callers Dave Chinner
2023-02-09 22:18 ` [PATCH 27/42] xfs: move the minimum agno checks into xfs_alloc_vextent_check_args Dave Chinner
2023-02-09 22:18 ` [PATCH 28/42] xfs: convert xfs_alloc_vextent_iterate_ags() to use perag walker Dave Chinner
2023-02-09 22:18 ` [PATCH 29/42] xfs: convert trim to use for_each_perag_range Dave Chinner
2023-02-09 22:18 ` [PATCH 30/42] xfs: factor out filestreams from xfs_bmap_btalloc_nullfb Dave Chinner
2023-02-09 22:18 ` [PATCH 31/42] xfs: get rid of notinit from xfs_bmap_longest_free_extent Dave Chinner
2023-02-09 22:18 ` [PATCH 32/42] xfs: use xfs_bmap_longest_free_extent() in filestreams Dave Chinner
2023-02-09 22:18 ` [PATCH 33/42] xfs: move xfs_bmap_btalloc_filestreams() to xfs_filestreams.c Dave Chinner
2023-02-09 22:18 ` [PATCH 34/42] xfs: merge filestream AG lookup into xfs_filestream_select_ag() Dave Chinner
2023-02-09 22:18 ` [PATCH 35/42] xfs: merge new filestream AG selection " Dave Chinner
2023-02-09 22:18 ` [PATCH 36/42] xfs: remove xfs_filestream_select_ag() longest extent check Dave Chinner
2023-02-09 22:18 ` [PATCH 37/42] xfs: factor out MRU hit case in xfs_filestream_select_ag Dave Chinner
2023-02-09 22:18 ` [PATCH 38/42] xfs: track an active perag reference in filestreams Dave Chinner
2023-02-09 22:18 ` [PATCH 39/42] xfs: use for_each_perag_wrap in xfs_filestream_pick_ag Dave Chinner
2023-02-09 22:18 ` [PATCH 40/42] xfs: pass perag to filestreams tracing Dave Chinner
2023-02-09 22:18 ` [PATCH 41/42] xfs: return a referenced perag from filestreams allocator Dave Chinner
2023-02-09 22:18 ` [PATCH 42/42] xfs: refactor the filestreams allocator pick functions Dave Chinner
2023-02-10  3:09 ` [PATCH v3 00/42] xfs: per-ag centric allocation alogrithms Darrick J. Wong
2023-02-10  5:00   ` Dave Chinner
  -- strict thread matches above, loose matches on Subject: below --
2023-01-18 22:44 [PATCH " Dave Chinner
2023-01-18 22:44 ` [PATCH 01/42] xfs: fix low space alloc deadlock Dave Chinner
2023-01-19 16:39   ` Allison Henderson
2024-12-23  9:43   ` luminosity1999

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230209221825.3722244-2-david@fromorbit.com \
    --to=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox