From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
patches@lists.linux.dev, Dave Chinner <dchinner@redhat.com>,
Allison Henderson <allison.henderson@oracle.com>,
"Darrick J. Wong" <djwong@kernel.org>,
Leah Rumancik <leah.rumancik@gmail.com>,
Chandan Babu R <chandanbabu@kernel.org>
Subject: [PATCH 6.1 36/73] xfs: fix low space alloc deadlock
Date: Fri, 27 Sep 2024 14:23:47 +0200
Message-ID: <20240927121721.380752328@linuxfoundation.org>
In-Reply-To: <20240927121719.897851549@linuxfoundation.org>
6.1-stable review patch. If anyone has any objections, please let me know.
------------------
From: Dave Chinner <dchinner@redhat.com>
[ Upstream commit 1dd0510f6d4b85616a36aabb9be38389467122d9 ]
I've recently encountered an ABBA deadlock with g/476. The upcoming
changes seem to make this much easier to hit, but the underlying
problem is a pre-existing one.
Essentially, if we select an AG for allocation, then lock the AGF
and then fail to allocate for some reason (e.g. minimum length
requirements cannot be satisfied), then we drop out of the
allocation with the AGF still locked.
The caller then modifies the allocation constraints - usually
loosening them up - and tries again. This can result in trying to
access AGFs that are lower than the AGF we already have locked from
the failed attempt, e.g. the failed attempt skipped several AGs
before failing, so we now hold the lock on an AGF higher than the start AG.
Retrying the allocation from the start AG then causes us to violate
AGF lock ordering and this can lead to deadlocks.
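The ordering rule being violated can be stated as a small predicate. This
is only an illustrative sketch, not kernel code: the function name and the
NO_AG marker are hypothetical, and it assumes the rule is simply "a
transaction that already holds an AGF lock may only block on AGFs with an
equal or higher AG number".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker meaning "this transaction holds no AGF lock yet". */
#define NO_AG UINT32_MAX

/*
 * Hypothetical model of the AGF lock ordering rule: return true if a
 * blocking lock attempt on AGF 'want' keeps ascending order, given that
 * the transaction already holds AGF 'held' (or NO_AG if it holds none).
 */
bool agf_blocking_lock_is_ordered(uint32_t held, uint32_t want)
{
	assert(want != NO_AG);	/* 'want' must be a real AG number */

	if (held == NO_AG)
		return true;	/* nothing locked yet, any AG is fine */
	return want >= held;	/* must never block on a lower AGF */
}
```

In the failure described above, one transaction holds AGF 3 after a failed
attempt and retries from AG 1, while another holds AGF 1 and waits for
AGF 3. The first transaction's retry is the out-of-order request.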
The deadlock exists even if allocation succeeds - we can do
followup allocations in the same transaction for BMBT blocks that
aren't guaranteed to be in the same AG as the original, and can move
into higher AGs. Hence we really need to move the tp->t_firstblock
tracking down into xfs_alloc_vextent() where it can be set when we
exit with a locked AG.
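As a rough sketch of that tracking rule, the function below models the
tracker as a bare AG number rather than a filesystem block. The name
update_tracked_agno and the NO_AG marker are hypothetical; the real patch
stores XFS_AGB_TO_FSB(mp, agno, 0) in tp->t_firstblock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker standing in for an unset tracker (NULLFSBLOCK). */
#define NO_AG UINT32_MAX

/*
 * Hypothetical model of the exit-time update: if we leave the allocator
 * with an AGF locked, move the tracker up to that AG when the tracker is
 * unset or points at a lower AG. The tracker never moves downwards.
 */
uint32_t update_tracked_agno(uint32_t tracked, bool agf_locked,
			     uint32_t locked_agno)
{
	if (!agf_locked)
		return tracked;		/* no lock held, nothing to record */

	assert(locked_agno != NO_AG);	/* a held lock names a real AG */

	if (tracked == NO_AG || locked_agno > tracked)
		return locked_agno;
	return tracked;
}
```

The last case matters because a try-lock pass may legitimately lock an AGF
below the tracked one, and that must not lower the tracked minimum.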
xfs_alloc_vextent() can also check there if the requested
allocation falls within the allowed range of AGs set by
tp->t_firstblock. If we can't allocate within the range set, we have
to fail the allocation. If we are allowed to do non-blocking AGF
locking, we can ignore the AG locking order limitations as we can
use try-locks for the first iteration over requested AG range.
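The AG iteration this describes can be sketched with two small helpers.
These are simplified stand-ins with hypothetical names, assuming plain AG
numbers and a boolean for XFS_ALLOC_FLAG_TRYLOCK; they mirror the wrap and
clamp logic in the diff below rather than reproduce it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: advance to the next AG in the scan. A try-lock
 * pass cannot deadlock, so it may wrap all the way back to AG 0. A
 * blocking pass wraps back to the start AG 'sagno' instead.
 */
uint32_t next_scan_agno(uint32_t agno, uint32_t agcount, uint32_t sagno,
			bool trylock)
{
	assert(agcount > 0 && agno < agcount);

	if (++agno == agcount)
		agno = trylock ? 0 : sagno;
	return agno;
}

/*
 * Hypothetical helper: before the blocking pass, clamp the start AG so
 * that it is never below the minimum AG implied by tp->t_firstblock.
 */
uint32_t blocking_pass_start(uint32_t sagno, uint32_t minimum_agno)
{
	return minimum_agno > sagno ? minimum_agno : sagno;
}
```

For example, with four AGs and a start AG of 2, a try-lock pass at AG 3
continues at AG 0, whereas a blocking pass goes back to AG 2 and lets the
end-of-scan handling decide what to do next.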
This invalidates a set of post allocation asserts that check that
the allocation is always above tp->t_firstblock if it is set.
Because we can use try-locks to avoid the deadlock in some
circumstances, having a pre-existing locked AGF doesn't always
prevent allocation from lower order AGFs. Hence those ASSERTs need
to be removed.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
fs/xfs/libxfs/xfs_alloc.c | 69 ++++++++++++++++++++++++++++++++++++++--------
fs/xfs/libxfs/xfs_bmap.c | 14 ---------
fs/xfs/xfs_trace.h | 1
3 files changed, 58 insertions(+), 26 deletions(-)
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -3164,10 +3164,13 @@ xfs_alloc_vextent(
xfs_alloctype_t type; /* input allocation type */
int bump_rotor = 0;
xfs_agnumber_t rotorstep = xfs_rotorstep; /* inode32 agf stepper */
+ xfs_agnumber_t minimum_agno = 0;
mp = args->mp;
type = args->otype = args->type;
args->agbno = NULLAGBLOCK;
+ if (args->tp->t_firstblock != NULLFSBLOCK)
+ minimum_agno = XFS_FSB_TO_AGNO(mp, args->tp->t_firstblock);
/*
* Just fix this up, for the case where the last a.g. is shorter
* (or there's only one a.g.) and the caller couldn't easily figure
@@ -3201,6 +3204,13 @@ xfs_alloc_vextent(
*/
args->agno = XFS_FSB_TO_AGNO(mp, args->fsbno);
args->pag = xfs_perag_get(mp, args->agno);
+
+ if (minimum_agno > args->agno) {
+ trace_xfs_alloc_vextent_skip_deadlock(args);
+ error = 0;
+ break;
+ }
+
error = xfs_alloc_fix_freelist(args, 0);
if (error) {
trace_xfs_alloc_vextent_nofix(args);
@@ -3232,6 +3242,8 @@ xfs_alloc_vextent(
case XFS_ALLOCTYPE_FIRST_AG:
/*
* Rotate through the allocation groups looking for a winner.
+ * If we are blocking, we must obey minimum_agno contraints for
+ * avoiding ABBA deadlocks on AGF locking.
*/
if (type == XFS_ALLOCTYPE_FIRST_AG) {
/*
@@ -3239,7 +3251,7 @@ xfs_alloc_vextent(
*/
args->agno = XFS_FSB_TO_AGNO(mp, args->fsbno);
args->type = XFS_ALLOCTYPE_THIS_AG;
- sagno = 0;
+ sagno = minimum_agno;
flags = 0;
} else {
/*
@@ -3248,6 +3260,7 @@ xfs_alloc_vextent(
args->agno = sagno = XFS_FSB_TO_AGNO(mp, args->fsbno);
flags = XFS_ALLOC_FLAG_TRYLOCK;
}
+
/*
* Loop over allocation groups twice; first time with
* trylock set, second time without.
@@ -3276,19 +3289,21 @@ xfs_alloc_vextent(
if (args->agno == sagno &&
type == XFS_ALLOCTYPE_START_BNO)
args->type = XFS_ALLOCTYPE_THIS_AG;
+
/*
- * For the first allocation, we can try any AG to get
- * space. However, if we already have allocated a
- * block, we don't want to try AGs whose number is below
- * sagno. Otherwise, we may end up with out-of-order
- * locking of AGF, which might cause deadlock.
- */
+ * If we are try-locking, we can't deadlock on AGF
+ * locks, so we can wrap all the way back to the first
+ * AG. Otherwise, wrap back to the start AG so we can't
+ * deadlock, and let the end of scan handler decide what
+ * to do next.
+ */
if (++(args->agno) == mp->m_sb.sb_agcount) {
- if (args->tp->t_firstblock != NULLFSBLOCK)
- args->agno = sagno;
- else
+ if (flags & XFS_ALLOC_FLAG_TRYLOCK)
args->agno = 0;
+ else
+ args->agno = sagno;
}
+
/*
* Reached the starting a.g., must either be done
* or switch to non-trylock mode.
@@ -3300,7 +3315,14 @@ xfs_alloc_vextent(
break;
}
+ /*
+ * Blocking pass next, so we must obey minimum
+ * agno constraints to avoid ABBA AGF deadlocks.
+ */
flags = 0;
+ if (minimum_agno > sagno)
+ sagno = minimum_agno;
+
if (type == XFS_ALLOCTYPE_START_BNO) {
args->agbno = XFS_FSB_TO_AGBNO(mp,
args->fsbno);
@@ -3322,9 +3344,9 @@ xfs_alloc_vextent(
ASSERT(0);
/* NOTREACHED */
}
- if (args->agbno == NULLAGBLOCK)
+ if (args->agbno == NULLAGBLOCK) {
args->fsbno = NULLFSBLOCK;
- else {
+ } else {
args->fsbno = XFS_AGB_TO_FSB(mp, args->agno, args->agbno);
#ifdef DEBUG
ASSERT(args->len >= args->minlen);
@@ -3335,6 +3357,29 @@ xfs_alloc_vextent(
#endif
}
+
+ /*
+ * We end up here with a locked AGF. If we failed, the caller is likely
+ * going to try to allocate again with different parameters, and that
+ * can widen the AGs that are searched for free space. If we have to do
+ * BMBT block allocation, we have to do a new allocation.
+ *
+ * Hence leaving this function with the AGF locked opens up potential
+ * ABBA AGF deadlocks because a future allocation attempt in this
+ * transaction may attempt to lock a lower number AGF.
+ *
+ * We can't release the AGF until the transaction is commited, so at
+ * this point we must update the "firstblock" tracker to point at this
+ * AG if the tracker is empty or points to a lower AG. This allows the
+ * next allocation attempt to be modified appropriately to avoid
+ * deadlocks.
+ */
+ if (args->agbp &&
+ (args->tp->t_firstblock == NULLFSBLOCK ||
+ args->pag->pag_agno > minimum_agno)) {
+ args->tp->t_firstblock = XFS_AGB_TO_FSB(mp,
+ args->pag->pag_agno, 0);
+ }
xfs_perag_put(args->pag);
return 0;
error0:
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3413,21 +3413,7 @@ xfs_bmap_process_allocated_extent(
xfs_fileoff_t orig_offset,
xfs_extlen_t orig_length)
{
- int nullfb;
-
- nullfb = ap->tp->t_firstblock == NULLFSBLOCK;
-
- /*
- * check the allocation happened at the same or higher AG than
- * the first block that was allocated.
- */
- ASSERT(nullfb ||
- XFS_FSB_TO_AGNO(args->mp, ap->tp->t_firstblock) <=
- XFS_FSB_TO_AGNO(args->mp, args->fsbno));
-
ap->blkno = args->fsbno;
- if (nullfb)
- ap->tp->t_firstblock = args->fsbno;
ap->length = args->len;
/*
* If the extent size hint is active, we tried to round the
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1877,6 +1877,7 @@ DEFINE_ALLOC_EVENT(xfs_alloc_small_noten
DEFINE_ALLOC_EVENT(xfs_alloc_small_done);
DEFINE_ALLOC_EVENT(xfs_alloc_small_error);
DEFINE_ALLOC_EVENT(xfs_alloc_vextent_badargs);
+DEFINE_ALLOC_EVENT(xfs_alloc_vextent_skip_deadlock);
DEFINE_ALLOC_EVENT(xfs_alloc_vextent_nofix);
DEFINE_ALLOC_EVENT(xfs_alloc_vextent_noagbp);
DEFINE_ALLOC_EVENT(xfs_alloc_vextent_loopfailed);