From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 43C01EB64D9 for ; Thu, 15 Jun 2023 03:40:35 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237861AbjFODke (ORCPT ); Wed, 14 Jun 2023 23:40:34 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49358 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237917AbjFODkc (ORCPT ); Wed, 14 Jun 2023 23:40:32 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3FF2B2120 for ; Wed, 14 Jun 2023 20:40:30 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id CD6C761AFD for ; Thu, 15 Jun 2023 03:40:29 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 38276C433C0; Thu, 15 Jun 2023 03:40:29 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1686800429; bh=3Oeh5OdRaxdvtNtILh3Z6iPuvnrkSjTLkN0sUliyRiE=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=C90yiFxqUOThIfTNJfejvoUi/75ybL9XxuKLtAVZrmBAE/uPgiWQnWIMjBLJhHejj cyazhCbgpig34w+rahnyB69M2EPgSqpQDo49kTPRKyOzgpH/lwr6HDxyZ3BLJq/OmO MJP90y7UGGW7VfRs6Dvte9qlBkzL9xMLcCkVCxs7F+qeI2fhXKo9La5In47hDwbQez R1kWAdKiFJ+Gjj5tMW9xGoAzPBIN70vrBmHTiLfUPcXzQJI03QxiChz1Cqq/Kd5vTQ 5LrrEftI+GZ/5g0g+z5iMdOLv3rYfUqE+7PGvMWT22f7x9TxkIBPqL0NajW330+L6r y49YExn9+m3rw== Date: Wed, 14 Jun 2023 20:40:28 -0700 From: "Darrick J. Wong" To: Dave Chinner Cc: linux-xfs@vger.kernel.org, chandanrlinux@gmail.com, wen.gang.wang@oracle.com Subject: Re: [PATCH 3/3] xfs: don't block in busy flushing when freeing extents Message-ID: <20230615034028.GN11441@frogsfrogsfrogs> References: <20230615014201.3171380-1-david@fromorbit.com> <20230615014201.3171380-4-david@fromorbit.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20230615014201.3171380-4-david@fromorbit.com> Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org On Thu, Jun 15, 2023 at 11:42:01AM +1000, Dave Chinner wrote: > From: Dave Chinner > > If the current transaction holds a busy extent and we are trying to > allocate a new extent to fix up the free list, we can deadlock if > the AG is entirely empty except for the busy extent held by the > transaction. > > This can occur at runtime processing an XEFI with multiple extents > in this path: > > __schedule+0x22f at ffffffff81f75e8f > schedule+0x46 at ffffffff81f76366 > xfs_extent_busy_flush+0x69 at ffffffff81477d99 > xfs_alloc_ag_vextent_size+0x16a at ffffffff8141711a > xfs_alloc_ag_vextent+0x19b at ffffffff81417edb > xfs_alloc_fix_freelist+0x22f at ffffffff8141896f > xfs_free_extent_fix_freelist+0x6a at ffffffff8141939a > __xfs_free_extent+0x99 at ffffffff81419499 > xfs_trans_free_extent+0x3e at ffffffff814a6fee > xfs_extent_free_finish_item+0x24 at ffffffff814a70d4 > xfs_defer_finish_noroll+0x1f7 at ffffffff81441407 > xfs_defer_finish+0x11 at ffffffff814417e1 > xfs_itruncate_extents_flags+0x13d at ffffffff8148b7dd > xfs_inactive_truncate+0xb9 at ffffffff8148bb89 > xfs_inactive+0x227 at ffffffff8148c4f7 > xfs_fs_destroy_inode+0xb8 at ffffffff81496898 > destroy_inode+0x3b at ffffffff8127d2ab > do_unlinkat+0x1d1 at ffffffff81270df1 > do_syscall_64+0x40 at ffffffff81f6b5f0 > entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8200007c > > This can also happen in log recovery when processing an EFI > with multiple extents through this path: > > context_switch() kernel/sched/core.c:3881 > __schedule() kernel/sched/core.c:5111 > schedule() kernel/sched/core.c:5186 > xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598 > xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641 > xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828 > xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362 > xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029 > __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067 > xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370 > xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626 > xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605 > xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893 > xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824 > xfs_log_mount_finish() fs/xfs/xfs_log.c:764 > xfs_mountfs() fs/xfs/xfs_mount.c:978 > xfs_fs_fill_super() fs/xfs/xfs_super.c:1908 > mount_bdev() fs/super.c:1417 > xfs_fs_mount() fs/xfs/xfs_super.c:1985 > legacy_get_tree() fs/fs_context.c:647 > vfs_get_tree() fs/super.c:1547 > do_new_mount() fs/namespace.c:2843 > do_mount() fs/namespace.c:3163 > ksys_mount() fs/namespace.c:3372 > __do_sys_mount() fs/namespace.c:3386 > __se_sys_mount() fs/namespace.c:3383 > __x64_sys_mount() fs/namespace.c:3383 > do_syscall_64() arch/x86/entry/common.c:296 > entry_SYSCALL_64() arch/x86/entry/entry_64.S:180 > > To avoid this deadlock, we should not block in > xfs_extent_busy_flush() if we hold a busy extent in the current > transaction. > > Now that the EFI processing code can handle requeuing a partially > completed EFI, we can detect this situation in > xfs_extent_busy_flush() and return -EAGAIN rather than going to > sleep forever. The -EAGAIN get propagated back out to the > xfs_trans_free_extent() context, where the EFD is populated and the > transaction is rolled, thereby moving the busy extents into the CIL. > > At this point, we can retry the extent free operation again with a > clean transaction. If we hit the same "all free extents are busy" > situation when trying to fix up the free list, we can safely call > xfs_extent_busy_flush() and wait for the busy extents to resolve > and wake us. At this point, the allocation search can make progress > again and we can fix up the free list. > > This deadlock was first reported by Chandan in mid-2021, but I > couldn't make myself understood during review, and didn't have time > to fix it myself. > > It was reported again in March 2023, and again I have found myself > unable to explain the complexities of the solution needed during > review. > > As such, I don't have hours more time to waste trying to get the > fix written the way it needs to be written, so I'm just doing it > myself. This patchset is largely based on Wengang Wang's last patch, > but with all the unnecessary stuff removed, split up into multiple > patches and cleaned up somewhat. Thank you for taking this over! > Reported-by: Chandan Babu R > Reported-by: Wengang Wang > Signed-off-by: Dave Chinner > --- > fs/xfs/libxfs/xfs_alloc.c | 68 ++++++++++++++++++++++++++++----------- > fs/xfs/libxfs/xfs_alloc.h | 11 ++++--- > fs/xfs/xfs_extent_busy.c | 35 +++++++++++++++++--- > fs/xfs/xfs_extent_busy.h | 6 ++-- > 4 files changed, 89 insertions(+), 31 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > index 11bd0a1756a1..7c675aae0a0f 100644 > --- a/fs/xfs/libxfs/xfs_alloc.c > +++ b/fs/xfs/libxfs/xfs_alloc.c > @@ -1556,6 +1556,8 @@ xfs_alloc_ag_vextent_near( > if (args->agbno > args->max_agbno) > args->agbno = args->max_agbno; > > + /* Retry once quickly if we find busy extents before blocking. */ > + alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH; > restart: > len = 0; > > @@ -1611,9 +1613,20 @@ xfs_alloc_ag_vextent_near( > */ > if (!acur.len) { > if (acur.busy) { > + /* > + * Our only valid extents must have been busy. Flush and > + * retry the allocation again. If we get an -EAGAIN > + * error, we're being told that a deadlock was avoided > + * and the current transaction needs committing before > + * the allocation can be retried. > + */ > trace_xfs_alloc_near_busy(args); > - xfs_extent_busy_flush(args->mp, args->pag, > - acur.busy_gen, alloc_flags); > + error = xfs_extent_busy_flush(args->tp, args->pag, > + acur.busy_gen, alloc_flags); > + if (error) > + goto out; > + > + alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH; > goto restart; > } > trace_xfs_alloc_size_neither(args); > @@ -1653,6 +1666,8 @@ xfs_alloc_ag_vextent_size( > int error; > int i; > > + /* Retry once quickly if we find busy extents before blocking. */ > + alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH; > restart: > /* > * Allocate and initialize a cursor for the by-size btree. > @@ -1710,19 +1725,25 @@ xfs_alloc_ag_vextent_size( > error = xfs_btree_increment(cnt_cur, 0, &i); > if (error) > goto error0; > - if (i == 0) { > - /* > - * Our only valid extents must have been busy. > - * Make it unbusy by forcing the log out and > - * retrying. > - */ > - xfs_btree_del_cursor(cnt_cur, > - XFS_BTREE_NOERROR); > - trace_xfs_alloc_size_busy(args); > - xfs_extent_busy_flush(args->mp, args->pag, > - busy_gen, alloc_flags); > - goto restart; > - } > + if (i) > + continue; > + > + /* > + * Our only valid extents must have been busy. Flush and > + * retry the allocation again. If we get an -EAGAIN > + * error, we're being told that a deadlock was avoided > + * and the current transaction needs committing before > + * the allocation can be retried. > + */ > + trace_xfs_alloc_size_busy(args); > + error = xfs_extent_busy_flush(args->tp, args->pag, > + busy_gen, alloc_flags); > + if (error) > + goto error0; > + > + alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH; > + xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); > + goto restart; > } > } > > @@ -1802,10 +1823,21 @@ xfs_alloc_ag_vextent_size( > args->len = rlen; > if (rlen < args->minlen) { > if (busy) { > - xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); > + /* > + * Our only valid extents must have been busy. Flush and > + * retry the allocation again. If we get an -EAGAIN > + * error, we're being told that a deadlock was avoided > + * and the current transaction needs committing before > + * the allocation can be retried. > + */ > trace_xfs_alloc_size_busy(args); > - xfs_extent_busy_flush(args->mp, args->pag, busy_gen, > - alloc_flags); > + error = xfs_extent_busy_flush(args->tp, args->pag, > + busy_gen, alloc_flags); > + if (error) > + goto error0; > + > + alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH; > + xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); > goto restart; > } > goto out_nominleft; > diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h > index d1aa7c63eafe..f267842e36ba 100644 > --- a/fs/xfs/libxfs/xfs_alloc.h > +++ b/fs/xfs/libxfs/xfs_alloc.h > @@ -19,11 +19,12 @@ unsigned int xfs_agfl_size(struct xfs_mount *mp); > /* > * Flags for xfs_alloc_fix_freelist. > */ > -#define XFS_ALLOC_FLAG_TRYLOCK 0x00000001 /* use trylock for buffer locking */ > -#define XFS_ALLOC_FLAG_FREEING 0x00000002 /* indicate caller is freeing extents*/ > -#define XFS_ALLOC_FLAG_NORMAP 0x00000004 /* don't modify the rmapbt */ > -#define XFS_ALLOC_FLAG_NOSHRINK 0x00000008 /* don't shrink the freelist */ > -#define XFS_ALLOC_FLAG_CHECK 0x00000010 /* test only, don't modify args */ > +#define XFS_ALLOC_FLAG_TRYLOCK (1U << 0) /* use trylock for buffer locking */ > +#define XFS_ALLOC_FLAG_FREEING (1U << 1) /* indicate caller is freeing extents*/ > +#define XFS_ALLOC_FLAG_NORMAP (1U << 2) /* don't modify the rmapbt */ > +#define XFS_ALLOC_FLAG_NOSHRINK (1U << 3) /* don't shrink the freelist */ > +#define XFS_ALLOC_FLAG_CHECK (1U << 4) /* test only, don't modify args */ > +#define XFS_ALLOC_FLAG_TRYFLUSH (1U << 5) /* don't wait in busy extent flush */ > > /* > * Argument structure for xfs_alloc routines. > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c > index 874798949625..7c2fdc71e42d 100644 > --- a/fs/xfs/xfs_extent_busy.c > +++ b/fs/xfs/xfs_extent_busy.c > @@ -566,21 +566,45 @@ xfs_extent_busy_clear( > > /* > * Flush out all busy extents for this AG. > + * > + * If the current transaction is holding busy extents, the caller may not want > + * to wait for committed busy extents to resolve. If we are being told just to > + * try a flush or progress has been made since we last skipped a busy extent, > + * return immediately to allow the caller to try again. > + * > + * If we are freeing extents, we might actually be holding the only free extents > + * in the transaction busy list and the log force won't resolve that situation. > + * In this case, we must return -EAGAIN to avoid a deadlock by informing the > + * caller it needs to commit the busy extents it holds before retrying the > + * extent free operation. > */ > -void > +int > xfs_extent_busy_flush( > - struct xfs_mount *mp, > + struct xfs_trans *tp, > struct xfs_perag *pag, > unsigned busy_gen, > - unsigned int flags) > + uint32_t alloc_flags) > { > DEFINE_WAIT (wait); > int error; > > - error = xfs_log_force(mp, XFS_LOG_SYNC); > + error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC); > if (error) > - return; > + return error; > > + /* Avoid deadlocks on uncommitted busy extents. */ > + if (!list_empty(&tp->t_busy)) { > + if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH) > + return 0; > + > + if (busy_gen != READ_ONCE(pag->pagb_gen)) > + return 0; > + > + if (alloc_flags & XFS_ALLOC_FLAG_FREEING) > + return -EAGAIN; > + } > + > + /* Wait for committed busy extents to resolve. */ > do { > prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE); > if (busy_gen != READ_ONCE(pag->pagb_gen)) > @@ -589,6 +613,7 @@ xfs_extent_busy_flush( > } while (1); > > finish_wait(&pag->pagb_wait, &wait); > + return 0; > } > > void > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h > index 79ef5a2d7758..2104548d6695 100644 > --- a/fs/xfs/xfs_extent_busy.h > +++ b/fs/xfs/xfs_extent_busy.h > @@ -51,9 +51,9 @@ bool > xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno, > xfs_extlen_t *len, unsigned *busy_gen); > > -void > -xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag, > - unsigned busy_gen, unsigned int flags); > +int > +xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_perag *pag, > + unsigned busy_gen, uint32_t flags); Aha, you /did/ change this by the end of the series. Reviewed-by: Darrick J. Wong --D > > void > xfs_extent_busy_wait_all(struct xfs_mount *mp); > -- > 2.40.1 >