* [PATCH] xfs: fix AGFL allocation dead lock @ 2023-03-30 20:46 Wengang Wang 2023-04-05 15:26 ` Wengang Wang 2023-04-11 2:06 ` Dave Chinner 0 siblings, 2 replies; 9+ messages in thread From: Wengang Wang @ 2023-03-30 20:46 UTC (permalink / raw) To: linux-xfs; +Cc: wen.gang.wang There is deadlock with calltrace on process 10133: PID 10133 not sceduled for 4403385ms (was on CPU[10]) #0 context_switch() kernel/sched/core.c:3881 #1 __schedule() kernel/sched/core.c:5111 #2 schedule() kernel/sched/core.c:5186 #3 xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598 #4 xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641 #5 xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828 #6 xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362 #7 xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029 #8 __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067 #9 xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370 #10 xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626 #11 xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605 #12 xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893 #13 xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824 #14 xfs_log_mount_finish() fs/xfs/xfs_log.c:764 #15 xfs_mountfs() fs/xfs/xfs_mount.c:978 #16 xfs_fs_fill_super() fs/xfs/xfs_super.c:1908 #17 mount_bdev() fs/super.c:1417 #18 xfs_fs_mount() fs/xfs/xfs_super.c:1985 #19 legacy_get_tree() fs/fs_context.c:647 #20 vfs_get_tree() fs/super.c:1547 #21 do_new_mount() fs/namespace.c:2843 #22 do_mount() fs/namespace.c:3163 #23 ksys_mount() fs/namespace.c:3372 #24 __do_sys_mount() fs/namespace.c:3386 #25 __se_sys_mount() fs/namespace.c:3383 #26 __x64_sys_mount() fs/namespace.c:3383 #27 do_syscall_64() arch/x86/entry/common.c:296 #28 entry_SYSCALL_64() arch/x86/entry/entry_64.S:180 It's waiting xfs_perag.pagb_gen to increase (busy extent clearing happen). From the vmcore, it's waiting on AG 1. And the ONLY busy extent for AG 1 is with the transaction (in xfs_trans.t_busy) for process 10133. That busy extent is created in a previous EFI with the same transaction. Process 10133 is waiting, it has no change to commit that that transaction. So busy extent clearing can't happen and pagb_gen remain unchanged. So dead lock formed. commit 06058bc40534530e617e5623775c53bb24f032cb disallowed using busy extents for any path that calls xfs_extent_busy_trim(). That looks over-killing. For AGFL block allocation, it just use the first extent that satisfies, it won't try another extent for choose a "better" one. So it's safe to reuse busy extent for AGFL. To fix above dead lock, this patch allows reusing busy extent for AGFL. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> --- fs/xfs/xfs_extent_busy.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c index ef17c1f6db32..f857a5759506 100644 --- a/fs/xfs/xfs_extent_busy.c +++ b/fs/xfs/xfs_extent_busy.c @@ -344,6 +344,7 @@ xfs_extent_busy_trim( ASSERT(*len > 0); spin_lock(&args->pag->pagb_lock); +restart: fbno = *bno; flen = *len; rbp = args->pag->pagb_tree.rb_node; @@ -362,6 +363,20 @@ xfs_extent_busy_trim( continue; } + /* + * AGFL reserving (metadata) is just using the first- + * fit extent, there is no optimization that tries further + * extents. So it's safe to reuse the busy extent and safe + * to update the busy extent. + * Reuse for AGFL even busy extent being discarded. + */ + if (args->resv == XFS_AG_RESV_AGFL) { + if (!xfs_extent_busy_update_extent(args->mp, args->pag, + busyp, fbno, flen, false)) + goto restart; + continue; + } + if (bbno <= fbno) { /* start overlap */ -- 2.21.0 (Apple Git-122.2) ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH] xfs: fix AGFL allocation dead lock 2023-03-30 20:46 [PATCH] xfs: fix AGFL allocation dead lock Wengang Wang @ 2023-04-05 15:26 ` Wengang Wang 2023-04-11 2:06 ` Dave Chinner 1 sibling, 0 replies; 9+ messages in thread From: Wengang Wang @ 2023-04-05 15:26 UTC (permalink / raw) To: linux-xfs@vger.kernel.org Hi, Anyone please look at this patch? thanks, wengang > On Mar 30, 2023, at 1:46 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote: > > There is deadlock with calltrace on process 10133: > > PID 10133 not sceduled for 4403385ms (was on CPU[10]) > #0 context_switch() kernel/sched/core.c:3881 > #1 __schedule() kernel/sched/core.c:5111 > #2 schedule() kernel/sched/core.c:5186 > #3 xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598 > #4 xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641 > #5 xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828 > #6 xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362 > #7 xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029 > #8 __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067 > #9 xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370 > #10 xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626 > #11 xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605 > #12 xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893 > #13 xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824 > #14 xfs_log_mount_finish() fs/xfs/xfs_log.c:764 > #15 xfs_mountfs() fs/xfs/xfs_mount.c:978 > #16 xfs_fs_fill_super() fs/xfs/xfs_super.c:1908 > #17 mount_bdev() fs/super.c:1417 > #18 xfs_fs_mount() fs/xfs/xfs_super.c:1985 > #19 legacy_get_tree() fs/fs_context.c:647 > #20 vfs_get_tree() fs/super.c:1547 > #21 do_new_mount() fs/namespace.c:2843 > #22 do_mount() fs/namespace.c:3163 > #23 ksys_mount() fs/namespace.c:3372 > #24 __do_sys_mount() fs/namespace.c:3386 > #25 __se_sys_mount() fs/namespace.c:3383 > #26 __x64_sys_mount() fs/namespace.c:3383 > #27 do_syscall_64() arch/x86/entry/common.c:296 > #28 entry_SYSCALL_64() arch/x86/entry/entry_64.S:180 > > It's waiting xfs_perag.pagb_gen to increase (busy extent clearing happen). > From the vmcore, it's waiting on AG 1. And the ONLY busy extent for AG 1 is > with the transaction (in xfs_trans.t_busy) for process 10133. That busy extent > is created in a previous EFI with the same transaction. Process 10133 is > waiting, it has no change to commit that that transaction. So busy extent > clearing can't happen and pagb_gen remain unchanged. So dead lock formed. > > commit 06058bc40534530e617e5623775c53bb24f032cb disallowed using busy extents > for any path that calls xfs_extent_busy_trim(). That looks over-killing. > For AGFL block allocation, it just use the first extent that satisfies, it won't > try another extent for choose a "better" one. So it's safe to reuse busy extent > for AGFL. > > To fix above dead lock, this patch allows reusing busy extent for AGFL. > > Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> > --- > fs/xfs/xfs_extent_busy.c | 15 +++++++++++++++ > 1 file changed, 15 insertions(+) > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c > index ef17c1f6db32..f857a5759506 100644 > --- a/fs/xfs/xfs_extent_busy.c > +++ b/fs/xfs/xfs_extent_busy.c > @@ -344,6 +344,7 @@ xfs_extent_busy_trim( > ASSERT(*len > 0); > > spin_lock(&args->pag->pagb_lock); > +restart: > fbno = *bno; > flen = *len; > rbp = args->pag->pagb_tree.rb_node; > @@ -362,6 +363,20 @@ xfs_extent_busy_trim( > continue; > } > > + /* > + * AGFL reserving (metadata) is just using the first- > + * fit extent, there is no optimization that tries further > + * extents. So it's safe to reuse the busy extent and safe > + * to update the busy extent. > + * Reuse for AGFL even busy extent being discarded. > + */ > + if (args->resv == XFS_AG_RESV_AGFL) { > + if (!xfs_extent_busy_update_extent(args->mp, args->pag, > + busyp, fbno, flen, false)) > + goto restart; > + continue; > + } > + > if (bbno <= fbno) { > /* start overlap */ > > -- > 2.21.0 (Apple Git-122.2) > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] xfs: fix AGFL allocation dead lock 2023-03-30 20:46 [PATCH] xfs: fix AGFL allocation dead lock Wengang Wang 2023-04-05 15:26 ` Wengang Wang @ 2023-04-11 2:06 ` Dave Chinner 2023-04-12 8:23 ` Chandan Babu R 1 sibling, 1 reply; 9+ messages in thread From: Dave Chinner @ 2023-04-11 2:06 UTC (permalink / raw) To: Wengang Wang; +Cc: linux-xfs On Thu, Mar 30, 2023 at 01:46:10PM -0700, Wengang Wang wrote: > There is deadlock with calltrace on process 10133: > > PID 10133 not sceduled for 4403385ms (was on CPU[10]) > #0 context_switch() kernel/sched/core.c:3881 > #1 __schedule() kernel/sched/core.c:5111 > #2 schedule() kernel/sched/core.c:5186 > #3 xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598 > #4 xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641 > #5 xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828 > #6 xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362 > #7 xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029 > #8 __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067 > #9 xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370 > #10 xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626 > #11 xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605 > #12 xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893 > #13 xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824 > #14 xfs_log_mount_finish() fs/xfs/xfs_log.c:764 > #15 xfs_mountfs() fs/xfs/xfs_mount.c:978 > #16 xfs_fs_fill_super() fs/xfs/xfs_super.c:1908 > #17 mount_bdev() fs/super.c:1417 > #18 xfs_fs_mount() fs/xfs/xfs_super.c:1985 > #19 legacy_get_tree() fs/fs_context.c:647 > #20 vfs_get_tree() fs/super.c:1547 > #21 do_new_mount() fs/namespace.c:2843 > #22 do_mount() fs/namespace.c:3163 > #23 ksys_mount() fs/namespace.c:3372 > #24 __do_sys_mount() fs/namespace.c:3386 > #25 __se_sys_mount() fs/namespace.c:3383 > #26 __x64_sys_mount() fs/namespace.c:3383 > #27 do_syscall_64() arch/x86/entry/common.c:296 > #28 entry_SYSCALL_64() arch/x86/entry/entry_64.S:180 > > It's waiting xfs_perag.pagb_gen to increase (busy extent clearing happen). > From the vmcore, it's waiting on AG 1. And the ONLY busy extent for AG 1 is > with the transaction (in xfs_trans.t_busy) for process 10133. That busy extent > is created in a previous EFI with the same transaction. Process 10133 is > waiting, it has no change to commit that that transaction. So busy extent > clearing can't happen and pagb_gen remain unchanged. So dead lock formed. We've talked about this "busy extent in transaction" issue before: https://lore.kernel.org/linux-xfs/20210428065152.77280-1-chandanrlinux@gmail.com/ and we were closing in on a practical solution before it went silent. I'm not sure if there's a different fix we can apply here - maybe free one extent per transaction instead of all the extents in an EFI in one transaction and relog the EFD at the end of each extent free transaction roll? > commit 06058bc40534530e617e5623775c53bb24f032cb disallowed using busy extents > for any path that calls xfs_extent_busy_trim(). That looks over-killing. > For AGFL block allocation, it just use the first extent that satisfies, it won't > try another extent for choose a "better" one. So it's safe to reuse busy extent > for AGFL. AGFL block allocation is not "for immediate use". The blocks get placed on the AGFL for -later- use, and not necessarily even within the current transaction. Hence a freelist block is still considered free space, not as used space. The difference is that we assume AGFL blocks can always be used immediately and they aren't constrained by being busy or have pending discards. Also, we have to keep in mind that we can allocate data blocks from the AGFL in low space situations. Hence it is not safe to place busy or discard-pending blocks on the AGFL, as this can result in them being allocated for user data and overwritten before the checkpoint that marked them busy has been committed to the journal.... As such, I don't think it is be safe to ignore busy extent state just because we are filling the AGFL from the current free space tree. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] xfs: fix AGFL allocation dead lock 2023-04-11 2:06 ` Dave Chinner @ 2023-04-12 8:23 ` Chandan Babu R 2023-04-12 23:59 ` Dave Chinner 0 siblings, 1 reply; 9+ messages in thread From: Chandan Babu R @ 2023-04-12 8:23 UTC (permalink / raw) To: Dave Chinner; +Cc: Wengang Wang, linux-xfs On Tue, Apr 11, 2023 at 12:06:24 PM +1000, Dave Chinner wrote: > On Thu, Mar 30, 2023 at 01:46:10PM -0700, Wengang Wang wrote: >> There is deadlock with calltrace on process 10133: >> >> PID 10133 not sceduled for 4403385ms (was on CPU[10]) >> #0 context_switch() kernel/sched/core.c:3881 >> #1 __schedule() kernel/sched/core.c:5111 >> #2 schedule() kernel/sched/core.c:5186 >> #3 xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598 >> #4 xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641 >> #5 xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828 >> #6 xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362 >> #7 xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029 >> #8 __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067 >> #9 xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370 >> #10 xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626 >> #11 xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605 >> #12 xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893 >> #13 xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824 >> #14 xfs_log_mount_finish() fs/xfs/xfs_log.c:764 >> #15 xfs_mountfs() fs/xfs/xfs_mount.c:978 >> #16 xfs_fs_fill_super() fs/xfs/xfs_super.c:1908 >> #17 mount_bdev() fs/super.c:1417 >> #18 xfs_fs_mount() fs/xfs/xfs_super.c:1985 >> #19 legacy_get_tree() fs/fs_context.c:647 >> #20 vfs_get_tree() fs/super.c:1547 >> #21 do_new_mount() fs/namespace.c:2843 >> #22 do_mount() fs/namespace.c:3163 >> #23 ksys_mount() fs/namespace.c:3372 >> #24 __do_sys_mount() fs/namespace.c:3386 >> #25 __se_sys_mount() fs/namespace.c:3383 >> #26 __x64_sys_mount() fs/namespace.c:3383 >> #27 do_syscall_64() arch/x86/entry/common.c:296 >> #28 entry_SYSCALL_64() arch/x86/entry/entry_64.S:180 >> >> It's waiting xfs_perag.pagb_gen to increase (busy extent clearing happen). >> From the vmcore, it's waiting on AG 1. And the ONLY busy extent for AG 1 is >> with the transaction (in xfs_trans.t_busy) for process 10133. That busy extent >> is created in a previous EFI with the same transaction. Process 10133 is >> waiting, it has no change to commit that that transaction. So busy extent >> clearing can't happen and pagb_gen remain unchanged. So dead lock formed. > > We've talked about this "busy extent in transaction" issue before: > > https://lore.kernel.org/linux-xfs/20210428065152.77280-1-chandanrlinux@gmail.com/ > > and we were closing in on a practical solution before it went silent. > > I'm not sure if there's a different fix we can apply here - maybe > free one extent per transaction instead of all the extents in an EFI > in one transaction and relog the EFD at the end of each extent free > transaction roll? > Consider the case of executing a truncate operation which involves freeing two file extents on a filesystem which has refcount feature enabled. xfs_refcount_decrease_extent() will be invoked twice and hence XFS_DEFER_OPS_TYPE_REFCOUNT will have two "struct xfs_refcount_intent" associated with it. Processing each of the "struct xfs_refcount_intent" can cause two refcount btree blocks to be freed: - A high level transacation will invoke xfs_refcountbt_free_block() twice. - The first invocation adds an extent entry to the transaction's busy extent list. The second invocation can find the previously freed busy extent and hence wait indefinitely for the busy extent to be flushed. Also, processing a single "struct xfs_refcount_intent" can require the leaf block and its immediate parent block to be freed. The leaf block is added to the transaction's busy list. Freeing the parent block can result in the task waiting for the busy extent (present in the high level transaction) to be flushed. Hence, IMHO this approach is most likely not a feasible solution. >> commit 06058bc40534530e617e5623775c53bb24f032cb disallowed using busy extents >> for any path that calls xfs_extent_busy_trim(). That looks over-killing. >> For AGFL block allocation, it just use the first extent that satisfies, it won't >> try another extent for choose a "better" one. So it's safe to reuse busy extent >> for AGFL. > > AGFL block allocation is not "for immediate use". The blocks get > placed on the AGFL for -later- use, and not necessarily even within > the current transaction. Hence a freelist block is still considered > free space, not as used space. The difference is that we assume AGFL > blocks can always be used immediately and they aren't constrained by > being busy or have pending discards. > > Also, we have to keep in mind that we can allocate data blocks from > the AGFL in low space situations. Hence it is not safe to place busy > or discard-pending blocks on the AGFL, as this can result in them > being allocated for user data and overwritten before the checkpoint > that marked them busy has been committed to the journal.... > > As such, I don't think it is be safe to ignore busy extent state > just because we are filling the AGFL from the current free space > tree. > > Cheers, > > Dave. -- chandan ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] xfs: fix AGFL allocation dead lock 2023-04-12 8:23 ` Chandan Babu R @ 2023-04-12 23:59 ` Dave Chinner 2023-04-13 10:16 ` Chandan Babu R 0 siblings, 1 reply; 9+ messages in thread From: Dave Chinner @ 2023-04-12 23:59 UTC (permalink / raw) To: Chandan Babu R; +Cc: Wengang Wang, linux-xfs On Wed, Apr 12, 2023 at 01:53:59PM +0530, Chandan Babu R wrote: > On Tue, Apr 11, 2023 at 12:06:24 PM +1000, Dave Chinner wrote: > > On Thu, Mar 30, 2023 at 01:46:10PM -0700, Wengang Wang wrote: > >> There is deadlock with calltrace on process 10133: > >> > >> PID 10133 not sceduled for 4403385ms (was on CPU[10]) > >> #0 context_switch() kernel/sched/core.c:3881 > >> #1 __schedule() kernel/sched/core.c:5111 > >> #2 schedule() kernel/sched/core.c:5186 > >> #3 xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598 > >> #4 xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641 > >> #5 xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828 > >> #6 xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362 > >> #7 xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029 > >> #8 __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067 > >> #9 xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370 > >> #10 xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626 > >> #11 xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605 > >> #12 xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893 > >> #13 xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824 > >> #14 xfs_log_mount_finish() fs/xfs/xfs_log.c:764 > >> #15 xfs_mountfs() fs/xfs/xfs_mount.c:978 > >> #16 xfs_fs_fill_super() fs/xfs/xfs_super.c:1908 > >> #17 mount_bdev() fs/super.c:1417 > >> #18 xfs_fs_mount() fs/xfs/xfs_super.c:1985 > >> #19 legacy_get_tree() fs/fs_context.c:647 > >> #20 vfs_get_tree() fs/super.c:1547 > >> #21 do_new_mount() fs/namespace.c:2843 > >> #22 do_mount() fs/namespace.c:3163 > >> #23 ksys_mount() fs/namespace.c:3372 > >> #24 __do_sys_mount() fs/namespace.c:3386 > >> #25 __se_sys_mount() fs/namespace.c:3383 > >> #26 __x64_sys_mount() fs/namespace.c:3383 > >> #27 do_syscall_64() arch/x86/entry/common.c:296 > >> #28 entry_SYSCALL_64() arch/x86/entry/entry_64.S:180 > >> > >> It's waiting xfs_perag.pagb_gen to increase (busy extent clearing happen). > >> From the vmcore, it's waiting on AG 1. And the ONLY busy extent for AG 1 is > >> with the transaction (in xfs_trans.t_busy) for process 10133. That busy extent > >> is created in a previous EFI with the same transaction. Process 10133 is > >> waiting, it has no change to commit that that transaction. So busy extent > >> clearing can't happen and pagb_gen remain unchanged. So dead lock formed. > > > > We've talked about this "busy extent in transaction" issue before: > > > > https://lore.kernel.org/linux-xfs/20210428065152.77280-1-chandanrlinux@gmail.com/ > > > > and we were closing in on a practical solution before it went silent. > > > > I'm not sure if there's a different fix we can apply here - maybe > > free one extent per transaction instead of all the extents in an EFI > > in one transaction and relog the EFD at the end of each extent free > > transaction roll? > > > > Consider the case of executing a truncate operation which involves freeing two > file extents on a filesystem which has refcount feature enabled. > > xfs_refcount_decrease_extent() will be invoked twice and hence > XFS_DEFER_OPS_TYPE_REFCOUNT will have two "struct xfs_refcount_intent" > associated with it. Yes, that's exactly the same issue as processing multiple extents in a single EFI intent. The same solution applies. The design the intent/intent done infrastructure allows intents to be logged repeatedly to indicate ongoing partial completion of the original intent. The runtime deferred work completion does this (see xfs_defer_finish_one() and handling the -EAGAIN return value). Indeed, we have a requeue mechanism in xfs_cui_item_recover() - look at what 'requeue_only' does. It stops processing the extents in the intent and instead relogs them as deferred operations to be processed in a later transaction context. IOWs, we can avoid processing multiple extent frees in a single transaction with just a small logic change to the recovery code to ensure it always requeues after the first extent is processed. Looking at recovery of intents: BUIs only ever have one extent in them, so no change needed there. CUIs can be requeued already, just need to update the loop. RUIs will process all extents in a single transaction, so it needs to be updated to relog and defer after the first extent is freed. EFIs will process all extents in a single transaction, so it needs to be updated to relog and defer after the first extent is freed. ATTRI already do the right thing w.r.t. deferring updates to new transaction contexts. No change needed. IOWs, the number of extents in a given intent is largely irrelevant as we can process them one extent at a time in recovery if we actually need to just by relogging new intents as deferred work. > Processing each of the "struct xfs_refcount_intent" can cause two refcount > btree blocks to be freed: > - A high level transacation will invoke xfs_refcountbt_free_block() twice. > - The first invocation adds an extent entry to the transaction's busy extent > list. The second invocation can find the previously freed busy extent and > hence wait indefinitely for the busy extent to be flushed. > > Also, processing a single "struct xfs_refcount_intent" can require the leaf > block and its immediate parent block to be freed. The leaf block is added to > the transaction's busy list. Freeing the parent block can result in the task > waiting for the busy extent (present in the high level transaction) to be > flushed. Yes, it probably can, but this is a different problem - this is an internal btree update issue, not a "free multiple user extents in a single transaction that may only have a reservation large enough for a single user extent modification". So, lets think about why the allocator might try to reuse a busy extent on the next extent internal btree free operation in the transaction. The only way that I can see that happening is if the AGFL needs refilling, and the only way the allocator should get stuck in this way is if there are no other free extents in the AG. It then follows that if there are no other free extents in the AG, then we don't need to refill the AGFL, because freeing an extent in an empty AG will never cause the free space btrees to grow. In which case, we should not ever need to allocate from an extent that was previously freed in this specific transaction. We should also have XFS_ALLOC_FLAG_FREEING set, and this allows the AGFL refill to abort without error if there are no free blocks available because it's not necessary in this case. If we can't find a non-busy extent after flushing on an AGFL fill for a XFS_ALLOC_FLAG_FREEING operation, we should just abort the freelist refill and allow the extent free operation to continue. Hence a second free operation in a single transaction in the same AG (i.e. pretty much any multi-level btree merge operation) should always succeed at ENOSPC regardless of how many busy extents there are in the current transaction - we should never need to refill the AGFL for these operations when the only free space in the AG is space that has been freed by the transaction doing the freeing.... -Dave. > >> commit 06058bc40534530e617e5623775c53bb24f032cb disallowed using busy extents > >> for any path that calls xfs_extent_busy_trim(). That looks over-killing. > >> For AGFL block allocation, it just use the first extent that satisfies, it won't > >> try another extent for choose a "better" one. So it's safe to reuse busy extent > >> for AGFL. > > > > AGFL block allocation is not "for immediate use". The blocks get > > placed on the AGFL for -later- use, and not necessarily even within > > the current transaction. Hence a freelist block is still considered > > free space, not as used space. The difference is that we assume AGFL > > blocks can always be used immediately and they aren't constrained by > > being busy or have pending discards. > > > > Also, we have to keep in mind that we can allocate data blocks from > > the AGFL in low space situations. Hence it is not safe to place busy > > or discard-pending blocks on the AGFL, as this can result in them > > being allocated for user data and overwritten before the checkpoint > > that marked them busy has been committed to the journal.... > > > > As such, I don't think it is be safe to ignore busy extent state > > just because we are filling the AGFL from the current free space > > tree. > > > > Cheers, > > > > Dave. > > -- > chandan > -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] xfs: fix AGFL allocation dead lock 2023-04-12 23:59 ` Dave Chinner @ 2023-04-13 10:16 ` Chandan Babu R 2023-04-13 21:38 ` Dave Chinner 0 siblings, 1 reply; 9+ messages in thread From: Chandan Babu R @ 2023-04-13 10:16 UTC (permalink / raw) To: Dave Chinner; +Cc: Wengang Wang, linux-xfs On Thu, Apr 13, 2023 at 09:59:15 AM +1000, Dave Chinner wrote: > On Wed, Apr 12, 2023 at 01:53:59PM +0530, Chandan Babu R wrote: >> On Tue, Apr 11, 2023 at 12:06:24 PM +1000, Dave Chinner wrote: >> > On Thu, Mar 30, 2023 at 01:46:10PM -0700, Wengang Wang wrote: >> >> There is deadlock with calltrace on process 10133: >> >> >> >> PID 10133 not sceduled for 4403385ms (was on CPU[10]) >> >> #0 context_switch() kernel/sched/core.c:3881 >> >> #1 __schedule() kernel/sched/core.c:5111 >> >> #2 schedule() kernel/sched/core.c:5186 >> >> #3 xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598 >> >> #4 xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641 >> >> #5 xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828 >> >> #6 xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362 >> >> #7 xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029 >> >> #8 __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067 >> >> #9 xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370 >> >> #10 xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626 >> >> #11 xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605 >> >> #12 xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893 >> >> #13 xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824 >> >> #14 xfs_log_mount_finish() fs/xfs/xfs_log.c:764 >> >> #15 xfs_mountfs() fs/xfs/xfs_mount.c:978 >> >> #16 xfs_fs_fill_super() fs/xfs/xfs_super.c:1908 >> >> #17 mount_bdev() fs/super.c:1417 >> >> #18 xfs_fs_mount() fs/xfs/xfs_super.c:1985 >> >> #19 legacy_get_tree() fs/fs_context.c:647 >> >> #20 vfs_get_tree() fs/super.c:1547 >> >> #21 do_new_mount() fs/namespace.c:2843 >> >> #22 do_mount() fs/namespace.c:3163 >> >> #23 ksys_mount() fs/namespace.c:3372 >> >> #24 __do_sys_mount() fs/namespace.c:3386 >> >> #25 __se_sys_mount() fs/namespace.c:3383 >> >> #26 __x64_sys_mount() fs/namespace.c:3383 >> >> #27 do_syscall_64() arch/x86/entry/common.c:296 >> >> #28 entry_SYSCALL_64() arch/x86/entry/entry_64.S:180 >> >> >> >> It's waiting xfs_perag.pagb_gen to increase (busy extent clearing happen). >> >> From the vmcore, it's waiting on AG 1. And the ONLY busy extent for AG 1 is >> >> with the transaction (in xfs_trans.t_busy) for process 10133. That busy extent >> >> is created in a previous EFI with the same transaction. Process 10133 is >> >> waiting, it has no change to commit that that transaction. So busy extent >> >> clearing can't happen and pagb_gen remain unchanged. So dead lock formed. >> > >> > We've talked about this "busy extent in transaction" issue before: >> > >> > https://lore.kernel.org/linux-xfs/20210428065152.77280-1-chandanrlinux@gmail.com/ >> > >> > and we were closing in on a practical solution before it went silent. >> > >> > I'm not sure if there's a different fix we can apply here - maybe >> > free one extent per transaction instead of all the extents in an EFI >> > in one transaction and relog the EFD at the end of each extent free >> > transaction roll? >> > >> >> Consider the case of executing a truncate operation which involves freeing two >> file extents on a filesystem which has refcount feature enabled. >> >> xfs_refcount_decrease_extent() will be invoked twice and hence >> XFS_DEFER_OPS_TYPE_REFCOUNT will have two "struct xfs_refcount_intent" >> associated with it. > > Yes, that's exactly the same issue as processing multiple extents > in a single EFI intent. > > The same solution applies. > > The design the intent/intent done infrastructure allows intents to > be logged repeatedly to indicate ongoing partial completion of the > original intent. The runtime deferred work completion does this (see > xfs_defer_finish_one() and handling the -EAGAIN return value). > > Indeed, we have a requeue mechanism in xfs_cui_item_recover() - > look at what 'requeue_only' does. It stops processing the extents in > the intent and instead relogs them as deferred operations to be > processed in a later transaction context. IOWs, we can avoid > processing multiple extent frees in a single transaction with just a > small logic change to the recovery code to ensure it always requeues > after the first extent is processed. > > Looking at recovery of intents: > > BUIs only ever have one extent in them, so no change needed there. > CUIs can be requeued already, just need to update the loop. > > RUIs will process all extents in a single transaction, so it needs > to be updated to relog and defer after the first extent is freed. > > EFIs will process all extents in a single transaction, so it needs > to be updated to relog and defer after the first extent is freed. > > ATTRI already do the right thing w.r.t. deferring updates to new > transaction contexts. No change needed. > > IOWs, the number of extents in a given intent is largely irrelevant > as we can process them one extent at a time in recovery if we > actually need to just by relogging new intents as deferred work. > >> Processing each of the "struct xfs_refcount_intent" can cause two refcount >> btree blocks to be freed: >> - A high level transacation will invoke xfs_refcountbt_free_block() twice. >> - The first invocation adds an extent entry to the transaction's busy extent >> list. The second invocation can find the previously freed busy extent and >> hence wait indefinitely for the busy extent to be flushed. >> >> Also, processing a single "struct xfs_refcount_intent" can require the leaf >> block and its immediate parent block to be freed. The leaf block is added to >> the transaction's busy list. Freeing the parent block can result in the task >> waiting for the busy extent (present in the high level transaction) to be >> flushed. > > Yes, it probably can, but this is a different problem - this is an > internal btree update issue, not a "free multiple user extents in a > single transaction that may only have a reservation large enough > for a single user extent modification". > > So, lets think about why the allocator might try to reuse a busy > extent on the next extent internal btree free operation in the > transaction. The only way that I can see that happening is if the > AGFL needs refilling, and the only way the allocator should get > stuck in this way is if there are no other free extents in the AG. If the first extent that was freed by the transaction (and hence also marked busy) happens to be the first among several other non-busy free extents found during AGFL allocation, the task will get blocked waiting for the busy extent flush to complete. However, this can be solved if xfs_alloc_ag_vextent_size() is modified to traverse the rest of the free space btree to find other non-busy extents. Busy extents can be flushed only as a last resort when non-busy extents cannot be found. > > It then follows that if there are no other free extents in the AG, > then we don't need to refill the AGFL, because freeing an extent in > an empty AG will never cause the free space btrees to grow. In which > case, we should not ever need to allocate from an extent that was > previously freed in this specific transaction. > > We should also have XFS_ALLOC_FLAG_FREEING set, and this allows the > AGFL refill to abort without error if there are no free blocks > available because it's not necessary in this case. If we can't find > a non-busy extent after flushing on an AGFL fill for a > XFS_ALLOC_FLAG_FREEING operation, we should just abort the freelist > refill and allow the extent free operation to continue. > I tried in vain to figure out a correct way to perform non-blocking busy extent flush. If it involves using a timeout mechanism, then I don't know as to what constitues a good timeout value. Please let me know if you have any suggestions to this end. > Hence a second free operation in a single transaction in the same AG > (i.e. pretty much any multi-level btree merge operation) should > always succeed at ENOSPC regardless of how many busy extents there > are in the current transaction - we should never need to refill the > AGFL for these operations when the only free space in the AG is > space that has been freed by the transaction doing the freeing.... -- chandan ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] xfs: fix AGFL allocation dead lock 2023-04-13 10:16 ` Chandan Babu R @ 2023-04-13 21:38 ` Dave Chinner 2023-04-14 8:44 ` Chandan Babu R 0 siblings, 1 reply; 9+ messages in thread From: Dave Chinner @ 2023-04-13 21:38 UTC (permalink / raw) To: Chandan Babu R; +Cc: Wengang Wang, linux-xfs On Thu, Apr 13, 2023 at 03:46:38PM +0530, Chandan Babu R wrote: > On Thu, Apr 13, 2023 at 09:59:15 AM +1000, Dave Chinner wrote: > > On Wed, Apr 12, 2023 at 01:53:59PM +0530, Chandan Babu R wrote: > >> Processing each of the "struct xfs_refcount_intent" can cause two refcount > >> btree blocks to be freed: > >> - A high level transacation will invoke xfs_refcountbt_free_block() twice. > >> - The first invocation adds an extent entry to the transaction's busy extent > >> list. The second invocation can find the previously freed busy extent and > >> hence wait indefinitely for the busy extent to be flushed. > >> > >> Also, processing a single "struct xfs_refcount_intent" can require the leaf > >> block and its immediate parent block to be freed. The leaf block is added to > >> the transaction's busy list. Freeing the parent block can result in the task > >> waiting for the busy extent (present in the high level transaction) to be > >> flushed. > > > > Yes, it probably can, but this is a different problem - this is an > > internal btree update issue, not a "free multiple user extents in a > > single transaction that may only have a reservation large enough > > for a single user extent modification". > > > > So, lets think about why the allocator might try to reuse a busy > > extent on the next extent internal btree free operation in the > > transaction. The only way that I can see that happening is if the > > AGFL needs refilling, and the only way the allocator should get > > stuck in this way is if there are no other free extents in the AG. > > If the first extent that was freed by the transaction (and hence also marked > busy) happens to be the first among several other non-busy free extents found > during AGFL allocation, the task will get blocked waiting for the busy extent > flush to complete. However, this can be solved if xfs_alloc_ag_vextent_size() > is modified to traverse the rest of the free space btree to find other > non-busy extents. Busy extents can be flushed only as a last resort when > non-busy extents cannot be found. Yes, exactly my point: Don't block on busy extents if there are other free space candidates available to can be allocated without blocking. > > > > > It then follows that if there are no other free extents in the AG, > > then we don't need to refill the AGFL, because freeing an extent in > > an empty AG will never cause the free space btrees to grow. In which > > case, we should not ever need to allocate from an extent that was > > previously freed in this specific transaction. > > > > We should also have XFS_ALLOC_FLAG_FREEING set, and this allows the > > AGFL refill to abort without error if there are no free blocks > > available because it's not necessary in this case. If we can't find > > a non-busy extent after flushing on an AGFL fill for a > > XFS_ALLOC_FLAG_FREEING operation, we should just abort the freelist > > refill and allow the extent free operation to continue. > > I tried in vain to figure out a correct way to perform non-blocking busy > extent flush. If it involves using a timeout mechanism, then I don't know as > to what constitues a good timeout value. Please let me know if you have any > suggestions to this end. Why would a non-blocking busy extent flush sleep? By definition, "non blocking" means "does not context switch away from the current task". IOWs, a non-blocking busy extent flush is effectively just log force operation. We may need to sample the generation number to determine if progress has been made next time we get back to the "all extents are busy so flush the busy extents" logic again, but otherwise I think all we need do here is elide the "wait for generation number to change" logic. Then, if we've retried the allocation after a log force, and we still haven't made progress and we are doing an AGFL allocation in XFS_ALLOC_FLAG_FREEING context, then we can simply abort the allocation attempt at this point. i.e. if we don't have extents we can allocate immediately, don't bother waiting for busy extents to be resolved... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] xfs: fix AGFL allocation dead lock 2023-04-13 21:38 ` Dave Chinner @ 2023-04-14 8:44 ` Chandan Babu R 2023-04-14 23:15 ` Wengang Wang 0 siblings, 1 reply; 9+ messages in thread From: Chandan Babu R @ 2023-04-14 8:44 UTC (permalink / raw) To: Dave Chinner; +Cc: Wengang Wang, linux-xfs On Fri, Apr 14, 2023 at 07:38:57 AM +1000, Dave Chinner wrote: > On Thu, Apr 13, 2023 at 03:46:38PM +0530, Chandan Babu R wrote: >> On Thu, Apr 13, 2023 at 09:59:15 AM +1000, Dave Chinner wrote: >> > On Wed, Apr 12, 2023 at 01:53:59PM +0530, Chandan Babu R wrote: >> >> Processing each of the "struct xfs_refcount_intent" can cause two >> >> refcount >> >> btree blocks to be freed: >> >> - A high level transacation will invoke >> >> xfs_refcountbt_free_block() twice. >> >> - The first invocation adds an extent entry to the transaction's >> >> busy extent >> >> list. The second invocation can find the previously freed busy >> >> extent and >> >> hence wait indefinitely for the busy extent to be flushed. >> >> >> >> Also, processing a single "struct xfs_refcount_intent" can >> >> require the leaf >> >> block and its immediate parent block to be freed. The leaf block >> >> is added to >> >> the transaction's busy list. Freeing the parent block can result >> >> in the task >> >> waiting for the busy extent (present in the high level transaction) to be >> >> flushed. >> > >> > Yes, it probably can, but this is a different problem - this is an >> > internal btree update issue, not a "free multiple user extents in a >> > single transaction that may only have a reservation large enough >> > for a single user extent modification". >> > >> > So, lets think about why the allocator might try to reuse a busy >> > extent on the next extent internal btree free operation in the >> > transaction. The only way that I can see that happening is if the >> > AGFL needs refilling, and the only way the allocator should get >> > stuck in this way is if there are no other free extents in the AG. >> >> If the first extent that was freed by the transaction (and hence also marked >> busy) happens to be the first among several other non-busy free >> extents found >> during AGFL allocation, the task will get blocked waiting for the >> busy extent >> flush to complete. However, this can be solved if >> xfs_alloc_ag_vextent_size() >> is modified to traverse the rest of the free space btree to find other >> non-busy extents. Busy extents can be flushed only as a last resort when >> non-busy extents cannot be found. > > Yes, exactly my point: Don't block on busy extents if there are > other free space candidates available to can be allocated without > blocking. > >> >> > >> > It then follows that if there are no other free extents in the AG, >> > then we don't need to refill the AGFL, because freeing an extent in >> > an empty AG will never cause the free space btrees to grow. In which >> > case, we should not ever need to allocate from an extent that was >> > previously freed in this specific transaction. >> > >> > We should also have XFS_ALLOC_FLAG_FREEING set, and this allows the >> > AGFL refill to abort without error if there are no free blocks >> > available because it's not necessary in this case. If we can't find >> > a non-busy extent after flushing on an AGFL fill for a >> > XFS_ALLOC_FLAG_FREEING operation, we should just abort the freelist >> > refill and allow the extent free operation to continue. >> >> I tried in vain to figure out a correct way to perform non-blocking busy >> extent flush. If it involves using a timeout mechanism, then I don't know as >> to what constitues a good timeout value. Please let me know if you have any >> suggestions to this end. > > Why would a non-blocking busy extent flush sleep? By definition, > "non blocking" means "does not context switch away from the current > task". > > IOWs, a non-blocking busy extent flush is effectively just log force > operation. We may need to sample the generation number to determine > if progress has been made next time we get back to the "all extents > are busy so flush the busy extents" logic again, but otherwise I > think all we need do here is elide the "wait for generation number > to change" logic. > > Then, if we've retried the allocation after a log force, and we > still haven't made progress and we are doing an AGFL allocation in > XFS_ALLOC_FLAG_FREEING context, then we can simply abort the > allocation attempt at this point. i.e. if we don't have extents we > can allocate immediately, don't bother waiting for busy extents to > be resolved... The corresponding pseudocode for allocating blocks to AGFL during an extent free operation would be the following, 1. Search all extents in the CNTBT for non-busy extents. 2. Return if extent is found; Otherwise, 3. Issue a log force. 4. Retry non-busy free extent allocation. 5. Return back regardless of whether non-busy extents were found or not. I think I may have found one scenario where the above pseudocode would break. Assume that an AG has very few blocks of free space left and that these blocks fit within the root node of CNTBT. Assume that we have a large file which is backed up by contiguous extents belonging to the AG and that these extents have a refcount of 1. fpunch operation is issued on alternating blocks of this large file and the punched out extents are added to, 1. Per-ag RB tree that tracks busy extents 2. CNTBT causing the CNTBT to grow. Step 2 can cause consumption of blocks in AGFL and we can end up with a CNTBT having two fully filled leaf blocks and one root node. At this instance, a fpunch operation on a file extent having a refcount of 2 can cause the leaf node of the refcount btree to be freed. The free operation now tries to fill the AGFL. However, it only finds busy extents and issues a log force. The operation can fail to find any non-busy free extents because the busy extents probably have not yet been discarded. The free extent operation fails to fill up the AGFL. Later, adding the free extent (refcount btree's leaf node) to the CNTBT can require the CNTBT to be expanded (E.g. Adding a new leaf node). However, this fails since AGFL doesn't have any free blocks left. -- chandan ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] xfs: fix AGFL allocation dead lock 2023-04-14 8:44 ` Chandan Babu R @ 2023-04-14 23:15 ` Wengang Wang 0 siblings, 0 replies; 9+ messages in thread From: Wengang Wang @ 2023-04-14 23:15 UTC (permalink / raw) To: Chandan Babu; +Cc: Dave Chinner, linux-xfs@vger.kernel.org Maybe duplicated reply (I was told that email included HTMP part and rejected). I just sent a two-patch set making one extent per EFI to solve this problem, can you please see if that’s good? thanks, wengang ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2023-04-14 23:16 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-03-30 20:46 [PATCH] xfs: fix AGFL allocation dead lock Wengang Wang 2023-04-05 15:26 ` Wengang Wang 2023-04-11 2:06 ` Dave Chinner 2023-04-12 8:23 ` Chandan Babu R 2023-04-12 23:59 ` Dave Chinner 2023-04-13 10:16 ` Chandan Babu R 2023-04-13 21:38 ` Dave Chinner 2023-04-14 8:44 ` Chandan Babu R 2023-04-14 23:15 ` Wengang Wang
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox