* [PATCH 0/2] xfs: one extent per EFI @ 2023-04-14 22:58 Wengang Wang 2023-04-14 22:58 ` [PATCH 1/2] xfs: IO time " Wengang Wang 2023-04-14 22:58 ` [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents Wengang Wang 0 siblings, 2 replies; 16+ messages in thread From: Wengang Wang @ 2023-04-14 22:58 UTC (permalink / raw) To: linux-xfs; +Cc: wen.gang.wang We are hitting the deadlock described in patch 1. This patchset doesn't want to disturb the existing block allocation routine, that would make the allocation routime even complex. Instead, this patch avoids doing AGFL block allocation holding busy extents in current memory transaction. Patch 1 fixes the IO path and Patch 2 takes care of log recovery. Wengang Wang (2): xfs: IO time one extent per EFI xfs: log recovery stage split EFIs with multiple extents fs/xfs/xfs_extfree_item.c | 104 ++++++++++++++++++++++++++++++++++---- fs/xfs/xfs_extfree_item.h | 9 +++- 2 files changed, 101 insertions(+), 12 deletions(-) -- 2.21.0 (Apple Git-122.2) ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 1/2] xfs: IO time one extent per EFI 2023-04-14 22:58 [PATCH 0/2] xfs: one extent per EFI Wengang Wang @ 2023-04-14 22:58 ` Wengang Wang 2023-04-19 23:55 ` Dave Chinner 2023-04-14 22:58 ` [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents Wengang Wang 1 sibling, 1 reply; 16+ messages in thread From: Wengang Wang @ 2023-04-14 22:58 UTC (permalink / raw) To: linux-xfs; +Cc: wen.gang.wang At IO time, make sure an EFI contains only one extent. Transaction rolling in xfs_defer_finish() would commit the busy blocks for previous EFIs. By that we avoid holding busy extents (for previously extents in the same EFI) in current transaction when allocating blocks for AGFL where we could be otherwise stuck waiting the busy extents held by current transaction to be flushed (thus a deadlock). The log changes 1) before change: 358 rbbn 13 rec_lsn: 1,12 Oper 5: tid: ee327fd2 len: 48 flags: None 359 EFI nextents:2 id:ffff9fef708ba940 360 EFI id=ffff9fef708ba940 (0x21, 7) 361 EFI id=ffff9fef708ba940 (0x18, 8) 362 ----------------------------------------------------------------- 363 rbbn 13 rec_lsn: 1,12 Oper 6: tid: ee327fd2 len: 48 flags: None 364 EFD nextents:2 id:ffff9fef708ba940 365 EFD id=ffff9fef708ba940 (0x21, 7) 366 EFD id=ffff9fef708ba940 (0x18, 8) 2) after change: 830 rbbn 31 rec_lsn: 1,30 Oper 5: tid: 319f015f len: 32 flags: None 831 EFI nextents:1 id:ffff9fef708b9b80 832 EFI id=ffff9fef708b9b80 (0x21, 7) 833 ----------------------------------------------------------------- 834 rbbn 31 rec_lsn: 1,30 Oper 6: tid: 319f015f len: 32 flags: None 835 EFI nextents:1 id:ffff9fef708b9d38 836 EFI id=ffff9fef708b9d38 (0x18, 8) 837 ----------------------------------------------------------------- 838 rbbn 31 rec_lsn: 1,30 Oper 7: tid: 319f015f len: 32 flags: None 839 EFD nextents:1 id:ffff9fef708b9b80 840 EFD id=ffff9fef708b9b80 (0x21, 7) 841 ----------------------------------------------------------------- 842 rbbn 31 rec_lsn: 1,30 Oper 8: tid: 319f015f len: 32 flags: None 843 EFD nextents:1 id:ffff9fef708b9d38 844 EFD id=ffff9fef708b9d38 (0x18, 8) Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> --- fs/xfs/xfs_extfree_item.h | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/fs/xfs/xfs_extfree_item.h b/fs/xfs/xfs_extfree_item.h index da6a5afa607c..ae84d77eaf30 100644 --- a/fs/xfs/xfs_extfree_item.h +++ b/fs/xfs/xfs_extfree_item.h @@ -13,8 +13,15 @@ struct kmem_cache; /* * Max number of extents in fast allocation path. + * + * At IO time, make sure an EFI contains only one extent. Transaction rolling + * in xfs_defer_finish() would commit the busy blocks for previous EFIs. By + * that we avoid holding busy extents (for previously extents in the same EFI) + * in current transaction when allocating blocks for AGFL where we could be + * otherwise stuck waiting the busy extents held by current transaction to be + * flushed (thus a deadlock). */ -#define XFS_EFI_MAX_FAST_EXTENTS 16 +#define XFS_EFI_MAX_FAST_EXTENTS 1 /* * This is the "extent free intention" log item. It is used to log the fact -- 2.21.0 (Apple Git-122.2) ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: IO time one extent per EFI 2023-04-14 22:58 ` [PATCH 1/2] xfs: IO time " Wengang Wang @ 2023-04-19 23:55 ` Dave Chinner 2023-04-20 17:31 ` Wengang Wang 0 siblings, 1 reply; 16+ messages in thread From: Dave Chinner @ 2023-04-19 23:55 UTC (permalink / raw) To: Wengang Wang; +Cc: linux-xfs On Fri, Apr 14, 2023 at 03:58:35PM -0700, Wengang Wang wrote: > At IO time, make sure an EFI contains only one extent. Transaction rolling in > xfs_defer_finish() would commit the busy blocks for previous EFIs. By that we > avoid holding busy extents (for previously extents in the same EFI) in current > transaction when allocating blocks for AGFL where we could be otherwise stuck > waiting the busy extents held by current transaction to be flushed (thus a > deadlock). > > The log changes > 1) before change: > > 358 rbbn 13 rec_lsn: 1,12 Oper 5: tid: ee327fd2 len: 48 flags: None > 359 EFI nextents:2 id:ffff9fef708ba940 > 360 EFI id=ffff9fef708ba940 (0x21, 7) > 361 EFI id=ffff9fef708ba940 (0x18, 8) > 362 ----------------------------------------------------------------- > 363 rbbn 13 rec_lsn: 1,12 Oper 6: tid: ee327fd2 len: 48 flags: None > 364 EFD nextents:2 id:ffff9fef708ba940 > 365 EFD id=ffff9fef708ba940 (0x21, 7) > 366 EFD id=ffff9fef708ba940 (0x18, 8) > > 2) after change: > > 830 rbbn 31 rec_lsn: 1,30 Oper 5: tid: 319f015f len: 32 flags: None > 831 EFI nextents:1 id:ffff9fef708b9b80 > 832 EFI id=ffff9fef708b9b80 (0x21, 7) > 833 ----------------------------------------------------------------- > 834 rbbn 31 rec_lsn: 1,30 Oper 6: tid: 319f015f len: 32 flags: None > 835 EFI nextents:1 id:ffff9fef708b9d38 > 836 EFI id=ffff9fef708b9d38 (0x18, 8) > 837 ----------------------------------------------------------------- > 838 rbbn 31 rec_lsn: 1,30 Oper 7: tid: 319f015f len: 32 flags: None > 839 EFD nextents:1 id:ffff9fef708b9b80 > 840 EFD id=ffff9fef708b9b80 (0x21, 7) > 841 ----------------------------------------------------------------- > 842 rbbn 31 rec_lsn: 1,30 Oper 8: tid: 319f015f len: 32 flags: None > 843 EFD nextents:1 id:ffff9fef708b9d38 > 844 EFD id=ffff9fef708b9d38 (0x18, 8) > > Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> > --- > fs/xfs/xfs_extfree_item.h | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_extfree_item.h b/fs/xfs/xfs_extfree_item.h > index da6a5afa607c..ae84d77eaf30 100644 > --- a/fs/xfs/xfs_extfree_item.h > +++ b/fs/xfs/xfs_extfree_item.h > @@ -13,8 +13,15 @@ struct kmem_cache; > > /* > * Max number of extents in fast allocation path. > + * > + * At IO time, make sure an EFI contains only one extent. Transaction rolling > + * in xfs_defer_finish() would commit the busy blocks for previous EFIs. By > + * that we avoid holding busy extents (for previously extents in the same EFI) > + * in current transaction when allocating blocks for AGFL where we could be > + * otherwise stuck waiting the busy extents held by current transaction to be > + * flushed (thus a deadlock). > */ > -#define XFS_EFI_MAX_FAST_EXTENTS 16 > +#define XFS_EFI_MAX_FAST_EXTENTS 1 IIRC, this doesn't have anything to do with the number of extents an EFI can hold. All it does is control how the memory for the EFI allocated. Oh, at some point it got overloaded code to define the max items in a defer ops work item. Ok, I now see why you changed this, but I don't think this is right way to solve the problem. We can handle processing multiple extents per EFI just fine, we just need to update the EFD and roll the transaction on each extent we process, yes? In hindsight, the use of XFS_EFI_MAX_FAST_EXTENTS to limit intent size is pretty awful. I also see the same pattern copied in every other intent. Darrick, if the deferops code has been limiting the number of extents in a work item to this value, when will we ever see an intent with more extents that .max_items in it? And if that is the case, then why wouldn't we consider an intent with more extents than we support in log recovery a corruption event? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: IO time one extent per EFI 2023-04-19 23:55 ` Dave Chinner @ 2023-04-20 17:31 ` Wengang Wang 2023-04-20 23:22 ` Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Wengang Wang @ 2023-04-20 17:31 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs@vger.kernel.org > On Apr 19, 2023, at 4:55 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Apr 14, 2023 at 03:58:35PM -0700, Wengang Wang wrote: >> At IO time, make sure an EFI contains only one extent. Transaction rolling in >> xfs_defer_finish() would commit the busy blocks for previous EFIs. By that we >> avoid holding busy extents (for previously extents in the same EFI) in current >> transaction when allocating blocks for AGFL where we could be otherwise stuck >> waiting the busy extents held by current transaction to be flushed (thus a >> deadlock). >> >> The log changes >> 1) before change: >> >> 358 rbbn 13 rec_lsn: 1,12 Oper 5: tid: ee327fd2 len: 48 flags: None >> 359 EFI nextents:2 id:ffff9fef708ba940 >> 360 EFI id=ffff9fef708ba940 (0x21, 7) >> 361 EFI id=ffff9fef708ba940 (0x18, 8) >> 362 ----------------------------------------------------------------- >> 363 rbbn 13 rec_lsn: 1,12 Oper 6: tid: ee327fd2 len: 48 flags: None >> 364 EFD nextents:2 id:ffff9fef708ba940 >> 365 EFD id=ffff9fef708ba940 (0x21, 7) >> 366 EFD id=ffff9fef708ba940 (0x18, 8) >> >> 2) after change: >> >> 830 rbbn 31 rec_lsn: 1,30 Oper 5: tid: 319f015f len: 32 flags: None >> 831 EFI nextents:1 id:ffff9fef708b9b80 >> 832 EFI id=ffff9fef708b9b80 (0x21, 7) >> 833 ----------------------------------------------------------------- >> 834 rbbn 31 rec_lsn: 1,30 Oper 6: tid: 319f015f len: 32 flags: None >> 835 EFI nextents:1 id:ffff9fef708b9d38 >> 836 EFI id=ffff9fef708b9d38 (0x18, 8) >> 837 ----------------------------------------------------------------- >> 838 rbbn 31 rec_lsn: 1,30 Oper 7: tid: 319f015f len: 32 flags: None >> 839 EFD nextents:1 id:ffff9fef708b9b80 >> 840 EFD id=ffff9fef708b9b80 (0x21, 7) >> 841 ----------------------------------------------------------------- >> 842 rbbn 31 rec_lsn: 1,30 Oper 8: tid: 319f015f len: 32 flags: None >> 843 EFD nextents:1 id:ffff9fef708b9d38 >> 844 EFD id=ffff9fef708b9d38 (0x18, 8) >> >> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> >> --- >> fs/xfs/xfs_extfree_item.h | 9 ++++++++- >> 1 file changed, 8 insertions(+), 1 deletion(-) >> >> diff --git a/fs/xfs/xfs_extfree_item.h b/fs/xfs/xfs_extfree_item.h >> index da6a5afa607c..ae84d77eaf30 100644 >> --- a/fs/xfs/xfs_extfree_item.h >> +++ b/fs/xfs/xfs_extfree_item.h >> @@ -13,8 +13,15 @@ struct kmem_cache; >> >> /* >> * Max number of extents in fast allocation path. >> + * >> + * At IO time, make sure an EFI contains only one extent. Transaction rolling >> + * in xfs_defer_finish() would commit the busy blocks for previous EFIs. By >> + * that we avoid holding busy extents (for previously extents in the same EFI) >> + * in current transaction when allocating blocks for AGFL where we could be >> + * otherwise stuck waiting the busy extents held by current transaction to be >> + * flushed (thus a deadlock). >> */ >> -#define XFS_EFI_MAX_FAST_EXTENTS 16 >> +#define XFS_EFI_MAX_FAST_EXTENTS 1 > > IIRC, this doesn't have anything to do with the number of extents an > EFI can hold. All it does is control how the memory for the EFI > allocated. > Yes, it ensures that one EFI contains at most one extent. And because each deferred intent goes with one transaction roll, it would solve the AGFL allocation deadlock (because no busy extents held by the process when it is doing the AGFL allocation). And yes, this would requires more log space if the EFI/EFD pair doesn’t appear in same checkpoint. > Oh, at some point it got overloaded code to define the max items in > a defer ops work item. Ok, I now see why you changed this, but I > don't think this is right way to solve the problem. We can handle > processing multiple extents per EFI just fine, we just need to > update the EFD and roll the transaction on each extent we process, > yes? > I am not quite sure what does “update the EFD” mean. My original concern is that (without your updated EFD), the extents in original EFI can be partially done before a crush. And during the recovery, the already done extents would also be replayed and hit error (because the in-place metadata could be flushed since the transaction is rolled.). Now consider your “update the EFD”, you meant the following? EFI: ID: THISISID1 extent1 extent2 free extent extent1 EFD: ID: THISISID1 extent1 free extent extent2 another EFD: ID: THISISID1 (same ID as above) extent2 Do we currently support that? ( I am thinking NO). thanks, wengang > In hindsight, the use of XFS_EFI_MAX_FAST_EXTENTS to limit intent > size is pretty awful. I also see the same pattern copied in every > other intent. > > Darrick, if the deferops code has been limiting the number of > extents in a work item to this value, when will we ever see an > intent with more extents that .max_items in it? And if that is the > case, then why wouldn't we consider an intent with more extents than > we support in log recovery a corruption event? > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: IO time one extent per EFI 2023-04-20 17:31 ` Wengang Wang @ 2023-04-20 23:22 ` Dave Chinner 2023-04-21 0:24 ` Wengang Wang 0 siblings, 1 reply; 16+ messages in thread From: Dave Chinner @ 2023-04-20 23:22 UTC (permalink / raw) To: Wengang Wang; +Cc: linux-xfs@vger.kernel.org On Thu, Apr 20, 2023 at 05:31:14PM +0000, Wengang Wang wrote: > > > > On Apr 19, 2023, at 4:55 PM, Dave Chinner <david@fromorbit.com> wrote: > > > > On Fri, Apr 14, 2023 at 03:58:35PM -0700, Wengang Wang wrote: > >> At IO time, make sure an EFI contains only one extent. Transaction rolling in > >> xfs_defer_finish() would commit the busy blocks for previous EFIs. By that we > >> avoid holding busy extents (for previously extents in the same EFI) in current > >> transaction when allocating blocks for AGFL where we could be otherwise stuck > >> waiting the busy extents held by current transaction to be flushed (thus a > >> deadlock). > >> > >> The log changes > >> 1) before change: > >> > >> 358 rbbn 13 rec_lsn: 1,12 Oper 5: tid: ee327fd2 len: 48 flags: None > >> 359 EFI nextents:2 id:ffff9fef708ba940 > >> 360 EFI id=ffff9fef708ba940 (0x21, 7) > >> 361 EFI id=ffff9fef708ba940 (0x18, 8) > >> 362 ----------------------------------------------------------------- > >> 363 rbbn 13 rec_lsn: 1,12 Oper 6: tid: ee327fd2 len: 48 flags: None > >> 364 EFD nextents:2 id:ffff9fef708ba940 > >> 365 EFD id=ffff9fef708ba940 (0x21, 7) > >> 366 EFD id=ffff9fef708ba940 (0x18, 8) > >> > >> 2) after change: > >> > >> 830 rbbn 31 rec_lsn: 1,30 Oper 5: tid: 319f015f len: 32 flags: None > >> 831 EFI nextents:1 id:ffff9fef708b9b80 > >> 832 EFI id=ffff9fef708b9b80 (0x21, 7) > >> 833 ----------------------------------------------------------------- > >> 834 rbbn 31 rec_lsn: 1,30 Oper 6: tid: 319f015f len: 32 flags: None > >> 835 EFI nextents:1 id:ffff9fef708b9d38 > >> 836 EFI id=ffff9fef708b9d38 (0x18, 8) > >> 837 ----------------------------------------------------------------- > >> 838 rbbn 31 rec_lsn: 1,30 Oper 7: tid: 319f015f len: 32 flags: None > >> 839 EFD nextents:1 id:ffff9fef708b9b80 > >> 840 EFD id=ffff9fef708b9b80 (0x21, 7) > >> 841 ----------------------------------------------------------------- > >> 842 rbbn 31 rec_lsn: 1,30 Oper 8: tid: 319f015f len: 32 flags: None > >> 843 EFD nextents:1 id:ffff9fef708b9d38 > >> 844 EFD id=ffff9fef708b9d38 (0x18, 8) > >> > >> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> > >> --- > >> fs/xfs/xfs_extfree_item.h | 9 ++++++++- > >> 1 file changed, 8 insertions(+), 1 deletion(-) > >> > >> diff --git a/fs/xfs/xfs_extfree_item.h b/fs/xfs/xfs_extfree_item.h > >> index da6a5afa607c..ae84d77eaf30 100644 > >> --- a/fs/xfs/xfs_extfree_item.h > >> +++ b/fs/xfs/xfs_extfree_item.h > >> @@ -13,8 +13,15 @@ struct kmem_cache; > >> > >> /* > >> * Max number of extents in fast allocation path. > >> + * > >> + * At IO time, make sure an EFI contains only one extent. Transaction rolling > >> + * in xfs_defer_finish() would commit the busy blocks for previous EFIs. By > >> + * that we avoid holding busy extents (for previously extents in the same EFI) > >> + * in current transaction when allocating blocks for AGFL where we could be > >> + * otherwise stuck waiting the busy extents held by current transaction to be > >> + * flushed (thus a deadlock). > >> */ > >> -#define XFS_EFI_MAX_FAST_EXTENTS 16 > >> +#define XFS_EFI_MAX_FAST_EXTENTS 1 > > > > IIRC, this doesn't have anything to do with the number of extents an > > EFI can hold. All it does is control how the memory for the EFI > > allocated. > > Yes, it ensures that one EFI contains at most one extent. And because each > deferred intent goes with one transaction roll, it would solve the AGFL allocation > deadlock (because no busy extents held by the process when it is doing the > AGFL allocation). > > > Oh, at some point it got overloaded code to define the max items in > > a defer ops work item. Ok, I now see why you changed this, but I > > don't think this is right way to solve the problem. We can handle > > processing multiple extents per EFI just fine, we just need to > > update the EFD and roll the transaction on each extent we process, > > yes? > > > > I am not quite sure what does “update the EFD” mean. Historical terminology, see below. > My original concern is that (without your updated EFD), the extents in original EFI can be partially done before a crush. And during the recovery, the already done extents would also be replayed and hit error (because the in-place metadata could be flushed since the transaction is rolled.). > > Now consider your “update the EFD”, you meant the following? > > EFI: ID: THISISID1 extent1 extent2 > free extent extent1 > EFD: ID: THISISID1 extent1 > free extent extent2 > another EFD: ID: THISISID1 (same ID as above) extent2 Yes, that's pretty much how multi-extent EFIs used to work, except the second and subsequent EFDs recorded all the extents that had been freed. That way recovery could simply find the EFD with the highest LSN in the log to determine what part of the EFI had not been replayed. We don't do that anymore for partially processed multi-extent intents anymore. Instead, we use deferred ops to chain updates. i.e. we log a complete intent done items alongside a new intent containing the remaining work to be done in the same transaction. This cancels the original intent and atomically replaces it with a new intent containing the remaining work to be done. So when I say "update the EFD" I'm using historic terminology for processing and recovering multi-extent intents. In modern terms, what I mean is "update the deferred work intent chain to reflect the work remaining to be done". Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: IO time one extent per EFI 2023-04-20 23:22 ` Dave Chinner @ 2023-04-21 0:24 ` Wengang Wang 2023-04-21 9:34 ` Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Wengang Wang @ 2023-04-21 0:24 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs@vger.kernel.org > On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Apr 20, 2023 at 05:31:14PM +0000, Wengang Wang wrote: >> >> >>> On Apr 19, 2023, at 4:55 PM, Dave Chinner <david@fromorbit.com> wrote: >>> >>> On Fri, Apr 14, 2023 at 03:58:35PM -0700, Wengang Wang wrote: >>>> At IO time, make sure an EFI contains only one extent. Transaction rolling in >>>> xfs_defer_finish() would commit the busy blocks for previous EFIs. By that we >>>> avoid holding busy extents (for previously extents in the same EFI) in current >>>> transaction when allocating blocks for AGFL where we could be otherwise stuck >>>> waiting the busy extents held by current transaction to be flushed (thus a >>>> deadlock). >>>> >>>> The log changes >>>> 1) before change: >>>> >>>> 358 rbbn 13 rec_lsn: 1,12 Oper 5: tid: ee327fd2 len: 48 flags: None >>>> 359 EFI nextents:2 id:ffff9fef708ba940 >>>> 360 EFI id=ffff9fef708ba940 (0x21, 7) >>>> 361 EFI id=ffff9fef708ba940 (0x18, 8) >>>> 362 ----------------------------------------------------------------- >>>> 363 rbbn 13 rec_lsn: 1,12 Oper 6: tid: ee327fd2 len: 48 flags: None >>>> 364 EFD nextents:2 id:ffff9fef708ba940 >>>> 365 EFD id=ffff9fef708ba940 (0x21, 7) >>>> 366 EFD id=ffff9fef708ba940 (0x18, 8) >>>> >>>> 2) after change: >>>> >>>> 830 rbbn 31 rec_lsn: 1,30 Oper 5: tid: 319f015f len: 32 flags: None >>>> 831 EFI nextents:1 id:ffff9fef708b9b80 >>>> 832 EFI id=ffff9fef708b9b80 (0x21, 7) >>>> 833 ----------------------------------------------------------------- >>>> 834 rbbn 31 rec_lsn: 1,30 Oper 6: tid: 319f015f len: 32 flags: None >>>> 835 EFI nextents:1 id:ffff9fef708b9d38 >>>> 836 EFI id=ffff9fef708b9d38 (0x18, 8) >>>> 837 ----------------------------------------------------------------- >>>> 838 rbbn 31 rec_lsn: 1,30 Oper 7: tid: 319f015f len: 32 flags: None >>>> 839 EFD nextents:1 id:ffff9fef708b9b80 >>>> 840 EFD id=ffff9fef708b9b80 (0x21, 7) >>>> 841 ----------------------------------------------------------------- >>>> 842 rbbn 31 rec_lsn: 1,30 Oper 8: tid: 319f015f len: 32 flags: None >>>> 843 EFD nextents:1 id:ffff9fef708b9d38 >>>> 844 EFD id=ffff9fef708b9d38 (0x18, 8) >>>> >>>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> >>>> --- >>>> fs/xfs/xfs_extfree_item.h | 9 ++++++++- >>>> 1 file changed, 8 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/fs/xfs/xfs_extfree_item.h b/fs/xfs/xfs_extfree_item.h >>>> index da6a5afa607c..ae84d77eaf30 100644 >>>> --- a/fs/xfs/xfs_extfree_item.h >>>> +++ b/fs/xfs/xfs_extfree_item.h >>>> @@ -13,8 +13,15 @@ struct kmem_cache; >>>> >>>> /* >>>> * Max number of extents in fast allocation path. >>>> + * >>>> + * At IO time, make sure an EFI contains only one extent. Transaction rolling >>>> + * in xfs_defer_finish() would commit the busy blocks for previous EFIs. By >>>> + * that we avoid holding busy extents (for previously extents in the same EFI) >>>> + * in current transaction when allocating blocks for AGFL where we could be >>>> + * otherwise stuck waiting the busy extents held by current transaction to be >>>> + * flushed (thus a deadlock). >>>> */ >>>> -#define XFS_EFI_MAX_FAST_EXTENTS 16 >>>> +#define XFS_EFI_MAX_FAST_EXTENTS 1 >>> >>> IIRC, this doesn't have anything to do with the number of extents an >>> EFI can hold. All it does is control how the memory for the EFI >>> allocated. >> >> Yes, it ensures that one EFI contains at most one extent. And because each >> deferred intent goes with one transaction roll, it would solve the AGFL allocation >> deadlock (because no busy extents held by the process when it is doing the >> AGFL allocation). >> >>> Oh, at some point it got overloaded code to define the max items in >>> a defer ops work item. Ok, I now see why you changed this, but I >>> don't think this is right way to solve the problem. We can handle >>> processing multiple extents per EFI just fine, we just need to >>> update the EFD and roll the transaction on each extent we process, >>> yes? >>> >> >> I am not quite sure what does “update the EFD” mean. > > Historical terminology, see below. > >> My original concern is that (without your updated EFD), the extents in original EFI can be partially done before a crush. And during the recovery, the already done extents would also be replayed and hit error (because the in-place metadata could be flushed since the transaction is rolled.). >> >> Now consider your “update the EFD”, you meant the following? >> >> EFI: ID: THISISID1 extent1 extent2 >> free extent extent1 >> EFD: ID: THISISID1 extent1 >> free extent extent2 >> another EFD: ID: THISISID1 (same ID as above) extent2 > > Yes, that's pretty much how multi-extent EFIs used to work, except > the second and subsequent EFDs recorded all the extents that had > been freed. That way recovery could simply find the EFD with the > highest LSN in the log to determine what part of the EFI had not > been replayed. Good to know it. > > We don't do that anymore for partially processed multi-extent > intents anymore. Instead, we use deferred ops to chain updates. i.e. > we log a complete intent done items alongside a new intent > containing the remaining work to be done in the same transaction. > This cancels the original intent and atomically replaces it with a > new intent containing the remaining work to be done. > > So when I say "update the EFD" I'm using historic terminology for > processing and recovering multi-extent intents. In modern terms, > what I mean is "update the deferred work intent chain to reflect the > work remaining to be done". OK. so let’s see the difference between your implementation from mine. Say, there are three extents to free. My patch would result in: EFI 1 with extent1 free extent1 EFD 1 with extent1 transaction roll EFI 2 with extent2 free extent2 EFD 2 with extent2 transaction roll EFI 3 with extent3 free extent3 EFD3 with extent3 transaction commit The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint. Your idea yields this: EFI 1 with extent1 extent2 extent3 free extent1 EFI 2 with extent2 extent3 EFD 1 with extent1 extent2 extent3 transaction commit create transaction free extent2 EFI 3 with extent3 EFD 2 with extent extent2 extent3 transaction commit create transaction free extent3 EFD 3 with extent3 transaction commit. Your implementation also includes three EFI/EFD pairs, that’s the same as mine. So actually it’s still one extent per EFI per transaction. Though you are not changing XFS_EFI_MAX_FAST_EXTENTS. And your implementation may use more log space than mine in case the EFI (and EFD pair) can’t be cancelled. :D The only difference if that you use transaction commit and I am using transaction roll which is not safe as you said. Is my understanding correct? One question is that if only one EFI is safe per transaction, how we ensure that there is only one EFI per transaction in case there are more than 16 (current XFS_EFI_MAX_FAST_EXTENTS) extents to free in current code? thanks, wengang ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: IO time one extent per EFI 2023-04-21 0:24 ` Wengang Wang @ 2023-04-21 9:34 ` Dave Chinner 2023-04-21 18:23 ` Wengang Wang 0 siblings, 1 reply; 16+ messages in thread From: Dave Chinner @ 2023-04-21 9:34 UTC (permalink / raw) To: Wengang Wang; +Cc: linux-xfs@vger.kernel.org On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote: > > On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@fromorbit.com> wrote: > > We don't do that anymore for partially processed multi-extent > > intents anymore. Instead, we use deferred ops to chain updates. i.e. > > we log a complete intent done items alongside a new intent > > containing the remaining work to be done in the same transaction. > > This cancels the original intent and atomically replaces it with a > > new intent containing the remaining work to be done. > > > > So when I say "update the EFD" I'm using historic terminology for > > processing and recovering multi-extent intents. In modern terms, > > what I mean is "update the deferred work intent chain to reflect the > > work remaining to be done". > > OK. so let’s see the difference between your implementation from mine. > Say, there are three extents to free. > > My patch would result in: > > EFI 1 with extent1 > free extent1 > EFD 1 with extent1 > transaction roll > EFI 2 with extent2 > free extent2 > EFD 2 with extent2 > transaction roll > EFI 3 with extent3 > free extent3 > EFD3 with extent3 > transaction commit No, it wouldn't. This isn't how the deferred work processes work items on the transaction. A work item with multiple extents on it would result in this: xfs_defer_finish(tp) # tp contains three xefi work items xfs_defer_finish_noroll xfs_defer_create_intents() list_for_each_defer_pending xfs_defer_create_intent(dfp) ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort); xfs_extent_free_create_intent() <create EFI> list_for_each_xefi xfs_extent_free_log_item(xefi) <adds extent to current EFI> xfs_defer_trans_roll() <save> xfs_trans_roll() xfs_trans_dup() xfs_trans_commit() <restore> At this point we have this committed to the CIL EFI 1 with extent1 EFI 2 with extent2 EFI 3 with extent3 And xfs_defer_finish_noroll() continues with <grabs first work item> xfs_defer_finish_one(dfp) ->create_done(EFI 1) xfs_extent_free_create_done <create EFD> list_for_each_xefi ops->finish_item(tp, dfp->dfp_done, li, &state); xfs_extent_free_finish_item() xfs_trans_free_extent __xfs_free_extent <adds extent to EFD> And once the processing of the single work item is done we loop back to the start of the xfs_defer_finish_noroll() loop. We don't have any new intents, so xfs_defer_create_intents() returns false, but we completed a dfp work item, so we run: xfs_defer_trans_roll() <save> xfs_trans_roll() xfs_trans_dup() xfs_trans_commit() <restore> At this point we have this committed to the CIL EFI 1 with extent1 EFI 2 with extent2 EFI 3 with extent3 <AGF, AGFL, free space btree block mods> EFD 1 with extent1 Then we run xfs_defer_finish_one() on EFI 2, commit, then run xfs_defer_finish_one() on EFI 3. At this point, we have in the log: EFI 1 with extent1 EFI 2 with extent2 EFI 3 with extent3 <AGF, AGFL, free space btree block mods> EFD 1 with extent1 <AGF, AGFL, free space btree block mods> EFD 2 with extent2 But we have not committed the final extent free or EFD 3 - that's in the last transaction context we pass back to the _xfs_trans_commit() context for it to finalise and close off the rolling transaction chain. At this point, we finally have this in the CIL: EFI 1 with extent1 EFI 2 with extent2 EFI 3 with extent3 <AGF, AGFL, free space btree block mods> EFD 1 with extent1 <AGF, AGFL, free space btree block mods> EFD 2 with extent2 <AGF, AGFL, free space btree block mods> EFD 3 with extent3 > The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint. I *always* ignore CIL intent whiteouts when thinking about transaction chains and intents. That is purely a journal efficiency optimisation, not something that is necessary for correct operation. > Your idea yields this: > > EFI 1 with extent1 extent2 extent3 > free extent1 > EFI 2 with extent2 extent3 > EFD 1 with extent1 extent2 extent3 > transaction commit > create transaction > free extent2 > EFI 3 with extent3 > EFD 2 with extent extent2 extent3 > transaction commit > create transaction > free extent3 > EFD 3 with extent3 > transaction commit. The EFI/EFD contents are correct, but the rest of it is not - I am not suggesting open coding transaction rolling like that. Everything I am suggesting works through the same defer ops mechanism as you are describing. So if we start with the initial journal contents looks like this: EFI 1 with extent1 extent2 extent3. Then we run xfs_defer_finish_one() on that work, xfs_defer_finish_one(dfp) ->create_done(EFI 1) xfs_extent_free_create_done <create EFD> list_for_each_xefi ops->finish_item(tp, dfp->dfp_done, li, &state); xfs_extent_free_finish_item() xfs_trans_free_extent __xfs_free_extent <adds extent to EFD> But now we have 3 xefis on the work to process. So on success of the first call to xfs_trans_free_extent(), we want it to return -EAGAIN to trigger the existing relogging code to create the new EFI. How this works is describe in the section "Requesting a Fresh Transaction while Finishing Deferred Work" in fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here. The result is that the deferred work infrastructure does the work of updaing the done intent and creating the new intents for the work remaining. Hence after the next transaction roll, we have in the CIL EFI 1 with extent1 extent2 extent3. <AGF, AGFL, free space btree block mods> EFD 1 with extent1 extent2 extent3. EFI 2 with extent2 extent3. And so the loop goes until there is no more work remaining. > Your implementation also includes three EFI/EFD pairs, that’s the same as mine. > So actually it’s still one extent per EFI per transaction. Though you are not changing > XFS_EFI_MAX_FAST_EXTENTS. The difference is that what I'm suggesting is that the extent free code can decide when it needs a transaction to be rolled. It isn't forced to always run a single free per transaction, it can decide that it can free multiple extents per transaction if there is no risk of deadlocks (e.g. extents are in different AGs). Forcing everything to be limited to a transaction per extent free even when there is no risk of deadlocks freeing multiple extents in a single transaction is unnecessary. Indeed, if the second extent is in a different AG, we don't risk busy extents causing us issues, so we could do: EFI 1 with extent1 extent2 extent3. <AGF 1, AGFL 1, free space btree block mods> <AGF 2, AGFL 2, free space btree block mods> EFD 1 with extent1 extent2 extent3. EFI 2 with extent3. ..... The difference is that limiting the number of xefi items per deferred work item means we can only process a single extent per work item regardless of the current context. Using the existing defered work "on demand transaction rolling" mechanism I'm suggesting we use doesn't put any artificial constrains on how we log and process the intents. This allows us to aggregate multiple extent frees into a single transaction when there is no risk associated with doing so and we have sufficient transaction reservations remaining to guarantee we can free the extent. It's far more efficient, and the infrastructure we have in place already supports this sort of functionality.... When we go back to the original problem of the AGFL allocation needing to roll the transaction to get busy extents processed, then we could have it return -EAGAIN all the way back up to the defer-ops level and we simply then roll the transaction and retry the same extent free operation again. i.e. where extent freeing needs to block waiting on stuff like busy extents, we could now have these commit the current transaction, push the current work item to the back of the current context's queue and come back to it later. At this point, I think the "single extent per transaction" constraint that is needed to avoid the busy extent deadlock goes away, and with it the need for limiting EFI processing to a single extent goes away too.... > And your implementation may use more log space than mine in case the EFI > (and EFD pair) can’t be cancelled. :D True, but it's really not a concern. Don't conflate "intent recovery intent amplification can cause log space deadlocks" with "intent size is a problem". :) > The only difference if that you use transaction commit and I am using transaction roll > which is not safe as you said. > > Is my understanding correct? I think I understand where we are misunderstanding each other :) Perhaps it is now obvious to you as well from the analysis above. If so, ignore the rest of this :) What does "transaction roll" actually mean? XFS has a concept of "rolling transactions". These are made up of a series of individual transaction contexts that are linked together by passing a single log reservation ticket between transaction contexts. xfs_trans_roll() effectively does: ntp = xfs_trans_dup(tp) .... xfs_trans_commit(tp) return ntp; i.e. it duplicates the current transaction handle, takes the unused block reservation from it, grabs the log reservation ticket from it, moves the defered ops from the old to the new transaction context, then commits the old transaction context and returns the new one. tl;dr: a rolling transaction is a series of linked independent transactions that share a common set of block and log reservations. To make a rolling transaction chain an atomic operation on a specific object (e.g. an inode) that object has to remain locked and be logged in every transaction in the chain of rolling transactions. This keeps it moving forward in the log and prevents it from pinning the tail of the log and causing log space deadlocks. Fundamentally, though, each individual transaction in a rolling transaction is an independent atomic change set. So when you say "roll the transaction", what you are actually saying is "commit the current transaction and start a new transaction using the reservations the current transaction already holds." When I say "transaction commit" I am only refering to the process that closes off the current transaction. If this is in the middle of a rolling transaction, then what I am talking about is xfs_trans_roll() calling xfs_trans_commit() after it has duplicated the current transaction context. Transaction contexts running defered operations, intent chains, etc are *always* rolling transactions, and each time we roll the transaction we commit it. IOWS, when I say "transaction commit" and you say "transaction roll" we are talking about exactly the same thing - transaction commit is the operation that roll performs to finish off the current change set... Ideally, nobody should be calling xfs_trans_roll() directly anymore. All rolling transactions should be implemented using deferred ops nowdays - xfs_trans_roll() is the historic method of running rolling transactions. e.g. see the recent rework of the attribute code to use deferred ops rather than explicit calls to xfs_trans_roll() so we can use intents for running attr operations. These days the transaction model is based around chains of deferred operations. We attach deferred work to the transaction, and then when we call xfs_trans_commit() it goes off into the deferred work infrastructure that creates intents and manages the transaction context for processing the intents itself. This is the primary reason we are trying to move away from high level code using transaction rolling - we can't easily change the way we process operations when the high level code controls transaction contexts. Using deferred intent chains gives us fine grained control over transaction context in the deferred ops processing code - we can roll to a new transaction whenever we need to.... > One question is that if only one EFI is safe per transaction, how > we ensure that there is only one EFI per transaction in case there > are more than 16 (current XFS_EFI_MAX_FAST_EXTENTS) extents to > free in current code? That will handled exactly the same way it does with your change - it will simply split up the work items so there are multiple intents logged. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: IO time one extent per EFI 2023-04-21 9:34 ` Dave Chinner @ 2023-04-21 18:23 ` Wengang Wang 2023-04-22 3:22 ` Wengang Wang 0 siblings, 1 reply; 16+ messages in thread From: Wengang Wang @ 2023-04-21 18:23 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs@vger.kernel.org > On Apr 21, 2023, at 2:34 AM, Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote: >>> On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@fromorbit.com> wrote: >>> We don't do that anymore for partially processed multi-extent >>> intents anymore. Instead, we use deferred ops to chain updates. i.e. >>> we log a complete intent done items alongside a new intent >>> containing the remaining work to be done in the same transaction. >>> This cancels the original intent and atomically replaces it with a >>> new intent containing the remaining work to be done. >>> >>> So when I say "update the EFD" I'm using historic terminology for >>> processing and recovering multi-extent intents. In modern terms, >>> what I mean is "update the deferred work intent chain to reflect the >>> work remaining to be done". >> >> OK. so let’s see the difference between your implementation from mine. >> Say, there are three extents to free. >> >> My patch would result in: >> >> EFI 1 with extent1 >> free extent1 >> EFD 1 with extent1 >> transaction roll >> EFI 2 with extent2 >> free extent2 >> EFD 2 with extent2 >> transaction roll >> EFI 3 with extent3 >> free extent3 >> EFD3 with extent3 >> transaction commit > > No, it wouldn't. This isn't how the deferred work processes work > items on the transaction. A work item with multiple extents on it > would result in this: > > xfs_defer_finish(tp) # tp contains three xefi work items > xfs_defer_finish_noroll > xfs_defer_create_intents() > list_for_each_defer_pending > xfs_defer_create_intent(dfp) > ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort); > xfs_extent_free_create_intent() > <create EFI> > list_for_each_xefi > xfs_extent_free_log_item(xefi) > <adds extent to current EFI> > > xfs_defer_trans_roll() > <save> > xfs_trans_roll() > xfs_trans_dup() > xfs_trans_commit() > <restore> > > At this point we have this committed to the CIL > > EFI 1 with extent1 > EFI 2 with extent2 > EFI 3 with extent3 > > And xfs_defer_finish_noroll() continues with > > <grabs first work item> > xfs_defer_finish_one(dfp) > ->create_done(EFI 1) > xfs_extent_free_create_done > <create EFD> > list_for_each_xefi > ops->finish_item(tp, dfp->dfp_done, li, &state); > xfs_extent_free_finish_item() > xfs_trans_free_extent > __xfs_free_extent > <adds extent to EFD> > > And once the processing of the single work item is done we loop > back to the start of the xfs_defer_finish_noroll() loop. We don't > have any new intents, so xfs_defer_create_intents() returns false, > but we completed a dfp work item, so we run: > > xfs_defer_trans_roll() > <save> > xfs_trans_roll() > xfs_trans_dup() > xfs_trans_commit() > <restore> > > At this point we have this committed to the CIL > > EFI 1 with extent1 > EFI 2 with extent2 > EFI 3 with extent3 > <AGF, AGFL, free space btree block mods> > EFD 1 with extent1 > > Then we run xfs_defer_finish_one() on EFI 2, commit, then run > xfs_defer_finish_one() on EFI 3. At this point, we have in the log: > > EFI 1 with extent1 > EFI 2 with extent2 > EFI 3 with extent3 > <AGF, AGFL, free space btree block mods> > EFD 1 with extent1 > <AGF, AGFL, free space btree block mods> > EFD 2 with extent2 > > But we have not committed the final extent free or EFD 3 - that's in > the last transaction context we pass back to the _xfs_trans_commit() > context for it to finalise and close off the rolling transaction > chain. At this point, we finally have this in the CIL: > > EFI 1 with extent1 > EFI 2 with extent2 > EFI 3 with extent3 > <AGF, AGFL, free space btree block mods> > EFD 1 with extent1 > <AGF, AGFL, free space btree block mods> > EFD 2 with extent2 > <AGF, AGFL, free space btree block mods> > EFD 3 with extent3 Yes, correct. thanks for so much details! I mis-read xfs_defer_create_intents() thinking it only create intent for the first in tp->t_dfops. > >> The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint. > > I *always* ignore CIL intent whiteouts when thinking about > transaction chains and intents. That is purely a journal efficiency > optimisation, not something that is necessary for correct operation. OK. > >> Your idea yields this: >> >> EFI 1 with extent1 extent2 extent3 >> free extent1 >> EFI 2 with extent2 extent3 >> EFD 1 with extent1 extent2 extent3 >> transaction commit >> create transaction >> free extent2 >> EFI 3 with extent3 >> EFD 2 with extent extent2 extent3 >> transaction commit >> create transaction >> free extent3 >> EFD 3 with extent3 >> transaction commit. > > The EFI/EFD contents are correct, but the rest of it is not - I am > not suggesting open coding transaction rolling like that. Everything > I am suggesting works through the same defer ops mechanism as you > are describing. So if we start with the initial journal contents > looks like this: > > EFI 1 with extent1 extent2 extent3. > > Then we run xfs_defer_finish_one() on that work, > > xfs_defer_finish_one(dfp) > ->create_done(EFI 1) > xfs_extent_free_create_done > <create EFD> > list_for_each_xefi > ops->finish_item(tp, dfp->dfp_done, li, &state); > xfs_extent_free_finish_item() > xfs_trans_free_extent > __xfs_free_extent > <adds extent to EFD> > > But now we have 3 xefis on the work to process. So on success of > the first call to xfs_trans_free_extent(), we want it to return > -EAGAIN to trigger the existing relogging code to create the new > EFI. How this works is describe in the section "Requesting a > Fresh Transaction while Finishing Deferred Work" in > fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here. > > The result is that the deferred work infrastructure does the work > of updaing the done intent and creating the new intents for the work > remaining. Hence after the next transaction roll, we have in the CIL > > EFI 1 with extent1 extent2 extent3. > <AGF, AGFL, free space btree block mods> > EFD 1 with extent1 extent2 extent3. > EFI 2 with extent2 extent3. > Taking transaction rolls into account (also adding up to EFD3), above would be: EFI 1 with extent1 extent2 extent3. transaction roll <AGF, AGFL, free space btree block mods> for extent 1 EFD 1 with extent1 extent2 extent3. EFI 2 with extent2 extent3. transaction roll free extent 2 EFD 2 with extent2 extent3 EFI 3 with extent3 transaction roll free extent 3 EFD 3 with extent3 Because EFD N is always be with EFI N+1 in the SAME transaction, so EFI N+1 doesn’t have to be placed before EFD N. I got it. > And so the loop goes until there is no more work remaining. > >> Your implementation also includes three EFI/EFD pairs, that’s the same as mine. >> So actually it’s still one extent per EFI per transaction. Though you are not changing >> XFS_EFI_MAX_FAST_EXTENTS. > > The difference is that what I'm suggesting is that the extent free > code can decide when it needs a transaction to be rolled. It isn't > forced to always run a single free per transaction, it can decide > that it can free multiple extents per transaction if there is no > risk of deadlocks (e.g. extents are in different AGs). Forcing > everything to be limited to a transaction per extent free even when > there is no risk of deadlocks freeing multiple extents in a single > transaction is unnecessary. > > Indeed, if the second extent is in a different AG, we don't risk > busy extents causing us issues, so we could do: > > EFI 1 with extent1 extent2 extent3. > <AGF 1, AGFL 1, free space btree block mods> > <AGF 2, AGFL 2, free space btree block mods> > EFD 1 with extent1 extent2 extent3. > EFI 2 with extent3. > ..... > My thumb is up. > The difference is that limiting the number of xefi items per > deferred work item means we can only process a single extent per > work item regardless of the current context. > > Using the existing defered work "on demand transaction rolling" > mechanism I'm suggesting we use doesn't put any artificial > constrains on how we log and process the intents. This allows us to > aggregate multiple extent frees into a single transaction when there > is no risk associated with doing so and we have sufficient > transaction reservations remaining to guarantee we can free the > extent. It's far more efficient, and the infrastructure we have in > place already supports this sort of functionality.... > > When we go back to the original problem of the AGFL allocation > needing to roll the transaction to get busy extents processed, then > we could have it return -EAGAIN all the way back up to the defer-ops > level and we simply then roll the transaction and retry the same > extent free operation again. i.e. where extent freeing needs to > block waiting on stuff like busy extents, we could now have these > commit the current transaction, push the current work item to the > back of the current context's queue and come back to it later. > > At this point, I think the "single extent per transaction" > constraint that is needed to avoid the busy extent deadlock goes > away, and with it the need for limiting EFI processing to a single > extent goes away too.... I am pretty clear now. > >> And your implementation may use more log space than mine in case the EFI >> (and EFD pair) can’t be cancelled. :D > > True, but it's really not a concern. Don't conflate "intent > recovery intent amplification can cause log space deadlocks" with > "intent size is a problem". :) > Got it. >> The only difference if that you use transaction commit and I am using transaction roll >> which is not safe as you said. >> >> Is my understanding correct? > > I think I understand where we are misunderstanding each other :) > Perhaps it is now obvious to you as well from the analysis above. > If so, ignore the rest of this :) > > What does "transaction roll" actually mean? > > XFS has a concept of "rolling transactions". These are made up of a > series of individual transaction contexts that are linked together > by passing a single log reservation ticket between transaction > contexts. > > xfs_trans_roll() effectively does: > > ntp = xfs_trans_dup(tp) > .... > xfs_trans_commit(tp) > > return ntp; > > i.e. it duplicates the current transaction handle, takes the unused > block reservation from it, grabs the log reservation ticket from it, > moves the defered ops from the old to the new transaction context, > then commits the old transaction context and returns the new one. > > tl;dr: a rolling transaction is a series of linked independent > transactions that share a common set of block and log reservations. > > To make a rolling transaction chain an atomic operation on a > specific object (e.g. an inode) that object has to remain locked and > be logged in every transaction in the chain of rolling transactions. > This keeps it moving forward in the log and prevents it from pinning > the tail of the log and causing log space deadlocks. > > Fundamentally, though, each individual transaction in a rolling > transaction is an independent atomic change set. So when you say > "roll the transaction", what you are actually saying is "commit the > current transaction and start a new transaction using the > reservations the current transaction already holds." > > When I say "transaction commit" I am only refering to the process > that closes off the current transaction. If this is in the middle of > a rolling transaction, then what I am talking about is > xfs_trans_roll() calling xfs_trans_commit() after it has duplicated > the current transaction context. > > Transaction contexts running defered operations, intent chains, etc > are *always* rolling transactions, and each time we roll the > transaction we commit it. > > IOWS, when I say "transaction commit" and you say "transaction roll" > we are talking about exactly the same thing - transaction commit is > the operation that roll performs to finish off the current change > set... > > Ideally, nobody should be calling xfs_trans_roll() directly anymore. > All rolling transactions should be implemented using deferred ops > nowdays - xfs_trans_roll() is the historic method of running rolling > transactions. e.g. see the recent rework of the attribute code to > use deferred ops rather than explicit calls to xfs_trans_roll() so > we can use intents for running attr operations. > > These days the transaction model is based around chains of deferred > operations. We attach deferred work to the transaction, and then > when we call xfs_trans_commit() it goes off into the deferred work > infrastructure that creates intents and manages the transaction > context for processing the intents itself. > > This is the primary reason we are trying to move away from high > level code using transaction rolling - we can't easily change the > way we process operations when the high level code controls > transaction contexts. Using deferred intent chains gives us fine > grained control over transaction context in the deferred ops > processing code - we can roll to a new transaction whenever we need > to.... > Above is really helpful. I when I mention transaction roll, I meant 1) a new transaction is created, but its self does’t reserve any resource. Instead, it inherits all the unused resources from the old transaction. 2) commit the old transaction. 3) don’t un-reserve unused resources from old transaction, the new transaction will inherits them. 4) use the new transaction for further log items. As I understand, the key is that the new transaction its self doesn’t reserve new resources. And when I read your “transaction commit” I understood it as this: 1). commit old transaction 2) un-reserve unused resources from old transaction 3) create new transaction with all resources reserved. Thus in my understand your “transaction commit” would have no risk of lack of log space to put the extra EFI/EFDs. But reading above, I’d think you meant “transaction roll” actually. >> One question is that if only one EFI is safe per transaction, how >> we ensure that there is only one EFI per transaction in case there >> are more than 16 (current XFS_EFI_MAX_FAST_EXTENTS) extents to >> free in current code? > > That will handled exactly the same way it does with your change - it > will simply split up the work items so there are multiple intents > logged. I’d like to make it more clear. You were saying that during log recover stage, there may be no enough log space for the extra EFI/EFD (splitted from original multiple extents EFI). But here (non-recovery stage), seems you have no concern logging more EFI/EFDs. So why they are different? thanks, wengang ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: IO time one extent per EFI 2023-04-21 18:23 ` Wengang Wang @ 2023-04-22 3:22 ` Wengang Wang 2023-04-24 15:53 ` Wengang Wang 0 siblings, 1 reply; 16+ messages in thread From: Wengang Wang @ 2023-04-22 3:22 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs@vger.kernel.org > On Apr 21, 2023, at 11:23 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote: > > > >> On Apr 21, 2023, at 2:34 AM, Dave Chinner <david@fromorbit.com> wrote: >> >> On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote: >>>> On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@fromorbit.com> wrote: >>>> We don't do that anymore for partially processed multi-extent >>>> intents anymore. Instead, we use deferred ops to chain updates. i.e. >>>> we log a complete intent done items alongside a new intent >>>> containing the remaining work to be done in the same transaction. >>>> This cancels the original intent and atomically replaces it with a >>>> new intent containing the remaining work to be done. >>>> >>>> So when I say "update the EFD" I'm using historic terminology for >>>> processing and recovering multi-extent intents. In modern terms, >>>> what I mean is "update the deferred work intent chain to reflect the >>>> work remaining to be done". >>> >>> OK. so let’s see the difference between your implementation from mine. >>> Say, there are three extents to free. >>> >>> My patch would result in: >>> >>> EFI 1 with extent1 >>> free extent1 >>> EFD 1 with extent1 >>> transaction roll >>> EFI 2 with extent2 >>> free extent2 >>> EFD 2 with extent2 >>> transaction roll >>> EFI 3 with extent3 >>> free extent3 >>> EFD3 with extent3 >>> transaction commit >> >> No, it wouldn't. This isn't how the deferred work processes work >> items on the transaction. A work item with multiple extents on it >> would result in this: >> >> xfs_defer_finish(tp) # tp contains three xefi work items >> xfs_defer_finish_noroll >> xfs_defer_create_intents() >> list_for_each_defer_pending >> xfs_defer_create_intent(dfp) >> ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort); >> xfs_extent_free_create_intent() >> <create EFI> >> list_for_each_xefi >> xfs_extent_free_log_item(xefi) >> <adds extent to current EFI> >> >> xfs_defer_trans_roll() >> <save> >> xfs_trans_roll() >> xfs_trans_dup() >> xfs_trans_commit() >> <restore> >> >> At this point we have this committed to the CIL >> >> EFI 1 with extent1 >> EFI 2 with extent2 >> EFI 3 with extent3 >> >> And xfs_defer_finish_noroll() continues with >> >> <grabs first work item> >> xfs_defer_finish_one(dfp) >> ->create_done(EFI 1) >> xfs_extent_free_create_done >> <create EFD> >> list_for_each_xefi >> ops->finish_item(tp, dfp->dfp_done, li, &state); >> xfs_extent_free_finish_item() >> xfs_trans_free_extent >> __xfs_free_extent >> <adds extent to EFD> >> >> And once the processing of the single work item is done we loop >> back to the start of the xfs_defer_finish_noroll() loop. We don't >> have any new intents, so xfs_defer_create_intents() returns false, >> but we completed a dfp work item, so we run: >> >> xfs_defer_trans_roll() >> <save> >> xfs_trans_roll() >> xfs_trans_dup() >> xfs_trans_commit() >> <restore> >> >> At this point we have this committed to the CIL >> >> EFI 1 with extent1 >> EFI 2 with extent2 >> EFI 3 with extent3 >> <AGF, AGFL, free space btree block mods> >> EFD 1 with extent1 >> >> Then we run xfs_defer_finish_one() on EFI 2, commit, then run >> xfs_defer_finish_one() on EFI 3. At this point, we have in the log: >> >> EFI 1 with extent1 >> EFI 2 with extent2 >> EFI 3 with extent3 >> <AGF, AGFL, free space btree block mods> >> EFD 1 with extent1 >> <AGF, AGFL, free space btree block mods> >> EFD 2 with extent2 >> >> But we have not committed the final extent free or EFD 3 - that's in >> the last transaction context we pass back to the _xfs_trans_commit() >> context for it to finalise and close off the rolling transaction >> chain. At this point, we finally have this in the CIL: >> >> EFI 1 with extent1 >> EFI 2 with extent2 >> EFI 3 with extent3 >> <AGF, AGFL, free space btree block mods> >> EFD 1 with extent1 >> <AGF, AGFL, free space btree block mods> >> EFD 2 with extent2 >> <AGF, AGFL, free space btree block mods> >> EFD 3 with extent3 > > > Yes, correct. thanks for so much details! I mis-read xfs_defer_create_intents() > thinking it only create intent for the first in tp->t_dfops. > >> >>> The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint. >> >> I *always* ignore CIL intent whiteouts when thinking about >> transaction chains and intents. That is purely a journal efficiency >> optimisation, not something that is necessary for correct operation. > > OK. > >> >>> Your idea yields this: >>> >>> EFI 1 with extent1 extent2 extent3 >>> free extent1 >>> EFI 2 with extent2 extent3 >>> EFD 1 with extent1 extent2 extent3 >>> transaction commit >>> create transaction >>> free extent2 >>> EFI 3 with extent3 >>> EFD 2 with extent extent2 extent3 >>> transaction commit >>> create transaction >>> free extent3 >>> EFD 3 with extent3 >>> transaction commit. >> >> The EFI/EFD contents are correct, but the rest of it is not - I am >> not suggesting open coding transaction rolling like that. Everything >> I am suggesting works through the same defer ops mechanism as you >> are describing. So if we start with the initial journal contents >> looks like this: >> >> EFI 1 with extent1 extent2 extent3. >> >> Then we run xfs_defer_finish_one() on that work, >> >> xfs_defer_finish_one(dfp) >> ->create_done(EFI 1) >> xfs_extent_free_create_done >> <create EFD> >> list_for_each_xefi >> ops->finish_item(tp, dfp->dfp_done, li, &state); >> xfs_extent_free_finish_item() >> xfs_trans_free_extent >> __xfs_free_extent >> <adds extent to EFD> >> >> But now we have 3 xefis on the work to process. So on success of >> the first call to xfs_trans_free_extent(), we want it to return >> -EAGAIN to trigger the existing relogging code to create the new >> EFI. How this works is describe in the section "Requesting a >> Fresh Transaction while Finishing Deferred Work" in >> fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here. >> >> The result is that the deferred work infrastructure does the work >> of updaing the done intent and creating the new intents for the work >> remaining. Hence after the next transaction roll, we have in the CIL >> >> EFI 1 with extent1 extent2 extent3. >> <AGF, AGFL, free space btree block mods> >> EFD 1 with extent1 extent2 extent3. >> EFI 2 with extent2 extent3. >> > > Taking transaction rolls into account (also adding up to EFD3), above would be: > > EFI 1 with extent1 extent2 extent3. > transaction roll > <AGF, AGFL, free space btree block mods> for extent 1 > EFD 1 with extent1 extent2 extent3. > EFI 2 with extent2 extent3. > transaction roll > free extent 2 > EFD 2 with extent2 extent3 > EFI 3 with extent3 > transaction roll > free extent 3 > EFD 3 with extent3 > > Because EFD N is always be with EFI N+1 in the SAME transaction, so EFI N+1 doesn’t have to be placed before EFD N. > I got it. > >> And so the loop goes until there is no more work remaining. >> >>> Your implementation also includes three EFI/EFD pairs, that’s the same as mine. >>> So actually it’s still one extent per EFI per transaction. Though you are not changing >>> XFS_EFI_MAX_FAST_EXTENTS. >> >> The difference is that what I'm suggesting is that the extent free >> code can decide when it needs a transaction to be rolled. It isn't >> forced to always run a single free per transaction, it can decide >> that it can free multiple extents per transaction if there is no >> risk of deadlocks (e.g. extents are in different AGs). Forcing >> everything to be limited to a transaction per extent free even when >> there is no risk of deadlocks freeing multiple extents in a single >> transaction is unnecessary. >> >> Indeed, if the second extent is in a different AG, we don't risk >> busy extents causing us issues, so we could do: >> >> EFI 1 with extent1 extent2 extent3. >> <AGF 1, AGFL 1, free space btree block mods> >> <AGF 2, AGFL 2, free space btree block mods> >> EFD 1 with extent1 extent2 extent3. >> EFI 2 with extent3. >> ..... >> > > My thumb is up. Well, I am wondering if there is ABBA deadlock instead of self deadlock then. Say process 1 produced busy extent 1 on AG 0 and now blocking at AGFL allocation on AG 1. And at the same time, process 2 produced busy extent 2 on AG 1 and now blocking at AGFL allocation on AG 0. Anything prevents that from happening? Shall we introduce a per FS list, say named “pending_busy_AGs", where the AG number are stored there. For those AGs they have pending busy extents in in-memory transactions (not committed to CIL). We add the AG number to pending_busy_AGs at the start of xfs_trans_free_extent() if it’s not there, and continue to free the extent. Otherwise if the AG number is already there in pending_busy_AGs, we return -EAGAIN. The AG number is removed when the busy extents are committed to the xfs_cil_ctx. Well, wondering if the pending busy extents would remain in pending state long before committed to CIL and that become a bottle neck for freeing extents. thanks, wengang > >> The difference is that limiting the number of xefi items per >> deferred work item means we can only process a single extent per >> work item regardless of the current context. >> >> Using the existing defered work "on demand transaction rolling" >> mechanism I'm suggesting we use doesn't put any artificial >> constrains on how we log and process the intents. This allows us to >> aggregate multiple extent frees into a single transaction when there >> is no risk associated with doing so and we have sufficient >> transaction reservations remaining to guarantee we can free the >> extent. It's far more efficient, and the infrastructure we have in >> place already supports this sort of functionality.... >> >> When we go back to the original problem of the AGFL allocation >> needing to roll the transaction to get busy extents processed, then >> we could have it return -EAGAIN all the way back up to the defer-ops >> level and we simply then roll the transaction and retry the same >> extent free operation again. i.e. where extent freeing needs to >> block waiting on stuff like busy extents, we could now have these >> commit the current transaction, push the current work item to the >> back of the current context's queue and come back to it later. >> >> At this point, I think the "single extent per transaction" >> constraint that is needed to avoid the busy extent deadlock goes >> away, and with it the need for limiting EFI processing to a single >> extent goes away too.... > > I am pretty clear now. >> >>> And your implementation may use more log space than mine in case the EFI >>> (and EFD pair) can’t be cancelled. :D >> >> True, but it's really not a concern. Don't conflate "intent >> recovery intent amplification can cause log space deadlocks" with >> "intent size is a problem". :) >> > > Got it. >>> The only difference if that you use transaction commit and I am using transaction roll >>> which is not safe as you said. >>> >>> Is my understanding correct? >> >> I think I understand where we are misunderstanding each other :) >> Perhaps it is now obvious to you as well from the analysis above. >> If so, ignore the rest of this :) >> >> What does "transaction roll" actually mean? >> >> XFS has a concept of "rolling transactions". These are made up of a >> series of individual transaction contexts that are linked together >> by passing a single log reservation ticket between transaction >> contexts. >> >> xfs_trans_roll() effectively does: >> >> ntp = xfs_trans_dup(tp) >> .... >> xfs_trans_commit(tp) >> >> return ntp; >> >> i.e. it duplicates the current transaction handle, takes the unused >> block reservation from it, grabs the log reservation ticket from it, >> moves the defered ops from the old to the new transaction context, >> then commits the old transaction context and returns the new one. >> >> tl;dr: a rolling transaction is a series of linked independent >> transactions that share a common set of block and log reservations. >> >> To make a rolling transaction chain an atomic operation on a >> specific object (e.g. an inode) that object has to remain locked and >> be logged in every transaction in the chain of rolling transactions. >> This keeps it moving forward in the log and prevents it from pinning >> the tail of the log and causing log space deadlocks. >> >> Fundamentally, though, each individual transaction in a rolling >> transaction is an independent atomic change set. So when you say >> "roll the transaction", what you are actually saying is "commit the >> current transaction and start a new transaction using the >> reservations the current transaction already holds." >> >> When I say "transaction commit" I am only refering to the process >> that closes off the current transaction. If this is in the middle of >> a rolling transaction, then what I am talking about is >> xfs_trans_roll() calling xfs_trans_commit() after it has duplicated >> the current transaction context. >> >> Transaction contexts running defered operations, intent chains, etc >> are *always* rolling transactions, and each time we roll the >> transaction we commit it. >> >> IOWS, when I say "transaction commit" and you say "transaction roll" >> we are talking about exactly the same thing - transaction commit is >> the operation that roll performs to finish off the current change >> set... >> >> Ideally, nobody should be calling xfs_trans_roll() directly anymore. >> All rolling transactions should be implemented using deferred ops >> nowdays - xfs_trans_roll() is the historic method of running rolling >> transactions. e.g. see the recent rework of the attribute code to >> use deferred ops rather than explicit calls to xfs_trans_roll() so >> we can use intents for running attr operations. >> >> These days the transaction model is based around chains of deferred >> operations. We attach deferred work to the transaction, and then >> when we call xfs_trans_commit() it goes off into the deferred work >> infrastructure that creates intents and manages the transaction >> context for processing the intents itself. >> >> This is the primary reason we are trying to move away from high >> level code using transaction rolling - we can't easily change the >> way we process operations when the high level code controls >> transaction contexts. Using deferred intent chains gives us fine >> grained control over transaction context in the deferred ops >> processing code - we can roll to a new transaction whenever we need >> to.... >> > > Above is really helpful. > I when I mention transaction roll, I meant > 1) a new transaction is created, but its self does’t reserve any resource. > Instead, it inherits all the unused resources from the old transaction. > 2) commit the old transaction. > 3) don’t un-reserve unused resources from old transaction, the new transaction > will inherits them. > 4) use the new transaction for further log items. > > As I understand, the key is that the new transaction its self doesn’t reserve new resources. > > And when I read your “transaction commit” I understood it as this: > 1). commit old transaction > 2) un-reserve unused resources from old transaction > 3) create new transaction with all resources reserved. > > Thus in my understand your “transaction commit” would have no risk of lack of log space to put > the extra EFI/EFDs. But reading above, I’d think you meant “transaction roll” actually. > >>> One question is that if only one EFI is safe per transaction, how >>> we ensure that there is only one EFI per transaction in case there >>> are more than 16 (current XFS_EFI_MAX_FAST_EXTENTS) extents to >>> free in current code? >> >> That will handled exactly the same way it does with your change - it >> will simply split up the work items so there are multiple intents >> logged. > > I’d like to make it more clear. You were saying that during log recover stage, there may be no enough > log space for the extra EFI/EFD (splitted from original multiple extents EFI). But here (non-recovery stage), > seems you have no concern logging more EFI/EFDs. So why they are different? > > thanks, > wengang ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: IO time one extent per EFI 2023-04-22 3:22 ` Wengang Wang @ 2023-04-24 15:53 ` Wengang Wang 2023-04-24 22:52 ` Wengang Wang 0 siblings, 1 reply; 16+ messages in thread From: Wengang Wang @ 2023-04-24 15:53 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs@vger.kernel.org > On Apr 21, 2023, at 8:22 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote: > > > >> On Apr 21, 2023, at 11:23 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote: >> >> >> >>> On Apr 21, 2023, at 2:34 AM, Dave Chinner <david@fromorbit.com> wrote: >>> >>> On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote: >>>>> On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@fromorbit.com> wrote: >>>>> We don't do that anymore for partially processed multi-extent >>>>> intents anymore. Instead, we use deferred ops to chain updates. i.e. >>>>> we log a complete intent done items alongside a new intent >>>>> containing the remaining work to be done in the same transaction. >>>>> This cancels the original intent and atomically replaces it with a >>>>> new intent containing the remaining work to be done. >>>>> >>>>> So when I say "update the EFD" I'm using historic terminology for >>>>> processing and recovering multi-extent intents. In modern terms, >>>>> what I mean is "update the deferred work intent chain to reflect the >>>>> work remaining to be done". >>>> >>>> OK. so let’s see the difference between your implementation from mine. >>>> Say, there are three extents to free. >>>> >>>> My patch would result in: >>>> >>>> EFI 1 with extent1 >>>> free extent1 >>>> EFD 1 with extent1 >>>> transaction roll >>>> EFI 2 with extent2 >>>> free extent2 >>>> EFD 2 with extent2 >>>> transaction roll >>>> EFI 3 with extent3 >>>> free extent3 >>>> EFD3 with extent3 >>>> transaction commit >>> >>> No, it wouldn't. This isn't how the deferred work processes work >>> items on the transaction. A work item with multiple extents on it >>> would result in this: >>> >>> xfs_defer_finish(tp) # tp contains three xefi work items >>> xfs_defer_finish_noroll >>> xfs_defer_create_intents() >>> list_for_each_defer_pending >>> xfs_defer_create_intent(dfp) >>> ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort); >>> xfs_extent_free_create_intent() >>> <create EFI> >>> list_for_each_xefi >>> xfs_extent_free_log_item(xefi) >>> <adds extent to current EFI> >>> >>> xfs_defer_trans_roll() >>> <save> >>> xfs_trans_roll() >>> xfs_trans_dup() >>> xfs_trans_commit() >>> <restore> >>> >>> At this point we have this committed to the CIL >>> >>> EFI 1 with extent1 >>> EFI 2 with extent2 >>> EFI 3 with extent3 >>> >>> And xfs_defer_finish_noroll() continues with >>> >>> <grabs first work item> >>> xfs_defer_finish_one(dfp) >>> ->create_done(EFI 1) >>> xfs_extent_free_create_done >>> <create EFD> >>> list_for_each_xefi >>> ops->finish_item(tp, dfp->dfp_done, li, &state); >>> xfs_extent_free_finish_item() >>> xfs_trans_free_extent >>> __xfs_free_extent >>> <adds extent to EFD> >>> >>> And once the processing of the single work item is done we loop >>> back to the start of the xfs_defer_finish_noroll() loop. We don't >>> have any new intents, so xfs_defer_create_intents() returns false, >>> but we completed a dfp work item, so we run: >>> >>> xfs_defer_trans_roll() >>> <save> >>> xfs_trans_roll() >>> xfs_trans_dup() >>> xfs_trans_commit() >>> <restore> >>> >>> At this point we have this committed to the CIL >>> >>> EFI 1 with extent1 >>> EFI 2 with extent2 >>> EFI 3 with extent3 >>> <AGF, AGFL, free space btree block mods> >>> EFD 1 with extent1 >>> >>> Then we run xfs_defer_finish_one() on EFI 2, commit, then run >>> xfs_defer_finish_one() on EFI 3. At this point, we have in the log: >>> >>> EFI 1 with extent1 >>> EFI 2 with extent2 >>> EFI 3 with extent3 >>> <AGF, AGFL, free space btree block mods> >>> EFD 1 with extent1 >>> <AGF, AGFL, free space btree block mods> >>> EFD 2 with extent2 >>> >>> But we have not committed the final extent free or EFD 3 - that's in >>> the last transaction context we pass back to the _xfs_trans_commit() >>> context for it to finalise and close off the rolling transaction >>> chain. At this point, we finally have this in the CIL: >>> >>> EFI 1 with extent1 >>> EFI 2 with extent2 >>> EFI 3 with extent3 >>> <AGF, AGFL, free space btree block mods> >>> EFD 1 with extent1 >>> <AGF, AGFL, free space btree block mods> >>> EFD 2 with extent2 >>> <AGF, AGFL, free space btree block mods> >>> EFD 3 with extent3 >> >> >> Yes, correct. thanks for so much details! I mis-read xfs_defer_create_intents() >> thinking it only create intent for the first in tp->t_dfops. >> >>> >>>> The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint. >>> >>> I *always* ignore CIL intent whiteouts when thinking about >>> transaction chains and intents. That is purely a journal efficiency >>> optimisation, not something that is necessary for correct operation. >> >> OK. >> >>> >>>> Your idea yields this: >>>> >>>> EFI 1 with extent1 extent2 extent3 >>>> free extent1 >>>> EFI 2 with extent2 extent3 >>>> EFD 1 with extent1 extent2 extent3 >>>> transaction commit >>>> create transaction >>>> free extent2 >>>> EFI 3 with extent3 >>>> EFD 2 with extent extent2 extent3 >>>> transaction commit >>>> create transaction >>>> free extent3 >>>> EFD 3 with extent3 >>>> transaction commit. >>> >>> The EFI/EFD contents are correct, but the rest of it is not - I am >>> not suggesting open coding transaction rolling like that. Everything >>> I am suggesting works through the same defer ops mechanism as you >>> are describing. So if we start with the initial journal contents >>> looks like this: >>> >>> EFI 1 with extent1 extent2 extent3. >>> >>> Then we run xfs_defer_finish_one() on that work, >>> >>> xfs_defer_finish_one(dfp) >>> ->create_done(EFI 1) >>> xfs_extent_free_create_done >>> <create EFD> >>> list_for_each_xefi >>> ops->finish_item(tp, dfp->dfp_done, li, &state); >>> xfs_extent_free_finish_item() >>> xfs_trans_free_extent >>> __xfs_free_extent >>> <adds extent to EFD> >>> >>> But now we have 3 xefis on the work to process. So on success of >>> the first call to xfs_trans_free_extent(), we want it to return >>> -EAGAIN to trigger the existing relogging code to create the new >>> EFI. How this works is describe in the section "Requesting a >>> Fresh Transaction while Finishing Deferred Work" in >>> fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here. >>> >>> The result is that the deferred work infrastructure does the work >>> of updaing the done intent and creating the new intents for the work >>> remaining. Hence after the next transaction roll, we have in the CIL >>> >>> EFI 1 with extent1 extent2 extent3. >>> <AGF, AGFL, free space btree block mods> >>> EFD 1 with extent1 extent2 extent3. >>> EFI 2 with extent2 extent3. >>> >> >> Taking transaction rolls into account (also adding up to EFD3), above would be: >> >> EFI 1 with extent1 extent2 extent3. >> transaction roll >> <AGF, AGFL, free space btree block mods> for extent 1 >> EFD 1 with extent1 extent2 extent3. >> EFI 2 with extent2 extent3. >> transaction roll >> free extent 2 >> EFD 2 with extent2 extent3 >> EFI 3 with extent3 >> transaction roll >> free extent 3 >> EFD 3 with extent3 >> >> Because EFD N is always be with EFI N+1 in the SAME transaction, so EFI N+1 doesn’t have to be placed before EFD N. >> I got it. >> >>> And so the loop goes until there is no more work remaining. >>> >>>> Your implementation also includes three EFI/EFD pairs, that’s the same as mine. >>>> So actually it’s still one extent per EFI per transaction. Though you are not changing >>>> XFS_EFI_MAX_FAST_EXTENTS. >>> >>> The difference is that what I'm suggesting is that the extent free >>> code can decide when it needs a transaction to be rolled. It isn't >>> forced to always run a single free per transaction, it can decide >>> that it can free multiple extents per transaction if there is no >>> risk of deadlocks (e.g. extents are in different AGs). Forcing >>> everything to be limited to a transaction per extent free even when >>> there is no risk of deadlocks freeing multiple extents in a single >>> transaction is unnecessary. >>> >>> Indeed, if the second extent is in a different AG, we don't risk >>> busy extents causing us issues, so we could do: >>> >>> EFI 1 with extent1 extent2 extent3. >>> <AGF 1, AGFL 1, free space btree block mods> >>> <AGF 2, AGFL 2, free space btree block mods> >>> EFD 1 with extent1 extent2 extent3. >>> EFI 2 with extent3. >>> ..... >>> >> >> My thumb is up. > > Well, I am wondering if there is ABBA deadlock instead of self deadlock then. > Say process 1 produced busy extent 1 on AG 0 and now blocking at AGFL > allocation on AG 1. And at the same time, process 2 produced busy extent 2 on AG 1 > and now blocking at AGFL allocation on AG 0. Anything prevents that from happening? > > Shall we introduce a per FS list, say named “pending_busy_AGs", where the AG number > are stored there. For those AGs > they have pending busy extents in in-memory transactions (not committed to CIL). > We add the AG number to pending_busy_AGs at the start of xfs_trans_free_extent() > if it’s not there, and continue to free the extent. Otherwise if the AG number is already > there in pending_busy_AGs, we return -EAGAIN. The AG number is removed when > the busy extents are committed to the xfs_cil_ctx. > Well, wondering if the pending busy extents would remain in pending state long before > committed to CIL and that become a bottle neck for freeing extents. Thinking more, we could add per AG counter indicating current pending busy extents on the AG. For the 2nd+ extent in the EFI, return -EAGAIN on seeing the counter set (>0). As the first extent is always OK to go, there shouldn’t be a bottle neck. thanks, wengang > > thanks, > wengang > >> >>> The difference is that limiting the number of xefi items per >>> deferred work item means we can only process a single extent per >>> work item regardless of the current context. >>> >>> Using the existing defered work "on demand transaction rolling" >>> mechanism I'm suggesting we use doesn't put any artificial >>> constrains on how we log and process the intents. This allows us to >>> aggregate multiple extent frees into a single transaction when there >>> is no risk associated with doing so and we have sufficient >>> transaction reservations remaining to guarantee we can free the >>> extent. It's far more efficient, and the infrastructure we have in >>> place already supports this sort of functionality.... >>> >>> When we go back to the original problem of the AGFL allocation >>> needing to roll the transaction to get busy extents processed, then >>> we could have it return -EAGAIN all the way back up to the defer-ops >>> level and we simply then roll the transaction and retry the same >>> extent free operation again. i.e. where extent freeing needs to >>> block waiting on stuff like busy extents, we could now have these >>> commit the current transaction, push the current work item to the >>> back of the current context's queue and come back to it later. >>> >>> At this point, I think the "single extent per transaction" >>> constraint that is needed to avoid the busy extent deadlock goes >>> away, and with it the need for limiting EFI processing to a single >>> extent goes away too.... >> >> I am pretty clear now. >>> >>>> And your implementation may use more log space than mine in case the EFI >>>> (and EFD pair) can’t be cancelled. :D >>> >>> True, but it's really not a concern. Don't conflate "intent >>> recovery intent amplification can cause log space deadlocks" with >>> "intent size is a problem". :) >>> >> >> Got it. >>>> The only difference if that you use transaction commit and I am using transaction roll >>>> which is not safe as you said. >>>> >>>> Is my understanding correct? >>> >>> I think I understand where we are misunderstanding each other :) >>> Perhaps it is now obvious to you as well from the analysis above. >>> If so, ignore the rest of this :) >>> >>> What does "transaction roll" actually mean? >>> >>> XFS has a concept of "rolling transactions". These are made up of a >>> series of individual transaction contexts that are linked together >>> by passing a single log reservation ticket between transaction >>> contexts. >>> >>> xfs_trans_roll() effectively does: >>> >>> ntp = xfs_trans_dup(tp) >>> .... >>> xfs_trans_commit(tp) >>> >>> return ntp; >>> >>> i.e. it duplicates the current transaction handle, takes the unused >>> block reservation from it, grabs the log reservation ticket from it, >>> moves the defered ops from the old to the new transaction context, >>> then commits the old transaction context and returns the new one. >>> >>> tl;dr: a rolling transaction is a series of linked independent >>> transactions that share a common set of block and log reservations. >>> >>> To make a rolling transaction chain an atomic operation on a >>> specific object (e.g. an inode) that object has to remain locked and >>> be logged in every transaction in the chain of rolling transactions. >>> This keeps it moving forward in the log and prevents it from pinning >>> the tail of the log and causing log space deadlocks. >>> >>> Fundamentally, though, each individual transaction in a rolling >>> transaction is an independent atomic change set. So when you say >>> "roll the transaction", what you are actually saying is "commit the >>> current transaction and start a new transaction using the >>> reservations the current transaction already holds." >>> >>> When I say "transaction commit" I am only refering to the process >>> that closes off the current transaction. If this is in the middle of >>> a rolling transaction, then what I am talking about is >>> xfs_trans_roll() calling xfs_trans_commit() after it has duplicated >>> the current transaction context. >>> >>> Transaction contexts running defered operations, intent chains, etc >>> are *always* rolling transactions, and each time we roll the >>> transaction we commit it. >>> >>> IOWS, when I say "transaction commit" and you say "transaction roll" >>> we are talking about exactly the same thing - transaction commit is >>> the operation that roll performs to finish off the current change >>> set... >>> >>> Ideally, nobody should be calling xfs_trans_roll() directly anymore. >>> All rolling transactions should be implemented using deferred ops >>> nowdays - xfs_trans_roll() is the historic method of running rolling >>> transactions. e.g. see the recent rework of the attribute code to >>> use deferred ops rather than explicit calls to xfs_trans_roll() so >>> we can use intents for running attr operations. >>> >>> These days the transaction model is based around chains of deferred >>> operations. We attach deferred work to the transaction, and then >>> when we call xfs_trans_commit() it goes off into the deferred work >>> infrastructure that creates intents and manages the transaction >>> context for processing the intents itself. >>> >>> This is the primary reason we are trying to move away from high >>> level code using transaction rolling - we can't easily change the >>> way we process operations when the high level code controls >>> transaction contexts. Using deferred intent chains gives us fine >>> grained control over transaction context in the deferred ops >>> processing code - we can roll to a new transaction whenever we need >>> to.... >>> >> >> Above is really helpful. >> I when I mention transaction roll, I meant >> 1) a new transaction is created, but its self does’t reserve any resource. >> Instead, it inherits all the unused resources from the old transaction. >> 2) commit the old transaction. >> 3) don’t un-reserve unused resources from old transaction, the new transaction >> will inherits them. >> 4) use the new transaction for further log items. >> >> As I understand, the key is that the new transaction its self doesn’t reserve new resources. >> >> And when I read your “transaction commit” I understood it as this: >> 1). commit old transaction >> 2) un-reserve unused resources from old transaction >> 3) create new transaction with all resources reserved. >> >> Thus in my understand your “transaction commit” would have no risk of lack of log space to put >> the extra EFI/EFDs. But reading above, I’d think you meant “transaction roll” actually. >> >>>> One question is that if only one EFI is safe per transaction, how >>>> we ensure that there is only one EFI per transaction in case there >>>> are more than 16 (current XFS_EFI_MAX_FAST_EXTENTS) extents to >>>> free in current code? >>> >>> That will handled exactly the same way it does with your change - it >>> will simply split up the work items so there are multiple intents >>> logged. >> >> I’d like to make it more clear. You were saying that during log recover stage, there may be no enough >> log space for the extra EFI/EFD (splitted from original multiple extents EFI). But here (non-recovery stage), >> seems you have no concern logging more EFI/EFDs. So why they are different? >> >> thanks, >> wengang ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: IO time one extent per EFI 2023-04-24 15:53 ` Wengang Wang @ 2023-04-24 22:52 ` Wengang Wang 0 siblings, 0 replies; 16+ messages in thread From: Wengang Wang @ 2023-04-24 22:52 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs@vger.kernel.org > On Apr 24, 2023, at 8:53 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote: > > > >> On Apr 21, 2023, at 8:22 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote: >> >> >> >>> On Apr 21, 2023, at 11:23 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote: >>> >>> >>> >>>> On Apr 21, 2023, at 2:34 AM, Dave Chinner <david@fromorbit.com> wrote: >>>> >>>> On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote: >>>>>> On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@fromorbit.com> wrote: >>>>>> We don't do that anymore for partially processed multi-extent >>>>>> intents anymore. Instead, we use deferred ops to chain updates. i.e. >>>>>> we log a complete intent done items alongside a new intent >>>>>> containing the remaining work to be done in the same transaction. >>>>>> This cancels the original intent and atomically replaces it with a >>>>>> new intent containing the remaining work to be done. >>>>>> >>>>>> So when I say "update the EFD" I'm using historic terminology for >>>>>> processing and recovering multi-extent intents. In modern terms, >>>>>> what I mean is "update the deferred work intent chain to reflect the >>>>>> work remaining to be done". >>>>> >>>>> OK. so let’s see the difference between your implementation from mine. >>>>> Say, there are three extents to free. >>>>> >>>>> My patch would result in: >>>>> >>>>> EFI 1 with extent1 >>>>> free extent1 >>>>> EFD 1 with extent1 >>>>> transaction roll >>>>> EFI 2 with extent2 >>>>> free extent2 >>>>> EFD 2 with extent2 >>>>> transaction roll >>>>> EFI 3 with extent3 >>>>> free extent3 >>>>> EFD3 with extent3 >>>>> transaction commit >>>> >>>> No, it wouldn't. This isn't how the deferred work processes work >>>> items on the transaction. A work item with multiple extents on it >>>> would result in this: >>>> >>>> xfs_defer_finish(tp) # tp contains three xefi work items >>>> xfs_defer_finish_noroll >>>> xfs_defer_create_intents() >>>> list_for_each_defer_pending >>>> xfs_defer_create_intent(dfp) >>>> ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort); >>>> xfs_extent_free_create_intent() >>>> <create EFI> >>>> list_for_each_xefi >>>> xfs_extent_free_log_item(xefi) >>>> <adds extent to current EFI> >>>> >>>> xfs_defer_trans_roll() >>>> <save> >>>> xfs_trans_roll() >>>> xfs_trans_dup() >>>> xfs_trans_commit() >>>> <restore> >>>> >>>> At this point we have this committed to the CIL >>>> >>>> EFI 1 with extent1 >>>> EFI 2 with extent2 >>>> EFI 3 with extent3 >>>> >>>> And xfs_defer_finish_noroll() continues with >>>> >>>> <grabs first work item> >>>> xfs_defer_finish_one(dfp) >>>> ->create_done(EFI 1) >>>> xfs_extent_free_create_done >>>> <create EFD> >>>> list_for_each_xefi >>>> ops->finish_item(tp, dfp->dfp_done, li, &state); >>>> xfs_extent_free_finish_item() >>>> xfs_trans_free_extent >>>> __xfs_free_extent >>>> <adds extent to EFD> >>>> >>>> And once the processing of the single work item is done we loop >>>> back to the start of the xfs_defer_finish_noroll() loop. We don't >>>> have any new intents, so xfs_defer_create_intents() returns false, >>>> but we completed a dfp work item, so we run: >>>> >>>> xfs_defer_trans_roll() >>>> <save> >>>> xfs_trans_roll() >>>> xfs_trans_dup() >>>> xfs_trans_commit() >>>> <restore> >>>> >>>> At this point we have this committed to the CIL >>>> >>>> EFI 1 with extent1 >>>> EFI 2 with extent2 >>>> EFI 3 with extent3 >>>> <AGF, AGFL, free space btree block mods> >>>> EFD 1 with extent1 >>>> >>>> Then we run xfs_defer_finish_one() on EFI 2, commit, then run >>>> xfs_defer_finish_one() on EFI 3. At this point, we have in the log: >>>> >>>> EFI 1 with extent1 >>>> EFI 2 with extent2 >>>> EFI 3 with extent3 >>>> <AGF, AGFL, free space btree block mods> >>>> EFD 1 with extent1 >>>> <AGF, AGFL, free space btree block mods> >>>> EFD 2 with extent2 >>>> >>>> But we have not committed the final extent free or EFD 3 - that's in >>>> the last transaction context we pass back to the _xfs_trans_commit() >>>> context for it to finalise and close off the rolling transaction >>>> chain. At this point, we finally have this in the CIL: >>>> >>>> EFI 1 with extent1 >>>> EFI 2 with extent2 >>>> EFI 3 with extent3 >>>> <AGF, AGFL, free space btree block mods> >>>> EFD 1 with extent1 >>>> <AGF, AGFL, free space btree block mods> >>>> EFD 2 with extent2 >>>> <AGF, AGFL, free space btree block mods> >>>> EFD 3 with extent3 >>> >>> >>> Yes, correct. thanks for so much details! I mis-read xfs_defer_create_intents() >>> thinking it only create intent for the first in tp->t_dfops. >>> >>>> >>>>> The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint. >>>> >>>> I *always* ignore CIL intent whiteouts when thinking about >>>> transaction chains and intents. That is purely a journal efficiency >>>> optimisation, not something that is necessary for correct operation. >>> >>> OK. >>> >>>> >>>>> Your idea yields this: >>>>> >>>>> EFI 1 with extent1 extent2 extent3 >>>>> free extent1 >>>>> EFI 2 with extent2 extent3 >>>>> EFD 1 with extent1 extent2 extent3 >>>>> transaction commit >>>>> create transaction >>>>> free extent2 >>>>> EFI 3 with extent3 >>>>> EFD 2 with extent extent2 extent3 >>>>> transaction commit >>>>> create transaction >>>>> free extent3 >>>>> EFD 3 with extent3 >>>>> transaction commit. >>>> >>>> The EFI/EFD contents are correct, but the rest of it is not - I am >>>> not suggesting open coding transaction rolling like that. Everything >>>> I am suggesting works through the same defer ops mechanism as you >>>> are describing. So if we start with the initial journal contents >>>> looks like this: >>>> >>>> EFI 1 with extent1 extent2 extent3. >>>> >>>> Then we run xfs_defer_finish_one() on that work, >>>> >>>> xfs_defer_finish_one(dfp) >>>> ->create_done(EFI 1) >>>> xfs_extent_free_create_done >>>> <create EFD> >>>> list_for_each_xefi >>>> ops->finish_item(tp, dfp->dfp_done, li, &state); >>>> xfs_extent_free_finish_item() >>>> xfs_trans_free_extent >>>> __xfs_free_extent >>>> <adds extent to EFD> >>>> >>>> But now we have 3 xefis on the work to process. So on success of >>>> the first call to xfs_trans_free_extent(), we want it to return >>>> -EAGAIN to trigger the existing relogging code to create the new >>>> EFI. How this works is describe in the section "Requesting a >>>> Fresh Transaction while Finishing Deferred Work" in >>>> fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here. >>>> >>>> The result is that the deferred work infrastructure does the work >>>> of updaing the done intent and creating the new intents for the work >>>> remaining. Hence after the next transaction roll, we have in the CIL >>>> >>>> EFI 1 with extent1 extent2 extent3. >>>> <AGF, AGFL, free space btree block mods> >>>> EFD 1 with extent1 extent2 extent3. >>>> EFI 2 with extent2 extent3. >>>> >>> >>> Taking transaction rolls into account (also adding up to EFD3), above would be: >>> >>> EFI 1 with extent1 extent2 extent3. >>> transaction roll >>> <AGF, AGFL, free space btree block mods> for extent 1 >>> EFD 1 with extent1 extent2 extent3. >>> EFI 2 with extent2 extent3. >>> transaction roll >>> free extent 2 >>> EFD 2 with extent2 extent3 >>> EFI 3 with extent3 >>> transaction roll >>> free extent 3 >>> EFD 3 with extent3 >>> >>> Because EFD N is always be with EFI N+1 in the SAME transaction, so EFI N+1 doesn’t have to be placed before EFD N. >>> I got it. >>> >>>> And so the loop goes until there is no more work remaining. >>>> >>>>> Your implementation also includes three EFI/EFD pairs, that’s the same as mine. >>>>> So actually it’s still one extent per EFI per transaction. Though you are not changing >>>>> XFS_EFI_MAX_FAST_EXTENTS. >>>> >>>> The difference is that what I'm suggesting is that the extent free >>>> code can decide when it needs a transaction to be rolled. It isn't >>>> forced to always run a single free per transaction, it can decide >>>> that it can free multiple extents per transaction if there is no >>>> risk of deadlocks (e.g. extents are in different AGs). Forcing >>>> everything to be limited to a transaction per extent free even when >>>> there is no risk of deadlocks freeing multiple extents in a single >>>> transaction is unnecessary. >>>> >>>> Indeed, if the second extent is in a different AG, we don't risk >>>> busy extents causing us issues, so we could do: >>>> >>>> EFI 1 with extent1 extent2 extent3. >>>> <AGF 1, AGFL 1, free space btree block mods> >>>> <AGF 2, AGFL 2, free space btree block mods> >>>> EFD 1 with extent1 extent2 extent3. >>>> EFI 2 with extent3. >>>> ..... >>>> >>> >>> My thumb is up. >> >> Well, I am wondering if there is ABBA deadlock instead of self deadlock then. >> Say process 1 produced busy extent 1 on AG 0 and now blocking at AGFL >> allocation on AG 1. And at the same time, process 2 produced busy extent 2 on AG 1 >> and now blocking at AGFL allocation on AG 0. Anything prevents that from happening? >> >> Shall we introduce a per FS list, say named “pending_busy_AGs", where the AG number >> are stored there. For those AGs >> they have pending busy extents in in-memory transactions (not committed to CIL). >> We add the AG number to pending_busy_AGs at the start of xfs_trans_free_extent() >> if it’s not there, and continue to free the extent. Otherwise if the AG number is already >> there in pending_busy_AGs, we return -EAGAIN. The AG number is removed when >> the busy extents are committed to the xfs_cil_ctx. >> Well, wondering if the pending busy extents would remain in pending state long before >> committed to CIL and that become a bottle neck for freeing extents. > > Thinking more, we could add per AG counter indicating current pending busy extents on the AG. > For the 2nd+ extent in the EFI, return -EAGAIN on seeing the counter set (>0). As the first extent > is always OK to go, there shouldn’t be a bottle neck. I just sent the first attempt (this way), pls review. thanks, wengang ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents 2023-04-14 22:58 [PATCH 0/2] xfs: one extent per EFI Wengang Wang 2023-04-14 22:58 ` [PATCH 1/2] xfs: IO time " Wengang Wang @ 2023-04-14 22:58 ` Wengang Wang 2023-04-20 0:30 ` Dave Chinner 1 sibling, 1 reply; 16+ messages in thread From: Wengang Wang @ 2023-04-14 22:58 UTC (permalink / raw) To: linux-xfs; +Cc: wen.gang.wang At log recovery stage, we need to split EFIs with multiple extents. For each orginal multiple-extent EFI, split it into new EFIs each including one extent from the original EFI. By that we avoid deadlock when allocating blocks for AGFL waiting for the held busy extents by current transaction to be flushed. For the original EFI, the process is 1. Create and log new EFIs each covering one extent from the original EFI. 2. Don't free extent with the original EFI. 3. Log EFD for the original EFI. Make sure we log the new EFIs and original EFD in this order: new EFI 1 new EFI 2 ... new EFI N original EFD The original extents are freed with the new EFIs. The example log items: rbbn 41572 rec_lsn: 1638833,41568 Oper 18: tid: d746ea5d len: 48 flags: None EFI nextents:2 id:ffff8b10b5a13c28 --> orginal EFI EFI id=ffff8b10b5a13c28 (0x5de4c42, 256) EFI id=ffff8b10b5a13c28 (0x5de4942, 256) rbbn 39041 rec_lsn: 1638834,39040 Oper 2: tid: 4e651c99 len: 32 flags: None EFI nextents:1 id:ffff9fef39f4c528 --> new EFI 1 EFI id=ffff9fef39f4c528 (0x5de4c42, 256) ----------------------------------------------------------------------------- rbbn 39041 rec_lsn: 1638834,39040 Oper 3: tid: 4e651c99 len: 32 flags: None EFI nextents:1 id:ffff9fef39f4f548 --> new EFI 2 EFI id=ffff9fef39f4f548 (0x5de4942, 256) ----------------------------------------------------------------------------- rbbn 39041 rec_lsn: 1638834,39040 Oper 4: tid: 4e651c99 len: 48 flags: None EFD nextents:2 id:ffff8b10b5a13c28 --> EFD to original EFI EFD id=ffff8b10b5a13c28 (0x5de4c42, 256) EFD id=ffff8b10b5a13c28 (0x5de4942, 256) ----------------------------------------------------------------------------- rbbn 39041 rec_lsn: 1638834,39040 Oper 5: tid: 4e651c99 len: 32 flags: None EFD nextents:1 id:ffff9fef39f4c528 --> EFD to new EFI 1 EFD id=ffff9fef39f4c528 (0x5de4c42, 256) ...... rbbn 39057 rec_lsn: 1638834,39056 Oper 2: tid: e3264681 len: 32 flags: None EFD nextents:1 id:ffff9fef39f4f548 --> EFD to new EFI 2 EFD id=ffff9fef39f4f548 (0x5de4942, 256) Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> --- fs/xfs/xfs_extfree_item.c | 104 ++++++++++++++++++++++++++++++++++---- 1 file changed, 93 insertions(+), 11 deletions(-) diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c index 011b50469301..b00b44234397 100644 --- a/fs/xfs/xfs_extfree_item.c +++ b/fs/xfs/xfs_extfree_item.c @@ -595,7 +595,11 @@ xfs_efi_item_recover( struct list_head *capture_list) { struct xfs_efi_log_item *efip = EFI_ITEM(lip); + int nr_ext = efip->efi_format.efi_nextents; struct xfs_mount *mp = lip->li_log->l_mp; + struct xfs_efi_log_item **new_efis, *new_efip; + struct xfs_efd_log_item *new_efdp; + struct xfs_extent_free_item fake; struct xfs_efd_log_item *efdp; struct xfs_trans *tp; int i; @@ -606,7 +610,7 @@ xfs_efi_item_recover( * EFI. If any are bad, then assume that all are bad and * just toss the EFI. */ - for (i = 0; i < efip->efi_format.efi_nextents; i++) { + for (i = 0; i < nr_ext; i++) { if (!xfs_efi_validate_ext(mp, &efip->efi_format.efi_extents[i])) { XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, @@ -619,28 +623,106 @@ xfs_efi_item_recover( error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp); if (error) return error; - efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents); - for (i = 0; i < efip->efi_format.efi_nextents; i++) { - struct xfs_extent_free_item fake = { - .xefi_owner = XFS_RMAP_OWN_UNKNOWN, - }; + memset(&fake, 0, sizeof(fake)); + fake.xefi_owner = XFS_RMAP_OWN_UNKNOWN; + + if (nr_ext <= 1) { + efdp = xfs_trans_get_efd(tp, efip, + efip->efi_format.efi_nextents); + + for (i = 0; i < efip->efi_format.efi_nextents; i++) { + struct xfs_extent *extp; + + extp = &efip->efi_format.efi_extents[i]; + + fake.xefi_startblock = extp->ext_start; + fake.xefi_blockcount = extp->ext_len; + + error = xfs_trans_free_extent(tp, efdp, &fake); + if (error == -EFSCORRUPTED) + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, + extp, sizeof(*extp)); + if (error) + goto abort_error; + + } + + return xfs_defer_ops_capture_and_commit(tp, capture_list); + } + + /* + * Log recovery stage, we need to split a EFI into new EFIs if the + * original EFI includes more than one extents. Check the change of + * XFS_EFI_MAX_FAST_EXTENTS for the reason. + * For the original EFI, the process is + * 1. Create and log new EFIs each covering one extent from the + * original EFI. + * 2. Don't free extent with the original EFI. + * 3. Log EFD for the original EFI. + * Make sure we log the new EFIs and original EFD in this order: + * new EFI 1 + * new EFI 2 + * ... + * new EFI N + * original EFD + * The original extents are freed with the new EFIs. + */ + new_efis = kmem_zalloc(sizeof(*new_efis) * nr_ext, 0); + if (!new_efis) { + error = -ENOMEM; + goto abort_error; + } + for (i = 0; i < nr_ext; i++) { struct xfs_extent *extp; + new_efip = xfs_efi_init(mp, 1); extp = &efip->efi_format.efi_extents[i]; fake.xefi_startblock = extp->ext_start; fake.xefi_blockcount = extp->ext_len; + xfs_trans_add_item(tp, &new_efip->efi_item); + xfs_extent_free_log_item(tp, new_efip, &fake); + new_efis[i] = new_efip; + } + + /* + * The new EFIs are in transaction now, add original EFD with + * full extents. + */ + efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents); + set_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags); + efdp->efd_next_extent = nr_ext; + for (i = 0; i < nr_ext; i++) + efdp->efd_format.efd_extents[i] = + efip->efi_format.efi_extents[i]; - error = xfs_trans_free_extent(tp, efdp, &fake); + /* + * Now process the new EFIs. + * Current transaction is a new one, there are no defered + * works attached. It's safe to use the following first + * xfs_trans_roll() to commit it. + */ + for (i = 0; i < nr_ext; i++) { + struct xfs_extent *extp; + + new_efip = new_efis[i]; + new_efdp = xfs_trans_get_efd(tp, new_efip, 1); + extp = &new_efip->efi_format.efi_extents[0]; + fake.xefi_startblock = extp->ext_start; + fake.xefi_blockcount = extp->ext_len; + error = xfs_trans_free_extent(tp, new_efdp, &fake); if (error == -EFSCORRUPTED) XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, - extp, sizeof(*extp)); - if (error) + extp, sizeof(*extp)); + if (!error) + error = xfs_trans_roll(&tp); + if (error) { + kmem_free(new_efis); goto abort_error; - + } } - + kmem_free(new_efis); return xfs_defer_ops_capture_and_commit(tp, capture_list); abort_error: -- 2.21.0 (Apple Git-122.2) ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents 2023-04-14 22:58 ` [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents Wengang Wang @ 2023-04-20 0:30 ` Dave Chinner 2023-04-20 17:10 ` Wengang Wang 0 siblings, 1 reply; 16+ messages in thread From: Dave Chinner @ 2023-04-20 0:30 UTC (permalink / raw) To: Wengang Wang; +Cc: linux-xfs On Fri, Apr 14, 2023 at 03:58:36PM -0700, Wengang Wang wrote: > At log recovery stage, we need to split EFIs with multiple extents. For each > orginal multiple-extent EFI, split it into new EFIs each including one extent > from the original EFI. By that we avoid deadlock when allocating blocks for > AGFL waiting for the held busy extents by current transaction to be flushed. > > For the original EFI, the process is > 1. Create and log new EFIs each covering one extent from the > original EFI. > 2. Don't free extent with the original EFI. > 3. Log EFD for the original EFI. > Make sure we log the new EFIs and original EFD in this order: > new EFI 1 > new EFI 2 > ... > new EFI N > original EFD > The original extents are freed with the new EFIs. We may not have the log space available during recovery to explode a single EFI out into many EFIs like this. The EFI only had enough space reserved for processing a single EFI, and exploding a single EFI out like this requires an individual log reservation for each new EFI. Hence this de-multiplexing process risks running out of log space and deadlocking before we've been able to process anything. Hence the only option we really have here is to replicate how CUIs are handled. We must process the first extent with a whole EFD and a new EFI containing the remaining unprocessed extents as defered operations. i.e. 1. free the first extent in the original EFI 2. log an EFD for the original EFI 3. Add all the remaining extents in the original EFI to an xefi chain 4. Call xfs_defer_ops_capture_and_commit() to create a new EFI from the xefi chain and commit the current transaction. xfs_defer_ops_capture_and_commit() will then add a work item to the defered list which will come back to the new EFI and process it through the normal runtime deferred ops intent processing path. The first patch changed that path to only create intents with a single extent, so the continued defer ops would then do the right thing with that change in place. However, I think that we also need the runtime code to process a single extent per intent per commit in the same manner as above. i.e. we process the first extent in the intent, then relog all the remaining unprocessed extents as a single new intent. Note that this is similar to how we already relog intents to roll them forward in the journal. The only difference for single extent processing is that an intent relog duplicates the entire extent list in the EFD and the new EFI, whilst what we want is the new EFI to contain all the extents except the one we just processed... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents 2023-04-20 0:30 ` Dave Chinner @ 2023-04-20 17:10 ` Wengang Wang 2023-04-20 22:54 ` Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Wengang Wang @ 2023-04-20 17:10 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs@vger.kernel.org > On Apr 19, 2023, at 5:30 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Apr 14, 2023 at 03:58:36PM -0700, Wengang Wang wrote: >> At log recovery stage, we need to split EFIs with multiple extents. For each >> orginal multiple-extent EFI, split it into new EFIs each including one extent >> from the original EFI. By that we avoid deadlock when allocating blocks for >> AGFL waiting for the held busy extents by current transaction to be flushed. >> >> For the original EFI, the process is >> 1. Create and log new EFIs each covering one extent from the >> original EFI. >> 2. Don't free extent with the original EFI. >> 3. Log EFD for the original EFI. >> Make sure we log the new EFIs and original EFD in this order: >> new EFI 1 >> new EFI 2 >> ... >> new EFI N >> original EFD >> The original extents are freed with the new EFIs. > > We may not have the log space available during recovery to explode a > single EFI out into many EFIs like this. The EFI only had enough > space reserved for processing a single EFI, and exploding a single > EFI out like this requires an individual log reservation for each > new EFI. Hence this de-multiplexing process risks running out of log > space and deadlocking before we've been able to process anything. > Oh, yes, got it. > Hence the only option we really have here is to replicate how CUIs > are handled. We must process the first extent with a whole EFD and > a new EFI containing the remaining unprocessed extents as defered > operations. i.e. > > 1. free the first extent in the original EFI > 2. log an EFD for the original EFI > 3. Add all the remaining extents in the original EFI to an xefi chain > 4. Call xfs_defer_ops_capture_and_commit() to create a new EFI from > the xefi chain and commit the current transaction. > > xfs_defer_ops_capture_and_commit() will then add a work item to the > defered list which will come back to the new EFI and process it > through the normal runtime deferred ops intent processing path. > So you meant this? Orig EFI with extent1 extent2 extent3 free first extent1 Full EFD to orig EFI transaction roll, xfs_defer_ops_capture_and_commit() to take care of extent2 and extent3 If so, I don’t think it’s safe. Consider that case that kernel panic happened after the transaction roll, during next log replay, the original EFI has the matching EFD, so this EFI is ignored, but actually extent2 and extent3 are not freed. If you didn’t mean above, but instead this: Orig EFI with extent1 extent2 extent3 free first extent1 New EFI extent2 extent3 Full EFD to orig EFI transaction roll, xfs_defer_ops_capture_and_commit() to take care of extent2 and extent3 The problem will comeback to the log space issue, are we ensured we have the space for the new EFI? > The first patch changed that path to only create intents with a > single extent, so the continued defer ops would then do the right > thing with that change in place. However, I think that we also need > the runtime code to process a single extent per intent per commit in > the same manner as above. i.e. we process the first extent in the > intent, then relog all the remaining unprocessed extents as a single > new intent. > > Note that this is similar to how we already relog intents to roll > them forward in the journal. The only difference for single extent > processing is that an intent relog duplicates the entire extent list > in the EFD and the new EFI, whilst what we want is the new EFI to > contain all the extents except the one we just processed... > The problem to me is that where we place the new EFI, it can’t be after the EFD. I explained why above. thanks, wengang > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents 2023-04-20 17:10 ` Wengang Wang @ 2023-04-20 22:54 ` Dave Chinner 2023-04-21 0:32 ` Wengang Wang 0 siblings, 1 reply; 16+ messages in thread From: Dave Chinner @ 2023-04-20 22:54 UTC (permalink / raw) To: Wengang Wang; +Cc: linux-xfs@vger.kernel.org On Thu, Apr 20, 2023 at 05:10:42PM +0000, Wengang Wang wrote: > > > > On Apr 19, 2023, at 5:30 PM, Dave Chinner <david@fromorbit.com> wrote: > > > > On Fri, Apr 14, 2023 at 03:58:36PM -0700, Wengang Wang wrote: > >> At log recovery stage, we need to split EFIs with multiple extents. For each > >> orginal multiple-extent EFI, split it into new EFIs each including one extent > >> from the original EFI. By that we avoid deadlock when allocating blocks for > >> AGFL waiting for the held busy extents by current transaction to be flushed. > >> > >> For the original EFI, the process is > >> 1. Create and log new EFIs each covering one extent from the > >> original EFI. > >> 2. Don't free extent with the original EFI. > >> 3. Log EFD for the original EFI. > >> Make sure we log the new EFIs and original EFD in this order: > >> new EFI 1 > >> new EFI 2 > >> ... > >> new EFI N > >> original EFD > >> The original extents are freed with the new EFIs. > > > > We may not have the log space available during recovery to explode a > > single EFI out into many EFIs like this. The EFI only had enough > > space reserved for processing a single EFI, and exploding a single > > EFI out like this requires an individual log reservation for each > > new EFI. Hence this de-multiplexing process risks running out of log > > space and deadlocking before we've been able to process anything. > > > > Oh, yes, got it. > > > Hence the only option we really have here is to replicate how CUIs > > are handled. We must process the first extent with a whole EFD and > > a new EFI containing the remaining unprocessed extents as defered > > operations. i.e. > > > > 1. free the first extent in the original EFI > > 2. log an EFD for the original EFI > > 3. Add all the remaining extents in the original EFI to an xefi chain > > 4. Call xfs_defer_ops_capture_and_commit() to create a new EFI from > > the xefi chain and commit the current transaction. > > > > xfs_defer_ops_capture_and_commit() will then add a work item to the > > defered list which will come back to the new EFI and process it > > through the normal runtime deferred ops intent processing path. > > > > So you meant this? > > Orig EFI with extent1 extent2 extent3 > free first extent1 > Full EFD to orig EFI > transaction roll, > xfs_defer_ops_capture_and_commit() to take care of extent2 and extent3 No. We do not need a transaction roll there if we rebuild a new xefi list with the remaining extents from the original efi. At that point, we call: <create transaction> <free first extent> <create new xefi list> <create and log EFD> xfs_defer_ops_capture_and_commit() xfs_defer_ops_capture() xfs_defer_create_intents() for each tp->t_dfops ->create intent xfs_extent_free_create_intent() create new EFI walk each xefi and add it to the new intent <captures remaining defered work> xfs_trans_commit() i.e. xfs_defer_ops_capture_and_commit() can builds new EFI for us from the xefi list as part of defering the work that remains to be done. Once it has done that, it queues the remaining work and commits the transaction. Hence all we need to do in recovery of the first extent is free it, create the xefi list and log the full EFD. xfs_defer_ops_capture_and_commit() does the rest. > If so, I don’t think it’s safe. > Consider that case that kernel panic happened after the transaction roll, > during next log replay, the original EFI has the matching EFD, so this EFI > is ignored, but actually extent2 and extent3 are not freed. > > If you didn’t mean above, but instead this: > > Orig EFI with extent1 extent2 extent3 > free first extent1 > New EFI extent2 extent3 > Full EFD to orig EFI > transaction roll, > xfs_defer_ops_capture_and_commit() to take care of extent2 and extent3 > > The problem will comeback to the log space issue, are we ensured we have > the space for the new EFI? Yes, because logging the full EFD cancels the original EFI and so it no longer consumes log space. hence we can log a new EFI using the space the original EFI consumed. > > The first patch changed that path to only create intents with a > > single extent, so the continued defer ops would then do the right > > thing with that change in place. However, I think that we also need > > the runtime code to process a single extent per intent per commit in > > the same manner as above. i.e. we process the first extent in the > > intent, then relog all the remaining unprocessed extents as a single > > new intent. > > > > Note that this is similar to how we already relog intents to roll > > them forward in the journal. The only difference for single extent > > processing is that an intent relog duplicates the entire extent list > > in the EFD and the new EFI, whilst what we want is the new EFI to > > contain all the extents except the one we just processed... > > > > The problem to me is that where we place the new EFI, it can’t be after the EFD. > I explained why above. Yes it can. The only thing that matters is that the EFD and new EFI are committed in the same transaction. Remember: transactions are atomic change sets - either all the changes in the transaction are replayed on recovery, or none of them are. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents 2023-04-20 22:54 ` Dave Chinner @ 2023-04-21 0:32 ` Wengang Wang 0 siblings, 0 replies; 16+ messages in thread From: Wengang Wang @ 2023-04-21 0:32 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs@vger.kernel.org > On Apr 20, 2023, at 3:54 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Apr 20, 2023 at 05:10:42PM +0000, Wengang Wang wrote: >> >> >>> On Apr 19, 2023, at 5:30 PM, Dave Chinner <david@fromorbit.com> wrote: >>> >>> On Fri, Apr 14, 2023 at 03:58:36PM -0700, Wengang Wang wrote: >>>> At log recovery stage, we need to split EFIs with multiple extents. For each >>>> orginal multiple-extent EFI, split it into new EFIs each including one extent >>>> from the original EFI. By that we avoid deadlock when allocating blocks for >>>> AGFL waiting for the held busy extents by current transaction to be flushed. >>>> >>>> For the original EFI, the process is >>>> 1. Create and log new EFIs each covering one extent from the >>>> original EFI. >>>> 2. Don't free extent with the original EFI. >>>> 3. Log EFD for the original EFI. >>>> Make sure we log the new EFIs and original EFD in this order: >>>> new EFI 1 >>>> new EFI 2 >>>> ... >>>> new EFI N >>>> original EFD >>>> The original extents are freed with the new EFIs. >>> >>> We may not have the log space available during recovery to explode a >>> single EFI out into many EFIs like this. The EFI only had enough >>> space reserved for processing a single EFI, and exploding a single >>> EFI out like this requires an individual log reservation for each >>> new EFI. Hence this de-multiplexing process risks running out of log >>> space and deadlocking before we've been able to process anything. >>> >> >> Oh, yes, got it. >> >>> Hence the only option we really have here is to replicate how CUIs >>> are handled. We must process the first extent with a whole EFD and >>> a new EFI containing the remaining unprocessed extents as defered >>> operations. i.e. >>> >>> 1. free the first extent in the original EFI >>> 2. log an EFD for the original EFI >>> 3. Add all the remaining extents in the original EFI to an xefi chain >>> 4. Call xfs_defer_ops_capture_and_commit() to create a new EFI from >>> the xefi chain and commit the current transaction. >>> >>> xfs_defer_ops_capture_and_commit() will then add a work item to the >>> defered list which will come back to the new EFI and process it >>> through the normal runtime deferred ops intent processing path. >>> >> >> So you meant this? >> >> Orig EFI with extent1 extent2 extent3 >> free first extent1 >> Full EFD to orig EFI >> transaction roll, >> xfs_defer_ops_capture_and_commit() to take care of extent2 and extent3 > > No. We do not need a transaction roll there if we rebuild a new > xefi list with the remaining extents from the original efi. At that > point, we call: > > <create transaction> > <free first extent> > <create new xefi list> > <create and log EFD> > xfs_defer_ops_capture_and_commit() > xfs_defer_ops_capture() > xfs_defer_create_intents() > for each tp->t_dfops > ->create intent > xfs_extent_free_create_intent() > create new EFI > walk each xefi and add it to the new intent > <captures remaining defered work> > xfs_trans_commit() > > i.e. xfs_defer_ops_capture_and_commit() can builds new EFI for us > from the xefi list as part of defering the work that remains to be > done. Once it has done that, it queues the remaining work and > commits the transaction. Hence all we need to do in recovery of the > first extent is free it, create the xefi list and log the full EFD. > xfs_defer_ops_capture_and_commit() does the rest. Yes, xfs_defer_ops_capture_and_commit will commit the transaction so we don’t need a roll. > >> If so, I don’t think it’s safe. >> Consider that case that kernel panic happened after the transaction roll, >> during next log replay, the original EFI has the matching EFD, so this EFI >> is ignored, but actually extent2 and extent3 are not freed. >> >> If you didn’t mean above, but instead this: >> >> Orig EFI with extent1 extent2 extent3 >> free first extent1 >> New EFI extent2 extent3 >> Full EFD to orig EFI >> transaction roll, >> xfs_defer_ops_capture_and_commit() to take care of extent2 and extent3 >> >> The problem will comeback to the log space issue, are we ensured we have >> the space for the new EFI? > > Yes, because logging the full EFD cancels the original EFI and so it > no longer consumes log space. hence we can log a new EFI using the > space the original EFI consumed. > Make sense. >>> The first patch changed that path to only create intents with a >>> single extent, so the continued defer ops would then do the right >>> thing with that change in place. However, I think that we also need >>> the runtime code to process a single extent per intent per commit in >>> the same manner as above. i.e. we process the first extent in the >>> intent, then relog all the remaining unprocessed extents as a single >>> new intent. >>> >>> Note that this is similar to how we already relog intents to roll >>> them forward in the journal. The only difference for single extent >>> processing is that an intent relog duplicates the entire extent list >>> in the EFD and the new EFI, whilst what we want is the new EFI to >>> contain all the extents except the one we just processed... >>> >> >> The problem to me is that where we place the new EFI, it can’t be after the EFD. >> I explained why above. > > Yes it can. The only thing that matters is that the EFD and new EFI > are committed in the same transaction. Remember: transactions are > atomic change sets - either all the changes in the transaction are > replayed on recovery, or none of them are. Yes, will try this way. thanks, wengang ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2023-04-24 22:52 UTC | newest] Thread overview: 16+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-04-14 22:58 [PATCH 0/2] xfs: one extent per EFI Wengang Wang 2023-04-14 22:58 ` [PATCH 1/2] xfs: IO time " Wengang Wang 2023-04-19 23:55 ` Dave Chinner 2023-04-20 17:31 ` Wengang Wang 2023-04-20 23:22 ` Dave Chinner 2023-04-21 0:24 ` Wengang Wang 2023-04-21 9:34 ` Dave Chinner 2023-04-21 18:23 ` Wengang Wang 2023-04-22 3:22 ` Wengang Wang 2023-04-24 15:53 ` Wengang Wang 2023-04-24 22:52 ` Wengang Wang 2023-04-14 22:58 ` [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents Wengang Wang 2023-04-20 0:30 ` Dave Chinner 2023-04-20 17:10 ` Wengang Wang 2023-04-20 22:54 ` Dave Chinner 2023-04-21 0:32 ` Wengang Wang
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox