* [PATCH 0/3] [RFC] xfs: intent recovery reservation changes
@ 2020-09-09 8:19 Dave Chinner
2020-09-09 8:19 ` [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation Dave Chinner
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Dave Chinner @ 2020-09-09 8:19 UTC (permalink / raw)
To: linux-xfs
Hi folks,
A recent bug report of a log recovery hang led to an analysis of a
metadump image that indicates that an intent can pin the tail of the
log and leave less space in the log than is require to begin
recovery of that intent. In this case, it was an EFI that pinned the
log tail, and the transaction reservation required to replay the EFI
was larger than the remaining free space in the log.
Long story short: the EFI recovery log space reservation is larger
than the log space reservation required to complete the operation at
runtime.
This appears to be a general problem with intents - the reservation
to replay the intent is the reservation of the entire permanent
transaction reservation that logged the intent, not the amount of
log space that is required to _execute_ the intent and record the
intent done item.
This patch series does not attempt to solve this generic problem,
just address the specific EFI reservation for a single EFI that pins
the tail of the log. If deferops and rolling transactions are
required, then recovery of multiple EFIs that all pin the tail of
the log at the same LSN becomes very complex, very fast. I have no clear idea
how to deal with that problem yet, but for the single case where the
extent free (and subsequent defer ops for refcnt and rmap mods) all
complete within the initial EFI reservation, the solution is
relatively straight forward.
That is, EFI recovery is an itruncate reservation minus the unit
reservation for the initial transaction that logs the EFI. At
runtime, this log space will be reserved by both the reserve and
write grant heads, hence we are guaranteed to have at least that
amount of space free in the log when we start recoverying the EFI.
For a single EFI pinning the tail of the log, once we free the
extent and log the EFD, the log tail is unpinned and the rest of
defered ops can be run without issue.
If there are multiple log tail pinning intents, then we cannot allow
the defered ops in an intent chain to roll once they run out of
initial recovery grant space. If they have to reserve more log space
rather than just regrant from the initial reservation, then they are
stealing log space from other intents that may pin the tail of the
log and need that space to guarantee that they can be replayed to
unpin the tail of the log. i.e. it gets complex real fast.
Hence this patchset does not attempt to solve the generic problem,
just the simple one-off EFI case. THe first patch is a back-portable
"minimal fix" for the problem, the second patch creates a formal
reservation construct for freeing a single extent to the AG free
list and converts the EFI recovery reservation to use it, and the
third patch factors all the existing reservation calculations to use
the free extent reservation instead of open coding it everywhere.
This last patch fixes some inconsistencies, too, in that some
extent alloc/free reservations fail to take into account AGFL
modifications. I also noted some concerns about the number of
extent alloc/free operations that certain transactions reserve space
for - if they are too large, then we should be able to reduce the
size of some of the large transactions significantly.
But doing any of that requires much, much more investigation than I
can do in a couple of hours, so if anyone is looking for something
they can sink a bit of time into, this would be a good thing to look
at.....
Anyway, this has run through fstests with reflink+rmap enabled
without regressions, so the patchset is not an obvious disaster.
Comments, thoughts, ideas all welcome...
Cheers,
Dave.
Dave Chinner (3):
xfs: EFI recovery needs it's own transaction reservation
xfs: add a free space extent change reservation
xfs: factor free space tree transaciton reservations
fs/xfs/libxfs/xfs_trans_resv.c | 143 ++++++++++++++++-----------------
fs/xfs/libxfs/xfs_trans_resv.h | 2 +
fs/xfs/xfs_extfree_item.c | 2 +-
3 files changed, 74 insertions(+), 73 deletions(-)
--
2.28.0
^ permalink raw reply [flat|nested] 12+ messages in thread* [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation 2020-09-09 8:19 [PATCH 0/3] [RFC] xfs: intent recovery reservation changes Dave Chinner @ 2020-09-09 8:19 ` Dave Chinner 2020-09-09 13:31 ` Brian Foster 2020-09-09 8:19 ` [PATCH 2/3] xfs: add a free space extent change reservation Dave Chinner 2020-09-09 8:19 ` [PATCH 3/3] xfs: factor free space tree transaciton reservations Dave Chinner 2 siblings, 1 reply; 12+ messages in thread From: Dave Chinner @ 2020-09-09 8:19 UTC (permalink / raw) To: linux-xfs From: Dave Chinner <dchinner@redhat.com> Recovering an EFI currently uses a itruncate reservation, which is designed for a rolling transaction that modifies the BMBT and logs the EFI in one commit, then frees the space and logs the EFD in the second commit. Recovering the EFI only requires the second transaction in this pair, and hence has a smaller log space requirement than a truncate operation. Hence when the extent free is being processed at runtime, the log reservation that is held by the filesystem is only enough to complete the extent free, not the entire truncate operation. Hence if the EFI pins the tail of the log and the log fills up while the extent is being freed, the amount of reserved free space in the log is not enough to start another entire truncate operation. Hence if we crash at this point, log recovery will deadlock with the EFI pinning the tail of the log and the log not having enough free space to reserve an itruncate transaction. As such, EFI recovery needs it's own log space reservation separate to the itruncate reservation. We only need what is required free the extent, and this matches the space we have reserved at runtime for this operation and hence should prevent the recovery deadlock from occurring. This patch adds the new reservation in a way that minimises the change such that it should be back-portable to older kernels easily. Follow up patches will factor and rework the reservations to be more correct and more tightly defined. Note: this would appear to be a generic problem with intent recovery; we use the entire operation reservation for recovery, not the reservation that was held at runtime after the intent was logged. I suspect all intents are going to require their own reservation as a result. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/libxfs/xfs_trans_resv.c | 10 ++++++++++ fs/xfs/libxfs/xfs_trans_resv.h | 2 ++ fs/xfs/xfs_extfree_item.c | 2 +- 3 files changed, 13 insertions(+), 1 deletion(-) diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c index d1a0848cb52e..da2ec052ac0a 100644 --- a/fs/xfs/libxfs/xfs_trans_resv.c +++ b/fs/xfs/libxfs/xfs_trans_resv.c @@ -916,6 +916,16 @@ xfs_trans_resv_calc( resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT; resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES; + /* + * Log recovery reservations for intent replay + * + * EFI recovery is itruncate minus the initial transaction that logs + * logs the EFI. + */ + resp->tr_efi.tr_logres = resp->tr_itruncate.tr_logres; + resp->tr_efi.tr_logcount = resp->tr_itruncate.tr_logcount - 1; + resp->tr_efi.tr_logflags |= XFS_TRANS_PERM_LOG_RES; + /* * The following transactions are logged in logical format with * a default log count. diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h index 7241ab28cf84..13173b3eaac9 100644 --- a/fs/xfs/libxfs/xfs_trans_resv.h +++ b/fs/xfs/libxfs/xfs_trans_resv.h @@ -50,6 +50,8 @@ struct xfs_trans_resv { struct xfs_trans_res tr_qm_equotaoff;/* end of turn quota off */ struct xfs_trans_res tr_sb; /* modify superblock */ struct xfs_trans_res tr_fsyncts; /* update timestamps on fsync */ + struct xfs_trans_res tr_efi; /* EFI log item recovery */ + }; /* shorthand way of accessing reservation structure */ diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c index 6cb8cd11072a..1ea9ab4cd44e 100644 --- a/fs/xfs/xfs_extfree_item.c +++ b/fs/xfs/xfs_extfree_item.c @@ -618,7 +618,7 @@ xfs_efi_item_recover( } } - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp); + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_efi, 0, 0, 0, &tp); if (error) return error; efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents); -- 2.28.0 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation 2020-09-09 8:19 ` [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation Dave Chinner @ 2020-09-09 13:31 ` Brian Foster 2020-09-09 21:44 ` Dave Chinner 0 siblings, 1 reply; 12+ messages in thread From: Brian Foster @ 2020-09-09 13:31 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Wed, Sep 09, 2020 at 06:19:10PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > Recovering an EFI currently uses a itruncate reservation, which is > designed for a rolling transaction that modifies the BMBT and > logs the EFI in one commit, then frees the space and logs the EFD in > the second commit. > > Recovering the EFI only requires the second transaction in this > pair, and hence has a smaller log space requirement than a truncate > operation. Hence when the extent free is being processed at runtime, > the log reservation that is held by the filesystem is only enough to > complete the extent free, not the entire truncate operation. > > Hence if the EFI pins the tail of the log and the log fills up while > the extent is being freed, the amount of reserved free space in the > log is not enough to start another entire truncate operation. Hence > if we crash at this point, log recovery will deadlock with the EFI > pinning the tail of the log and the log not having enough free space > to reserve an itruncate transaction. > > As such, EFI recovery needs it's own log space reservation separate > to the itruncate reservation. We only need what is required free the > extent, and this matches the space we have reserved at runtime for > this operation and hence should prevent the recovery deadlock from > occurring. > > This patch adds the new reservation in a way that minimises the > change such that it should be back-portable to older kernels easily. > Follow up patches will factor and rework the reservations to be more > correct and more tightly defined. > > Note: this would appear to be a generic problem with intent > recovery; we use the entire operation reservation for recovery, > not the reservation that was held at runtime after the intent was > logged. I suspect all intents are going to require their own > reservation as a result. > It might be worth explicitly pointing out that support for EFI/EFD intents goes farther back than the various intents associated with newer features, hence the value of a targeted fix. Otherwise the problem description makes sense and approach seems reasonable to me. Thanks for the writeup. Some questions... > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/libxfs/xfs_trans_resv.c | 10 ++++++++++ > fs/xfs/libxfs/xfs_trans_resv.h | 2 ++ > fs/xfs/xfs_extfree_item.c | 2 +- > 3 files changed, 13 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c > index d1a0848cb52e..da2ec052ac0a 100644 > --- a/fs/xfs/libxfs/xfs_trans_resv.c > +++ b/fs/xfs/libxfs/xfs_trans_resv.c > @@ -916,6 +916,16 @@ xfs_trans_resv_calc( > resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT; > resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES; > > + /* > + * Log recovery reservations for intent replay > + * > + * EFI recovery is itruncate minus the initial transaction that logs > + * logs the EFI. > + */ > + resp->tr_efi.tr_logres = resp->tr_itruncate.tr_logres; > + resp->tr_efi.tr_logcount = resp->tr_itruncate.tr_logcount - 1; tr_itruncate.tr_logcount looks like it's either 2 or 8 depending on whether reflink is enabled. On one hand this seems conservative enough, but do we know exactly what those extra unit counts are accounted for in the reflink case? It looks like extents are only freed when the last reference is dropped (otherwise we log a refcount intent), which makes me wonder whether we really need 7 log count units if recovery encounters an EFI. Also, while looking through the code I noticed that truncate does the following: ... error = xfs_defer_finish(&tp); if (error) goto out; error = xfs_trans_roll_inode(&tp, ip); if (error) goto out; ... ... which looks like it rolls the transaction an extra time per-extent. I don't think that contributes to this problem vs just being a runtime inefficiency, so maybe I'll fling a patch up for that separately. > + resp->tr_efi.tr_logflags |= XFS_TRANS_PERM_LOG_RES; > + > /* > * The following transactions are logged in logical format with > * a default log count. > diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h > index 7241ab28cf84..13173b3eaac9 100644 > --- a/fs/xfs/libxfs/xfs_trans_resv.h > +++ b/fs/xfs/libxfs/xfs_trans_resv.h > @@ -50,6 +50,8 @@ struct xfs_trans_resv { > struct xfs_trans_res tr_qm_equotaoff;/* end of turn quota off */ > struct xfs_trans_res tr_sb; /* modify superblock */ > struct xfs_trans_res tr_fsyncts; /* update timestamps on fsync */ > + struct xfs_trans_res tr_efi; /* EFI log item recovery */ > + Extra whitespace line. Brian > }; > > /* shorthand way of accessing reservation structure */ > diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c > index 6cb8cd11072a..1ea9ab4cd44e 100644 > --- a/fs/xfs/xfs_extfree_item.c > +++ b/fs/xfs/xfs_extfree_item.c > @@ -618,7 +618,7 @@ xfs_efi_item_recover( > } > } > > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp); > + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_efi, 0, 0, 0, &tp); > if (error) > return error; > efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents); > -- > 2.28.0 > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation 2020-09-09 13:31 ` Brian Foster @ 2020-09-09 21:44 ` Dave Chinner 2020-09-10 13:18 ` Brian Foster 0 siblings, 1 reply; 12+ messages in thread From: Dave Chinner @ 2020-09-09 21:44 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Wed, Sep 09, 2020 at 09:31:11AM -0400, Brian Foster wrote: > On Wed, Sep 09, 2020 at 06:19:10PM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > Recovering an EFI currently uses a itruncate reservation, which is > > designed for a rolling transaction that modifies the BMBT and > > logs the EFI in one commit, then frees the space and logs the EFD in > > the second commit. > > > > Recovering the EFI only requires the second transaction in this > > pair, and hence has a smaller log space requirement than a truncate > > operation. Hence when the extent free is being processed at runtime, > > the log reservation that is held by the filesystem is only enough to > > complete the extent free, not the entire truncate operation. > > > > Hence if the EFI pins the tail of the log and the log fills up while > > the extent is being freed, the amount of reserved free space in the > > log is not enough to start another entire truncate operation. Hence > > if we crash at this point, log recovery will deadlock with the EFI > > pinning the tail of the log and the log not having enough free space > > to reserve an itruncate transaction. > > > > As such, EFI recovery needs it's own log space reservation separate > > to the itruncate reservation. We only need what is required free the > > extent, and this matches the space we have reserved at runtime for > > this operation and hence should prevent the recovery deadlock from > > occurring. > > > > This patch adds the new reservation in a way that minimises the > > change such that it should be back-portable to older kernels easily. > > Follow up patches will factor and rework the reservations to be more > > correct and more tightly defined. > > > > Note: this would appear to be a generic problem with intent > > recovery; we use the entire operation reservation for recovery, > > not the reservation that was held at runtime after the intent was > > logged. I suspect all intents are going to require their own > > reservation as a result. > > > > It might be worth explicitly pointing out that support for EFI/EFD > intents goes farther back than the various intents associated with newer > features, hence the value of a targeted fix. Ok. > > @@ -916,6 +916,16 @@ xfs_trans_resv_calc( > > resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT; > > resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES; > > > > + /* > > + * Log recovery reservations for intent replay > > + * > > + * EFI recovery is itruncate minus the initial transaction that logs > > + * logs the EFI. > > + */ > > + resp->tr_efi.tr_logres = resp->tr_itruncate.tr_logres; > > + resp->tr_efi.tr_logcount = resp->tr_itruncate.tr_logcount - 1; > > tr_itruncate.tr_logcount looks like it's either 2 or 8 depending on > whether reflink is enabled. On one hand this seems conservative enough, > but do we know exactly what those extra unit counts are accounted for in > the reflink case? Right, in the reflink case we may have to roll the transaction many more times to do the refcount btree and reverse map btree modifications. Those are done under separate intents and so we have to roll and commit the defered ops more times on a reflink/rmap based filesystem. Hence the logcount is higher so that the initial reservation can roll more times before regrant during a roll has to go and physically reserve more write space in the log to continue rolling the transaction. > It looks like extents are only freed when the last > reference is dropped (otherwise we log a refcount intent), which makes > me wonder whether we really need 7 log count units if recovery > encounters an EFI. I don't know if the numbers are correct, and it really is out of scope for this patch to audit/fix that. I really think we need to map this whole thing out in a diagram at this point because I now suspect that the allocfree log count calculation is not correct, either... > Also, while looking through the code I noticed that truncate does the > following: > > ... > error = xfs_defer_finish(&tp); > if (error) > goto out; > > error = xfs_trans_roll_inode(&tp, ip); > if (error) > goto out; > ... > > ... which looks like it rolls the transaction an extra time per-extent. > I don't think that contributes to this problem vs just being a runtime > inefficiency, so maybe I'll fling a patch up for that separately. Yeah, I'm not sure if this is correct/needed or not. > > * The following transactions are logged in logical format with > > * a default log count. > > diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h > > index 7241ab28cf84..13173b3eaac9 100644 > > --- a/fs/xfs/libxfs/xfs_trans_resv.h > > +++ b/fs/xfs/libxfs/xfs_trans_resv.h > > @@ -50,6 +50,8 @@ struct xfs_trans_resv { > > struct xfs_trans_res tr_qm_equotaoff;/* end of turn quota off */ > > struct xfs_trans_res tr_sb; /* modify superblock */ > > struct xfs_trans_res tr_fsyncts; /* update timestamps on fsync */ > > + struct xfs_trans_res tr_efi; /* EFI log item recovery */ > > + > > Extra whitespace line. Will fix. Thanks! -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation 2020-09-09 21:44 ` Dave Chinner @ 2020-09-10 13:18 ` Brian Foster 2020-09-10 21:29 ` Dave Chinner 0 siblings, 1 reply; 12+ messages in thread From: Brian Foster @ 2020-09-10 13:18 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Thu, Sep 10, 2020 at 07:44:55AM +1000, Dave Chinner wrote: > On Wed, Sep 09, 2020 at 09:31:11AM -0400, Brian Foster wrote: > > On Wed, Sep 09, 2020 at 06:19:10PM +1000, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > Recovering an EFI currently uses a itruncate reservation, which is > > > designed for a rolling transaction that modifies the BMBT and > > > logs the EFI in one commit, then frees the space and logs the EFD in > > > the second commit. > > > > > > Recovering the EFI only requires the second transaction in this > > > pair, and hence has a smaller log space requirement than a truncate > > > operation. Hence when the extent free is being processed at runtime, > > > the log reservation that is held by the filesystem is only enough to > > > complete the extent free, not the entire truncate operation. > > > > > > Hence if the EFI pins the tail of the log and the log fills up while > > > the extent is being freed, the amount of reserved free space in the > > > log is not enough to start another entire truncate operation. Hence > > > if we crash at this point, log recovery will deadlock with the EFI > > > pinning the tail of the log and the log not having enough free space > > > to reserve an itruncate transaction. > > > > > > As such, EFI recovery needs it's own log space reservation separate > > > to the itruncate reservation. We only need what is required free the > > > extent, and this matches the space we have reserved at runtime for > > > this operation and hence should prevent the recovery deadlock from > > > occurring. > > > > > > This patch adds the new reservation in a way that minimises the > > > change such that it should be back-portable to older kernels easily. > > > Follow up patches will factor and rework the reservations to be more > > > correct and more tightly defined. > > > > > > Note: this would appear to be a generic problem with intent > > > recovery; we use the entire operation reservation for recovery, > > > not the reservation that was held at runtime after the intent was > > > logged. I suspect all intents are going to require their own > > > reservation as a result. > > > > > > > It might be worth explicitly pointing out that support for EFI/EFD > > intents goes farther back than the various intents associated with newer > > features, hence the value of a targeted fix. > > Ok. > > > > @@ -916,6 +916,16 @@ xfs_trans_resv_calc( > > > resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT; > > > resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES; > > > > > > + /* > > > + * Log recovery reservations for intent replay > > > + * > > > + * EFI recovery is itruncate minus the initial transaction that logs > > > + * logs the EFI. > > > + */ > > > + resp->tr_efi.tr_logres = resp->tr_itruncate.tr_logres; > > > + resp->tr_efi.tr_logcount = resp->tr_itruncate.tr_logcount - 1; > > > > tr_itruncate.tr_logcount looks like it's either 2 or 8 depending on > > whether reflink is enabled. On one hand this seems conservative enough, > > but do we know exactly what those extra unit counts are accounted for in > > the reflink case? > > Right, in the reflink case we may have to roll the transaction many more > times to do the refcount btree and reverse map btree modifications. > Those are done under separate intents and so we have to roll and > commit the defered ops more times on a reflink/rmap based > filesystem. Hence the logcount is higher so that the initial > reservation can roll more times before regrant during a roll has to > go and physically reserve more write space in the log to continue > rolling the transaction. > > > It looks like extents are only freed when the last > > reference is dropped (otherwise we log a refcount intent), which makes > > me wonder whether we really need 7 log count units if recovery > > encounters an EFI. > > I don't know if the numbers are correct, and it really is out of > scope for this patch to audit/fix that. I really think we need to > map this whole thing out in a diagram at this point because I now > suspect that the allocfree log count calculation is not correct, > either... > I agree up to the point where it relates to this specific EFI recovery issue. reflink is enabled by default, which means the default EFI recovery reservation is going to have 7 logcount units. Is that actually enough of a reduction to prevent this same recovery problem on newer fs'? I'm wondering if the tr_efi logcount should just be set to 1, for example, at least for the short term fix. Brian > > Also, while looking through the code I noticed that truncate does the > > following: > > > > ... > > error = xfs_defer_finish(&tp); > > if (error) > > goto out; > > > > error = xfs_trans_roll_inode(&tp, ip); > > if (error) > > goto out; > > ... > > > > ... which looks like it rolls the transaction an extra time per-extent. > > I don't think that contributes to this problem vs just being a runtime > > inefficiency, so maybe I'll fling a patch up for that separately. > > Yeah, I'm not sure if this is correct/needed or not. > > > > * The following transactions are logged in logical format with > > > * a default log count. > > > diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h > > > index 7241ab28cf84..13173b3eaac9 100644 > > > --- a/fs/xfs/libxfs/xfs_trans_resv.h > > > +++ b/fs/xfs/libxfs/xfs_trans_resv.h > > > @@ -50,6 +50,8 @@ struct xfs_trans_resv { > > > struct xfs_trans_res tr_qm_equotaoff;/* end of turn quota off */ > > > struct xfs_trans_res tr_sb; /* modify superblock */ > > > struct xfs_trans_res tr_fsyncts; /* update timestamps on fsync */ > > > + struct xfs_trans_res tr_efi; /* EFI log item recovery */ > > > + > > > > Extra whitespace line. > > Will fix. > > Thanks! > > -Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation 2020-09-10 13:18 ` Brian Foster @ 2020-09-10 21:29 ` Dave Chinner 2020-09-11 12:37 ` Brian Foster 0 siblings, 1 reply; 12+ messages in thread From: Dave Chinner @ 2020-09-10 21:29 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Thu, Sep 10, 2020 at 09:18:10AM -0400, Brian Foster wrote: > On Thu, Sep 10, 2020 at 07:44:55AM +1000, Dave Chinner wrote: > > On Wed, Sep 09, 2020 at 09:31:11AM -0400, Brian Foster wrote: > > > It looks like extents are only freed when the last > > > reference is dropped (otherwise we log a refcount intent), which makes > > > me wonder whether we really need 7 log count units if recovery > > > encounters an EFI. > > > > I don't know if the numbers are correct, and it really is out of > > scope for this patch to audit/fix that. I really think we need to > > map this whole thing out in a diagram at this point because I now > > suspect that the allocfree log count calculation is not correct, > > either... > > I agree up to the point where it relates to this specific EFI recovery > issue. reflink is enabled by default, which means the default EFI > recovery reservation is going to have 7 logcount units. Is that actually > enough of a reduction to prevent this same recovery problem on newer > fs'? I'm wondering if the tr_efi logcount should just be set to 1, for > example, at least for the short term fix. I spent yesterday mapping out the whole "free extent" chain to understand exactly what was necessary, and in discussing this with Darrick on #xfs I came to the conclusion that we cannot -ever- have a logcount of more than 1 for an intent recovery of any sort. That's because the transaction may have rolled enough times to exhaust the initial grant of unit * logcount, so the only reservation that the runtime kernel has when it crashs is a single transaction unit reservation (which gets re-reserved on every roll). Hence recovery cannot assume that intent that was being processed has more than a single unit reservation of log space available to be used, and hence all intent recovery reservations must start with a log count of 1. THere are other restrictions we need to deal with, too. multiple intents may pin the tail of the log, so we can't just process a single intent chain at a time as that will result in using all the log space for a single intent chain and reservation deadlocking on one of the intents we haven't yet started. Hence the first thing we have to do in recovering intents is take an active reservation for -all- intents in the log to match the reservation state at runtime. Only then can we guarantee that intents that pin the tail of the log will have the reservation space needed to be able to unpin the log tail. Further, because we now may have consumed all the available log reservation space to guarantee forwards progress, the first commit of the first intent may block trying to regrant space if another intent also pins the tail of the log. This means we cannot replay intents in a serial manner as we currently do. Each intent chain and it's transaction context needs to run in it's own process context so it can block waiting for other intents to make progress, just like happens at runtime when the system crashed. IOWs, intent replay needs to be done with a task context per intent chain (probably via a workqueue). AFAICT, this is the only way we can make intent recovery deadlock free - we have to recreate the complete log reservation state in memory before we start recovery of a single intent, we can only assume an intent had a single unit reservation of log space assigned to it, and intent chain execution needs to run concurrently so commits can block waiting for log space to become available as other intents commit and unpin the tail of the log... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation 2020-09-10 21:29 ` Dave Chinner @ 2020-09-11 12:37 ` Brian Foster 0 siblings, 0 replies; 12+ messages in thread From: Brian Foster @ 2020-09-11 12:37 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Fri, Sep 11, 2020 at 07:29:27AM +1000, Dave Chinner wrote: > On Thu, Sep 10, 2020 at 09:18:10AM -0400, Brian Foster wrote: > > On Thu, Sep 10, 2020 at 07:44:55AM +1000, Dave Chinner wrote: > > > On Wed, Sep 09, 2020 at 09:31:11AM -0400, Brian Foster wrote: > > > > It looks like extents are only freed when the last > > > > reference is dropped (otherwise we log a refcount intent), which makes > > > > me wonder whether we really need 7 log count units if recovery > > > > encounters an EFI. > > > > > > I don't know if the numbers are correct, and it really is out of > > > scope for this patch to audit/fix that. I really think we need to > > > map this whole thing out in a diagram at this point because I now > > > suspect that the allocfree log count calculation is not correct, > > > either... > > > > I agree up to the point where it relates to this specific EFI recovery > > issue. reflink is enabled by default, which means the default EFI > > recovery reservation is going to have 7 logcount units. Is that actually > > enough of a reduction to prevent this same recovery problem on newer > > fs'? I'm wondering if the tr_efi logcount should just be set to 1, for > > example, at least for the short term fix. > > I spent yesterday mapping out the whole "free extent" chain to > understand exactly what was necessary, and in discussing this with > Darrick on #xfs I came to the conclusion that we cannot -ever- have > a logcount of more than 1 for an intent recovery of any sort. > > That's because the transaction may have rolled enough times to > exhaust the initial grant of unit * logcount, so the only > reservation that the runtime kernel has when it crashs is a single > transaction unit reservation (which gets re-reserved on every roll). > > Hence recovery cannot assume that intent that was being processed has more > than a single unit reservation of log space available to be used, > and hence all intent recovery reservations must start with a log > count of 1. > Makes sense. > THere are other restrictions we need to deal with, too. multiple > intents may pin the tail of the log, so we can't just process a > single intent chain at a time as that will result in using all the > log space for a single intent chain and reservation deadlocking on > one of the intents we haven't yet started. > > Hence the first thing we have to do in recovering intents is take an > active reservation for -all- intents in the log to match the > reservation state at runtime. Only then can we guarantee that > intents that pin the tail of the log will have the reservation space > needed to be able to unpin the log tail. > Which brings up the ordering question when multiple intents pin the tail... > Further, because we now may have consumed all the available log > reservation space to guarantee forwards progress, the first commit > of the first intent may block trying to regrant space if another > intent also pins the tail of the log. This means we cannot replay > intents in a serial manner as we currently do. Each intent chain and it's > transaction context needs to run in it's own process context so it > can block waiting for other intents to make progress, just like > happens at runtime when the system crashed. IOWs, intent replay > needs to be done with a task context per intent chain (probably via > a workqueue). > Indeed. > AFAICT, this is the only way we can make intent recovery deadlock > free - we have to recreate the complete log reservation state in > memory before we start recovery of a single intent, we can only > assume an intent had a single unit reservation of log space assigned > to it, and intent chain execution needs to run concurrently so > commits can block waiting for log space to become available as other > intents commit and unpin the tail of the log... > Ok, so the primary recovery task would either need to acquire and distribute the initial reservation or otherwise implement some kind of execution barrier for the intent recovery wq tasks to ensure that every currently log resident intent acquires a transaction before any other is allowed to roll. That doesn't seem _too_ limiting a restriction given that it technically should allow many intents (that don't result in intent chains) to execute to completion before waiting on others at all, though it does sound like a complete rework of the current post-recovery, single threaded intent recovery approach. I'm kind of wondering if we had some mechanism to freeze/unfreeze all transaction rolls, if we could then rework recovery to issue intent recovery wq tasks during pass2 as log records are written back to the fs. For example, with transaction rolls frozen, pass2 of recovery completes a particular log record and on I/O completion of all associated buffers, kicks off workqueue tasks for each unfinished intent in that particular record. The workqueue tasks will either complete and exit or unconditionally block on the next transaction roll (regrant). The main recovery task carries on and repeats for all subsequent records in the log, unfreezes regrants and waits for all outstanding tasks/transactions to (hopefully) complete. Hm? Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 2/3] xfs: add a free space extent change reservation 2020-09-09 8:19 [PATCH 0/3] [RFC] xfs: intent recovery reservation changes Dave Chinner 2020-09-09 8:19 ` [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation Dave Chinner @ 2020-09-09 8:19 ` Dave Chinner 2020-09-09 13:35 ` Brian Foster 2020-09-09 8:19 ` [PATCH 3/3] xfs: factor free space tree transaciton reservations Dave Chinner 2 siblings, 1 reply; 12+ messages in thread From: Dave Chinner @ 2020-09-09 8:19 UTC (permalink / raw) To: linux-xfs From: Dave Chinner <dchinner@redhat.com> Lots of the transaction reservation code reserves space for an extent allocation. It is inconsistently implemented, and many of them get it wrong. Introduce a new function to calculate the log space reservation for adding or removing an extent from the free space btrees. This function reserves space for logging the AGF, the AGFL and the free space btrees, avoiding the need to account for them seperately in every reservation that manipulates free space. Convert the EFI recovery reservation to use this transaction reservation as EFI recovery only needs to manipulate the free space index. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/libxfs/xfs_trans_resv.c | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c index da2ec052ac0a..621ddb277dfa 100644 --- a/fs/xfs/libxfs/xfs_trans_resv.c +++ b/fs/xfs/libxfs/xfs_trans_resv.c @@ -79,6 +79,23 @@ xfs_allocfree_log_count( return blocks; } +/* + * Log reservation required to add or remove a single extent to the free space + * btrees. This requires modifying: + * + * the agf header: 1 sector + * the agfl header: 1 sector + * the allocation btrees: 2 trees * (max depth - 1) * block size + */ +uint +xfs_allocfree_extent_res( + struct xfs_mount *mp) +{ + return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) + + xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), + XFS_FSB_TO_B(mp, 1)); +} + /* * Logging inodes is really tricksy. They are logged in memory format, * which means that what we write into the log doesn't directly translate into @@ -922,7 +939,7 @@ xfs_trans_resv_calc( * EFI recovery is itruncate minus the initial transaction that logs * logs the EFI. */ - resp->tr_efi.tr_logres = resp->tr_itruncate.tr_logres; + resp->tr_efi.tr_logres = xfs_allocfree_extent_res(mp); resp->tr_efi.tr_logcount = resp->tr_itruncate.tr_logcount - 1; resp->tr_efi.tr_logflags |= XFS_TRANS_PERM_LOG_RES; -- 2.28.0 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 2/3] xfs: add a free space extent change reservation 2020-09-09 8:19 ` [PATCH 2/3] xfs: add a free space extent change reservation Dave Chinner @ 2020-09-09 13:35 ` Brian Foster 2020-09-09 22:51 ` Dave Chinner 0 siblings, 1 reply; 12+ messages in thread From: Brian Foster @ 2020-09-09 13:35 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Wed, Sep 09, 2020 at 06:19:11PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > Lots of the transaction reservation code reserves space for an > extent allocation. It is inconsistently implemented, and many of > them get it wrong. Introduce a new function to calculate the log > space reservation for adding or removing an extent from the free > space btrees. > > This function reserves space for logging the AGF, the AGFL and the > free space btrees, avoiding the need to account for them seperately > in every reservation that manipulates free space. > > Convert the EFI recovery reservation to use this transaction > reservation as EFI recovery only needs to manipulate the free space > index. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/libxfs/xfs_trans_resv.c | 19 ++++++++++++++++++- > 1 file changed, 18 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c > index da2ec052ac0a..621ddb277dfa 100644 > --- a/fs/xfs/libxfs/xfs_trans_resv.c > +++ b/fs/xfs/libxfs/xfs_trans_resv.c > @@ -79,6 +79,23 @@ xfs_allocfree_log_count( > return blocks; > } > > +/* > + * Log reservation required to add or remove a single extent to the free space > + * btrees. This requires modifying: > + * > + * the agf header: 1 sector > + * the agfl header: 1 sector > + * the allocation btrees: 2 trees * (max depth - 1) * block size Nit, but the xfs_allocfree_log_count() helper this uses clearly indicates reservation for up to four trees. It might be worth referring to that here just to minimize spreading details all over the place that are likely to become stale or inconsistent over time. > + */ > +uint > +xfs_allocfree_extent_res( > + struct xfs_mount *mp) > +{ > + return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) + > + xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), Why calculate for a single extent when an EFI can refer to multiple extents? I thought the max was 2, but the extent free portion of the itruncate calculation actually uses an op count of 4. The reason for that is not immediately clear to me. It actually accounts 4 agf/agfl blocks as well, so perhaps there's a wrong assumption somewhere. FWIW, the truncate code allows 2 unmaps per transaction and the xfs_extent_free_defer_type struct limits the dfp to 16. I suspect the latter is not relevant for the current code. Either way, multiple extents are factored into the current freeing reservation and the extent freeing code at runtime (dfops) and during recovery both appear to iterate on an extent count (potentially > 1) per transaction. The itruncate comment, for reference (I also just noticed that the subsequent patch modifies this comment, so you're presumably aware of this mess): /* * ... * And the bmap_finish transaction can free the blocks and bmap blocks (t2): * the agf for each of the ags: 4 * sector size * the agfl for each of the ags: 4 * sector size * the super block to reflect the freed blocks: sector size * worst case split in allocation btrees per extent assuming 4 extents: * 4 exts * 2 trees * (2 * max depth - 1) * block size * ... */ Brian > + XFS_FSB_TO_B(mp, 1)); > +} > + > /* > * Logging inodes is really tricksy. They are logged in memory format, > * which means that what we write into the log doesn't directly translate into > @@ -922,7 +939,7 @@ xfs_trans_resv_calc( > * EFI recovery is itruncate minus the initial transaction that logs > * logs the EFI. > */ > - resp->tr_efi.tr_logres = resp->tr_itruncate.tr_logres; > + resp->tr_efi.tr_logres = xfs_allocfree_extent_res(mp); > resp->tr_efi.tr_logcount = resp->tr_itruncate.tr_logcount - 1; > resp->tr_efi.tr_logflags |= XFS_TRANS_PERM_LOG_RES; > > -- > 2.28.0 > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/3] xfs: add a free space extent change reservation 2020-09-09 13:35 ` Brian Foster @ 2020-09-09 22:51 ` Dave Chinner 2020-09-10 13:19 ` Brian Foster 0 siblings, 1 reply; 12+ messages in thread From: Dave Chinner @ 2020-09-09 22:51 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Wed, Sep 09, 2020 at 09:35:25AM -0400, Brian Foster wrote: > On Wed, Sep 09, 2020 at 06:19:11PM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > Lots of the transaction reservation code reserves space for an > > extent allocation. It is inconsistently implemented, and many of > > them get it wrong. Introduce a new function to calculate the log > > space reservation for adding or removing an extent from the free > > space btrees. > > > > This function reserves space for logging the AGF, the AGFL and the > > free space btrees, avoiding the need to account for them seperately > > in every reservation that manipulates free space. > > > > Convert the EFI recovery reservation to use this transaction > > reservation as EFI recovery only needs to manipulate the free space > > index. > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > fs/xfs/libxfs/xfs_trans_resv.c | 19 ++++++++++++++++++- > > 1 file changed, 18 insertions(+), 1 deletion(-) > > > > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c > > index da2ec052ac0a..621ddb277dfa 100644 > > --- a/fs/xfs/libxfs/xfs_trans_resv.c > > +++ b/fs/xfs/libxfs/xfs_trans_resv.c > > @@ -79,6 +79,23 @@ xfs_allocfree_log_count( > > return blocks; > > } > > > > +/* > > + * Log reservation required to add or remove a single extent to the free space > > + * btrees. This requires modifying: > > + * > > + * the agf header: 1 sector > > + * the agfl header: 1 sector > > + * the allocation btrees: 2 trees * (max depth - 1) * block size > > Nit, but the xfs_allocfree_log_count() helper this uses clearly > indicates reservation for up to four trees. It might be worth referring > to that here just to minimize spreading details all over the place that > are likely to become stale or inconsistent over time. Yup, but now I think of it, xfs_allocfree_log_count() doesn't seem right for freeing, and I need to check how allocation works again because stuff gets deferred. i.e. on freeing, AFAICT we don't modify the freespace trees, the reflink tree and the rmap trees all in the same transaction. We do a cycle that looks like this: log new intent, commit, execute the intent, log the intent done, log new intent, commit, execute the intent, log the intent done, log new intent, commit, .... And so I'm not sure that we are modifying the reflink, rmap and free space trees all in the same transaction and commit. > > > + */ > > +uint > > +xfs_allocfree_extent_res( > > + struct xfs_mount *mp) > > +{ > > + return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) + > > + xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), > > Why calculate for a single extent when an EFI can refer to multiple > extents? This isn't anything to do with an EFI at this point. It's just the reservation needed to remove a single extent from a single AG in isolation. I wanted to isolate this reservation completely from the rest of the transaction reservations and how it ends up being used. This deadlock surprised me, and so reflection and insight has lead me to think that we actually need our reservations to reflect the specific operations that will be performed by the transaction rather than an aggregation of things that get modified. Then each reservation can contain the reservations for each operation it performance (e.g. free an extent) That's kinda what this "intent needs it's own reservation" > I thought the max was 2, but the extent free portion of the > itruncate calculation actually uses an op count of 4. The reason for > that is not immediately clear to me. THe reason is that itruncate used to allow 4 extents to be freed in a single transaction. i.e. XFS_ITRUNC_MAX_EXTENTS used to be defined to 4, it is now 2. However, if you want to relate this to EFIs, I think this reservation is completely bogus. If all the extents freed a single BMBT extent free are packed into a single EFI (i.e. all the BMBT blocks freed and the data extent) then we'll overrun the reservation extent count of 4. The max extents being freed in a single BMBT extent free operations is a full bmbt merge + the data extent. i.e. XFS_BM_MAXLEVELS + 1 extents in a single EFI. Except that we're allowed to pack two data extent frees into a single EFI, and that means it is, worst case, a full BMBT merge + a BMBT merge up to level below root + 2 data extents, or: = (XFS_BM_MAXLEVELS + 1) + ((XFS_BM_MAXLEVELS - 1 - 1) + 1) = XFS_BM_MAXLEVELS * 2 extents And then we try to free all those extents in a single itruncate unit reservation. We probably don't hit this often because the maximum BMBT level is bound by filesystem size, and it's rare to have a >= 4-level BMBT tree needed to make this problematic. We do not have a reservation large enough to do that as a single transaction, and we don't want to have a reservation that large, either. This is not a limitation of the EFI/EFD construct - that can effectively be unbound. The limitation is that we don't roll and relog the EFD between each extent in the EFI that we free. The itruncate reservation has a bound limit for freeing extents, the EFI reservation does not define that. So, really, I think our extent free reservations and some of the intent execution behaviour needs a serious audit and rework, and we need to decouple the extent allocation and freeing reservations as extent allocation is now a fundamentally different operation to freeing an extent. i.e. Allocation does BMBT, refcount, rmap and freespace tree mods all in one transaction, while freeing does multiple individual transactional modifications linked by chained intents. > Either way, multiple extents are factored into the current freeing > reservation and the extent freeing code at runtime (dfops) and during > recovery both appear to iterate on an extent count (potentially > 1) per > transaction. The itruncate comment, for reference (I also just noticed > that the subsequent patch modifies this comment, so you're presumably > aware of this mess): > > /* > * ... > * And the bmap_finish transaction can free the blocks and bmap blocks (t2): > * the agf for each of the ags: 4 * sector size > * the agfl for each of the ags: 4 * sector size > * the super block to reflect the freed blocks: sector size > * worst case split in allocation btrees per extent assuming 4 extents: > * 4 exts * 2 trees * (2 * max depth - 1) * block size > * ... > */ Oh, yeah, I'm well aware of it - ISTR commenting on it past discussions about the transaction reservation sizes being much larger than necessary.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/3] xfs: add a free space extent change reservation 2020-09-09 22:51 ` Dave Chinner @ 2020-09-10 13:19 ` Brian Foster 0 siblings, 0 replies; 12+ messages in thread From: Brian Foster @ 2020-09-10 13:19 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Thu, Sep 10, 2020 at 08:51:27AM +1000, Dave Chinner wrote: > On Wed, Sep 09, 2020 at 09:35:25AM -0400, Brian Foster wrote: > > On Wed, Sep 09, 2020 at 06:19:11PM +1000, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > Lots of the transaction reservation code reserves space for an > > > extent allocation. It is inconsistently implemented, and many of > > > them get it wrong. Introduce a new function to calculate the log > > > space reservation for adding or removing an extent from the free > > > space btrees. > > > > > > This function reserves space for logging the AGF, the AGFL and the > > > free space btrees, avoiding the need to account for them seperately > > > in every reservation that manipulates free space. > > > > > > Convert the EFI recovery reservation to use this transaction > > > reservation as EFI recovery only needs to manipulate the free space > > > index. > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > --- > > > fs/xfs/libxfs/xfs_trans_resv.c | 19 ++++++++++++++++++- > > > 1 file changed, 18 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c > > > index da2ec052ac0a..621ddb277dfa 100644 > > > --- a/fs/xfs/libxfs/xfs_trans_resv.c > > > +++ b/fs/xfs/libxfs/xfs_trans_resv.c > > > @@ -79,6 +79,23 @@ xfs_allocfree_log_count( > > > return blocks; > > > } > > > > > > +/* > > > + * Log reservation required to add or remove a single extent to the free space > > > + * btrees. This requires modifying: > > > + * > > > + * the agf header: 1 sector > > > + * the agfl header: 1 sector > > > + * the allocation btrees: 2 trees * (max depth - 1) * block size > > > > Nit, but the xfs_allocfree_log_count() helper this uses clearly > > indicates reservation for up to four trees. It might be worth referring > > to that here just to minimize spreading details all over the place that > > are likely to become stale or inconsistent over time. > > Yup, but now I think of it, xfs_allocfree_log_count() doesn't seem > right for freeing, and I need to check how allocation works again > because stuff gets deferred. > > i.e. on freeing, AFAICT we don't modify the freespace trees, the reflink > tree and the rmap trees all in the same transaction. We do a cycle > that looks like this: > > log new intent, commit, execute the intent, log the intent done, > log new intent, commit, execute the intent, log the intent done, > log new intent, commit, .... > > And so I'm not sure that we are modifying the reflink, rmap and free > space trees all in the same transaction and commit. > Indeed. > > > > > + */ > > > +uint > > > +xfs_allocfree_extent_res( > > > + struct xfs_mount *mp) > > > +{ > > > + return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) + > > > + xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), > > > > Why calculate for a single extent when an EFI can refer to multiple > > extents? > > This isn't anything to do with an EFI at this point. It's just the > reservation needed to remove a single extent from a single AG in > isolation. I wanted to isolate this reservation completely from the > rest of the transaction reservations and how it ends up being used. > Sure, but this is used as the reservation size for tr_efi. The current itruncate reservation is clearly more than is required for EFI recovery, but my point is that the current itruncate code means recovery must handle EFIs with at least a couple extents (perhaps more based on your analysis below). Recovery currently processes efi_nextents all in a single transaction. The current recovery reservation seems to accommodate current runtime behavior, but does the new reservation based on this helper satisfy EFIs with >1 extents? I haven't tested it, but it seems potentially deficient based on inspection... > This deadlock surprised me, and so reflection and insight has lead > me to think that we actually need our reservations to reflect the > specific operations that will be performed by the transaction rather > than an aggregation of things that get modified. > > Then each reservation can contain the reservations for each > operation it performance (e.g. free an extent) > > That's kinda what this "intent needs it's own reservation" > Ack. > > I thought the max was 2, but the extent free portion of the > > itruncate calculation actually uses an op count of 4. The reason for > > that is not immediately clear to me. > > THe reason is that itruncate used to allow 4 extents to be freed in > a single transaction. i.e. XFS_ITRUNC_MAX_EXTENTS used to be defined > to 4, it is now 2. > Ok, well that's a simple enough explanation.. :P > However, if you want to relate this to EFIs, I think this > reservation is completely bogus. If all the extents freed a single > BMBT extent free are packed into a single EFI (i.e. all the BMBT > blocks freed and the data extent) then we'll overrun the reservation > extent count of 4. The max extents being freed in a single > BMBT extent free operations is a full bmbt merge + the data extent. > i.e. XFS_BM_MAXLEVELS + 1 extents in a single EFI. Except that > we're allowed to pack two data extent frees into a single EFI, and > that means it is, worst case, a full BMBT merge + a BMBT merge up to > level below root + 2 data extents, or: > All I want to do wrt to this patch is make sure the recovery reservation is large enough to process the EFIs it might see in the log based on the current code. > = (XFS_BM_MAXLEVELS + 1) + ((XFS_BM_MAXLEVELS - 1 - 1) + 1) > = XFS_BM_MAXLEVELS * 2 extents > > And then we try to free all those extents in a single itruncate unit > reservation. We probably don't hit this often because the maximum > BMBT level is bound by filesystem size, and it's rare to have a >= > 4-level BMBT tree needed to make this problematic. We do not have a > reservation large enough to do that as a single transaction, and we > don't want to have a reservation that large, either. > Agreed. > This is not a limitation of the EFI/EFD construct - that can > effectively be unbound. The limitation is that we don't roll and > relog the EFD between each extent in the EFI that we free. The > itruncate reservation has a bound limit for freeing extents, the > EFI reservation does not define that. > Sure. > So, really, I think our extent free reservations and some of the > intent execution behaviour needs a serious audit and rework, and we > need to decouple the extent allocation and freeing reservations as > extent allocation is now a fundamentally different operation to > freeing an extent. i.e. Allocation does BMBT, refcount, rmap and > freespace tree mods all in one transaction, while freeing does > multiple individual transactional modifications linked by chained > intents. > I don't think an allocation transaction is so overloaded as compared to freeing. I thought reflink was more of a separate remap operation vs being tied to block allocation, not to say that a remap might depend on block allocs or not, but both remaps and refcount updates look like individual intents (along with rmapbt updates associated with block mappings/allocations). Regardless, I agree that the current transaction (and associated reservation) situation is a mess, particularly now with more use of intents, complex chains thereof, and certain transactions (tr_itruncate/tr_write) being overloaded to the point where they're huge or simply too difficult to reason about.. Brian > > Either way, multiple extents are factored into the current freeing > > reservation and the extent freeing code at runtime (dfops) and during > > recovery both appear to iterate on an extent count (potentially > 1) per > > transaction. The itruncate comment, for reference (I also just noticed > > that the subsequent patch modifies this comment, so you're presumably > > aware of this mess): > > > > /* > > * ... > > * And the bmap_finish transaction can free the blocks and bmap blocks (t2): > > * the agf for each of the ags: 4 * sector size > > * the agfl for each of the ags: 4 * sector size > > * the super block to reflect the freed blocks: sector size > > * worst case split in allocation btrees per extent assuming 4 extents: > > * 4 exts * 2 trees * (2 * max depth - 1) * block size > > * ... > > */ > > Oh, yeah, I'm well aware of it - ISTR commenting on it past > discussions about the transaction reservation sizes being much > larger than necessary.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 3/3] xfs: factor free space tree transaciton reservations 2020-09-09 8:19 [PATCH 0/3] [RFC] xfs: intent recovery reservation changes Dave Chinner 2020-09-09 8:19 ` [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation Dave Chinner 2020-09-09 8:19 ` [PATCH 2/3] xfs: add a free space extent change reservation Dave Chinner @ 2020-09-09 8:19 ` Dave Chinner 2 siblings, 0 replies; 12+ messages in thread From: Dave Chinner @ 2020-09-09 8:19 UTC (permalink / raw) To: linux-xfs From: Dave Chinner <dchinner@redhat.com> Convert all the open coded free space tree modification reservations to use the new xfs_allocfree_extent_res() function. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/libxfs/xfs_trans_resv.c | 116 +++++++++++++-------------------- 1 file changed, 44 insertions(+), 72 deletions(-) diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c index 621ddb277dfa..cb3cddb84d75 100644 --- a/fs/xfs/libxfs/xfs_trans_resv.c +++ b/fs/xfs/libxfs/xfs_trans_resv.c @@ -143,18 +143,16 @@ xfs_calc_inode_res( * reservation: * * the inode btree: max depth * blocksize - * the allocation btrees: 2 trees * (max depth - 1) * block size + * one extent for the AG free space trees * - * The caller must account for SB and AG header modifications, etc. */ STATIC uint xfs_calc_inobt_res( struct xfs_mount *mp) { return xfs_calc_buf_res(M_IGEO(mp)->inobt_maxlevels, - XFS_FSB_TO_B(mp, 1)) + - xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), - XFS_FSB_TO_B(mp, 1)); + XFS_FSB_TO_B(mp, 1)) + + xfs_allocfree_extent_res(mp); } /* @@ -200,8 +198,7 @@ xfs_calc_inode_chunk_res( { uint res, size = 0; - res = xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), - XFS_FSB_TO_B(mp, 1)); + res = xfs_allocfree_extent_res(mp); if (alloc) { /* icreate tx uses ordered buffers */ if (xfs_sb_version_has_v3inode(&mp->m_sb)) @@ -256,22 +253,19 @@ xfs_rtalloc_log_count( * extents. This gives (t1): * the inode getting the new extents: inode size * the inode's bmap btree: max depth * block size - * the agfs of the ags from which the extents are allocated: 2 * sector * the superblock free block counter: sector size - * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size + * two extents for the AG free space trees * Or, if we're writing to a realtime file (t2): * the inode getting the new extents: inode size * the inode's bmap btree: max depth * block size - * the agfs of the ags from which the extents are allocated: 2 * sector + * one extent for the AG free space trees * the superblock free block counter: sector size * the realtime bitmap: ((MAXEXTLEN / rtextsize) / NBBY) bytes * the realtime summary: 1 block - * the allocation btrees: 2 trees * (2 * max depth - 1) * block size * And the bmap_finish transaction can free bmap blocks in a join (t3): - * the agfs of the ags containing the blocks: 2 * sector size * the agfls of the ags containing the blocks: 2 * sector size * the super block free block counter: sector size - * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size + * two extents for the AG free space trees */ STATIC uint xfs_calc_write_reservation( @@ -282,8 +276,8 @@ xfs_calc_write_reservation( t1 = xfs_calc_inode_res(mp, 1) + xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK), blksz) + - xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) + - xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), blksz); + xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) + + xfs_allocfree_extent_res(mp) * 2; if (xfs_sb_version_hasrealtime(&mp->m_sb)) { t2 = xfs_calc_inode_res(mp, 1) + @@ -291,13 +285,13 @@ xfs_calc_write_reservation( blksz) + xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) + xfs_calc_buf_res(xfs_rtalloc_log_count(mp, 1), blksz) + - xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), blksz); + xfs_allocfree_extent_res(mp); } else { t2 = 0; } - t3 = xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) + - xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), blksz); + t3 = xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) + + xfs_allocfree_extent_res(mp) * 2; return XFS_DQUOT_LOGRES(mp) + max3(t1, t2, t3); } @@ -307,19 +301,13 @@ xfs_calc_write_reservation( * the inode being truncated: inode size * the inode's bmap btree: (max depth + 1) * block size * And the bmap_finish transaction can free the blocks and bmap blocks (t2): - * the agf for each of the ags: 4 * sector size - * the agfl for each of the ags: 4 * sector size * the super block to reflect the freed blocks: sector size - * worst case split in allocation btrees per extent assuming 4 extents: - * 4 exts * 2 trees * (2 * max depth - 1) * block size + * four (XXX: one?) extents for the AG free space trees * Or, if it's a realtime file (t3): - * the agf for each of the ags: 2 * sector size - * the agfl for each of the ags: 2 * sector size * the super block to reflect the freed blocks: sector size * the realtime bitmap: 2 exts * ((MAXEXTLEN / rtextsize) / NBBY) bytes * the realtime summary: 2 exts * 1 block - * worst case split in allocation btrees per extent assuming 2 extents: - * 2 exts * 2 trees * (2 * max depth - 1) * block size + * two extents for the AG free space trees */ STATIC uint xfs_calc_itruncate_reservation( @@ -331,13 +319,13 @@ xfs_calc_itruncate_reservation( t1 = xfs_calc_inode_res(mp, 1) + xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1, blksz); - t2 = xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) + - xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4), blksz); + t2 = xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) + + xfs_allocfree_extent_res(mp) * 4; if (xfs_sb_version_hasrealtime(&mp->m_sb)) { - t3 = xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) + + t3 = xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) + xfs_calc_buf_res(xfs_rtalloc_log_count(mp, 2), blksz) + - xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), blksz); + xfs_allocfree_extent_res(mp) * 2; } else { t3 = 0; } @@ -352,10 +340,8 @@ xfs_calc_itruncate_reservation( * the two directory bmap btrees: 2 * max depth * block size * And the bmap_finish transaction can free dir and bmap blocks (two sets * of bmap blocks) giving: - * the agf for the ags in which the blocks live: 3 * sector size - * the agfl for the ags in which the blocks live: 3 * sector size * the superblock for the free block count: sector size - * the allocation btrees: 3 exts * 2 trees * (2 * max depth - 1) * block size + * three extents for the AG free space trees */ STATIC uint xfs_calc_rename_reservation( @@ -365,9 +351,8 @@ xfs_calc_rename_reservation( max((xfs_calc_inode_res(mp, 4) + xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1))), - (xfs_calc_buf_res(7, mp->m_sb.sb_sectsize) + - xfs_calc_buf_res(xfs_allocfree_log_count(mp, 3), - XFS_FSB_TO_B(mp, 1)))); + (xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) + + xfs_allocfree_extent_res(mp) * 3)); } /* @@ -381,20 +366,19 @@ xfs_calc_iunlink_remove_reservation( struct xfs_mount *mp) { return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) + - 2 * M_IGEO(mp)->inode_cluster_size; + xfs_calc_buf_res(2, M_IGEO(mp)->inode_cluster_size); } /* * For creating a link to an inode: + * the inode is removed from the iunlink list (O_TMPFILE) * the parent directory inode: inode size * the linked inode: inode size * the directory btree could split: (max depth + v2) * dir block size * the directory bmap btree could join or split: (max depth + v2) * blocksize * And the bmap_finish transaction can free some bmap blocks giving: - * the agf for the ag in which the blocks live: sector size - * the agfl for the ag in which the blocks live: sector size * the superblock for the free block count: sector size - * the allocation btrees: 2 trees * (2 * max depth - 1) * block size + * one extent for the AG free space trees */ STATIC uint xfs_calc_link_reservation( @@ -405,9 +389,8 @@ xfs_calc_link_reservation( max((xfs_calc_inode_res(mp, 2) + xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1))), - (xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) + - xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), - XFS_FSB_TO_B(mp, 1)))); + (xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) + + xfs_allocfree_extent_res(mp))); } /* @@ -419,20 +402,19 @@ STATIC uint xfs_calc_iunlink_add_reservation(xfs_mount_t *mp) { return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) + - M_IGEO(mp)->inode_cluster_size; + xfs_calc_buf_res(1, M_IGEO(mp)->inode_cluster_size); } /* * For removing a directory entry we can modify: + * the inode is added to the agi unlinked list * the parent directory inode: inode size * the removed inode: inode size * the directory btree could join: (max depth + v2) * dir block size * the directory bmap btree could join or split: (max depth + v2) * blocksize * And the bmap_finish transaction can free the dir and bmap blocks giving: - * the agf for the ag in which the blocks live: 2 * sector size - * the agfl for the ag in which the blocks live: 2 * sector size * the superblock for the free block count: sector size - * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size + * two extents for the AG free space trees */ STATIC uint xfs_calc_remove_reservation( @@ -443,9 +425,8 @@ xfs_calc_remove_reservation( max((xfs_calc_inode_res(mp, 1) + xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1))), - (xfs_calc_buf_res(4, mp->m_sb.sb_sectsize) + - xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), - XFS_FSB_TO_B(mp, 1)))); + (xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) + + xfs_allocfree_extent_res(mp) * 2)); } /* @@ -582,15 +563,14 @@ xfs_calc_ichange_reservation( * Growing the data section of the filesystem. * superblock * agi and agf - * allocation btrees + * growing the last AG requires freeing one extent */ STATIC uint xfs_calc_growdata_reservation( struct xfs_mount *mp) { - return xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) + - xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), - XFS_FSB_TO_B(mp, 1)); + return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) + + xfs_allocfree_extent_res(mp); } /* @@ -598,10 +578,9 @@ xfs_calc_growdata_reservation( * In the first set of transactions (ALLOC) we allocate space to the * bitmap or summary files. * superblock: sector size - * agf of the ag from which the extent is allocated: sector size * bmap btree for bitmap/summary inode: max depth * blocksize * bitmap/summary inode: inode size - * allocation btrees for 1 block alloc: 2 * (2 * maxdepth - 1) * blocksize + * one extent for the AG free space trees */ STATIC uint xfs_calc_growrtalloc_reservation( @@ -611,8 +590,7 @@ xfs_calc_growrtalloc_reservation( xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK), XFS_FSB_TO_B(mp, 1)) + xfs_calc_inode_res(mp, 1) + - xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), - XFS_FSB_TO_B(mp, 1)); + xfs_allocfree_extent_res(mp); } /* @@ -675,7 +653,7 @@ xfs_calc_writeid_reservation( * agf block and superblock (for block allocation) * the new block (directory sized) * bmap blocks for the new directory block - * allocation btrees + * one extent for the AG free space trees */ STATIC uint xfs_calc_addafork_reservation( @@ -687,8 +665,7 @@ xfs_calc_addafork_reservation( xfs_calc_buf_res(1, mp->m_dir_geo->blksize) + xfs_calc_buf_res(XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK) + 1, XFS_FSB_TO_B(mp, 1)) + - xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), - XFS_FSB_TO_B(mp, 1)); + xfs_allocfree_extent_res(mp); } /* @@ -696,11 +673,8 @@ xfs_calc_addafork_reservation( * the inode being truncated: inode size * the inode's bmap btree: max depth * block size * And the bmap_finish transaction can free the blocks and bmap blocks: - * the agf for each of the ags: 4 * sector size - * the agfl for each of the ags: 4 * sector size * the super block to reflect the freed blocks: sector size - * worst case split in allocation btrees per extent assuming 4 extents: - * 4 exts * 2 trees * (2 * max depth - 1) * block size + * four (XXX: really? should be one?) extents for the AG free space trees */ STATIC uint xfs_calc_attrinval_reservation( @@ -709,9 +683,8 @@ xfs_calc_attrinval_reservation( return max((xfs_calc_inode_res(mp, 1) + xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK), XFS_FSB_TO_B(mp, 1))), - (xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) + - xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4), - XFS_FSB_TO_B(mp, 1)))); + (xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) + + xfs_allocfree_extent_res(mp) * 4)); } /* @@ -763,7 +736,7 @@ xfs_calc_attrsetrt_reservation( * the agf for the ag in which the blocks live: 2 * sector size * the agfl for the ag in which the blocks live: 2 * sector size * the superblock for the free block count: sector size - * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size + * two extents for the AG free space trees */ STATIC uint xfs_calc_attrrm_reservation( @@ -776,9 +749,8 @@ xfs_calc_attrrm_reservation( (uint)XFS_FSB_TO_B(mp, XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK)) + xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK), 0)), - (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) + - xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), - XFS_FSB_TO_B(mp, 1)))); + (xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) + + xfs_allocfree_extent_res(mp) * 2)); } /* -- 2.28.0 ^ permalink raw reply related [flat|nested] 12+ messages in thread
end of thread, other threads:[~2020-09-11 17:42 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2020-09-09 8:19 [PATCH 0/3] [RFC] xfs: intent recovery reservation changes Dave Chinner 2020-09-09 8:19 ` [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation Dave Chinner 2020-09-09 13:31 ` Brian Foster 2020-09-09 21:44 ` Dave Chinner 2020-09-10 13:18 ` Brian Foster 2020-09-10 21:29 ` Dave Chinner 2020-09-11 12:37 ` Brian Foster 2020-09-09 8:19 ` [PATCH 2/3] xfs: add a free space extent change reservation Dave Chinner 2020-09-09 13:35 ` Brian Foster 2020-09-09 22:51 ` Dave Chinner 2020-09-10 13:19 ` Brian Foster 2020-09-09 8:19 ` [PATCH 3/3] xfs: factor free space tree transaciton reservations Dave Chinner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox