From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Wengang Wang <wen.gang.wang@oracle.com>,
"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
Srikanth C S <srikanth.c.s@oracle.com>
Subject: Re: Question: reserve log space at IO time for recover
Date: Sun, 27 Aug 2023 09:04:30 -0700 [thread overview]
Message-ID: <20230827160430.GA28186@frogsfrogsfrogs> (raw)
In-Reply-To: <ZOlzdNT/DWb+fmPq@dread.disaster.area>
On Sat, Aug 26, 2023 at 01:37:24PM +1000, Dave Chinner wrote:
> On Thu, Aug 24, 2023 at 03:01:54PM -0700, Darrick J. Wong wrote:
> > On Thu, Aug 24, 2023 at 05:28:23PM +1000, Dave Chinner wrote:
> > > On Wed, Aug 23, 2023 at 09:52:34PM -0700, Darrick J. Wong wrote:
> > > > On Fri, Aug 18, 2023 at 03:25:46AM +0000, Wengang Wang wrote:
> > > >
> > > > Since xfs_efi_item_recover is only performing one step of what could be
> > > > a chain of deferred updates, it never rolls the transaction that it
> > > > creates. It therefore only requires the amount of grant space that
> > > > you'd get with tr_logcount == 1. It is therefore a bit silly that we
> > > > ask for more than that, and in bad cases like this, hang log recovery
> > > > needlessly.
> > >
> > > But this doesn't fix the whatever problem lead to the recovery not
> > > having the same full tr_itruncate reservation available as was held
> > > by the transaction that logged the EFI and was running the extent
> > > free at the time the system crashed. There should -always- be enough
> > > transaction reservation space in the journal to reserve space for an
> > > intent replay if the intent recovery reservation uses the same
> > > reservation type as runtime.
> > >
> > > Hence I think this is still just a band-aid over whatever went wrong
> > > at runtime that lead to the log not having enough space for a
> > > reservation that was clearly held at runtime and hadn't yet used.
> >
> > Maybe I'm not remembering accurately how permanent log reservations
> > work. Let's continue picking on tr_itruncate from Wengang's example.
> > IIRC, he said that tr_itruncate on the running system was defined
> > roughly like so:
> >
> > tr_itruncate = {
> > .tr_logres = 180K
> > .tr_logcount = 2,
> > .tr_logflags = XFS_TRANS_PERM_LOG_RES,
> > }
> >
> > At runtime, when we want to start a truncation update, we do this:
> >
> > xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, ...);
> >
> > Call sequence: xfs_trans_alloc -> xfs_trans_reserve -> xfs_log_reserve
> > -> xlog_ticket_alloc. Ticket allocation assigns tic->t_unit_res =
> > (tr_logres + overhead) and tic->t_cnt = tr_logcount.
> >
> > (Let's pretend for the sake of argument that the overhead is 5K.)
> >
> > Now xfs_log_reserve calls xlog_grant_head_check. That does:
> >
> > *need_bytes = xlog_ticket_reservation(log, head, tic);
> >
> > For the reserve head, the ticket reservation computation is
> > (tic->t_unit_res * tic->t_cnt), which in this case is (185K * 2) ==
> > 370K, right? So we make sure there's at least 370K free in the reserve
> > head, then add that to the reserve and write heads.
> >
> > Now that we've allocated the transaction, delete the bmap mapping,
> > log an EFI to free the space, and roll the transaction as part of
> > finishing the deferops chain. Rolling creates a new xfs_trans which
> > shares its ticket with the old transaction. Next, xfs_trans_roll calls
> > __xfs_trans_commit with regrant == true, which calls xlog_cil_commit
> > with the same regrant parameter.
> >
> > xlog_cil_commit calls xfs_log_ticket_regrant, which decrements t_cnt and
> > subtracts t_curr_res from the reservation and write heads.
> >
> > If the filesystem is fresh and the first transaction only used (say)
> > 20K, then t_curr_res will be 165K, and we give that much reservation
> > back to the reservation head. Or if the file is really fragmented and
> > the first transaction actually uses 170K, then t_curr_res will be 15K,
> > and that's what we give back to the reservation.
> >
> > Having done that, we're now headed into the second transaction with an
> > EFI and 185K of reservation, correct?
>
> Ah, right, I overlooked that the long running truncates only regrant a
> single reservation unit at a time when they roll, and the runtime
> instances I seen of this are with long running truncate operations
> (i.e. inactivation after unlink for multi-thousand extent inodes).
>
> So, yes, all types of intent recovery must only use a single unit
> reservation, not just EFIs, because there is no guarantee that there
> is a full unit * count reservation available in the journal whenever
> that intent was first logged. Indeed, at best it will be 'count - 1'
> that is avaialble, becuase the transaction that logged the intent
> will have used a full unit of the original reservation....
<nod> I think something like this will fix the problem, right?
diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
index 2420865f3007..a5100a11faf9 100644
--- a/fs/xfs/libxfs/xfs_log_recover.h
+++ b/fs/xfs/libxfs/xfs_log_recover.h
@@ -131,4 +131,26 @@ void xlog_check_buf_cancel_table(struct xlog *log);
#define xlog_check_buf_cancel_table(log) do { } while (0)
#endif
+/*
+ * Transform a regular reservation into one suitable for recovery of a log
+ * intent item.
+ *
+ * Intent recovery only runs a single step of the transaction chain and defers
+ * the rest to a separate transaction. Therefore, we reduce logcount to 1 here
+ * to avoid livelocks if the log grant space is nearly exhausted due to the
+ * recovered intent pinning the tail. Keep the same logflags to avoid tripping
+ * asserts elsewhere. Struct copies abound below.
+ */
+static inline struct xfs_trans_res
+xlog_recover_resv(const struct xfs_trans_res *r)
+{
+ struct xfs_trans_res ret = {
+ .tr_logres = r->tr_logres,
+ .tr_logcount = 1,
+ .tr_logflags = r->tr_logflags,
+ };
+
+ return ret;
+}
+
#endif /* __XFS_LOG_RECOVER_H__ */
diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
index 2788a6f2edcd..36fe2abb16e6 100644
--- a/fs/xfs/xfs_attr_item.c
+++ b/fs/xfs/xfs_attr_item.c
@@ -547,7 +547,7 @@ xfs_attri_item_recover(
struct xfs_inode *ip;
struct xfs_da_args *args;
struct xfs_trans *tp;
- struct xfs_trans_res tres;
+ struct xfs_trans_res resv;
struct xfs_attri_log_format *attrp;
struct xfs_attri_log_nameval *nv = attrip->attri_nameval;
int error;
@@ -618,8 +618,9 @@ xfs_attri_item_recover(
goto out;
}
- xfs_init_attr_trans(args, &tres, &total);
- error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE, &tp);
+ xfs_init_attr_trans(args, &resv, &total);
+ resv = xlog_recover_resv(&resv);
+ error = xfs_trans_alloc(mp, &resv, total, 0, XFS_TRANS_RESERVE, &tp);
if (error)
goto out;
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 7551c3ec4ea5..e736a0844c89 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -490,6 +490,7 @@ xfs_bui_item_recover(
struct list_head *capture_list)
{
struct xfs_bmap_intent fake = { };
+ struct xfs_trans_res resv;
struct xfs_bui_log_item *buip = BUI_ITEM(lip);
struct xfs_trans *tp;
struct xfs_inode *ip = NULL;
@@ -515,7 +516,8 @@ xfs_bui_item_recover(
return error;
/* Allocate transaction and do the work. */
- error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
+ resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
+ error = xfs_trans_alloc(mp, &resv,
XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
if (error)
goto err_rele;
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index f1a5ecf099aa..3fa8789820ad 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -660,6 +660,7 @@ xfs_efi_item_recover(
struct xfs_log_item *lip,
struct list_head *capture_list)
{
+ struct xfs_trans_res resv;
struct xfs_efi_log_item *efip = EFI_ITEM(lip);
struct xfs_mount *mp = lip->li_log->l_mp;
struct xfs_efd_log_item *efdp;
@@ -683,7 +684,8 @@ xfs_efi_item_recover(
}
}
- error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+ resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
+ error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp);
if (error)
return error;
efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents);
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index edd8587658d5..2d4444d61e98 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -477,6 +477,7 @@ xfs_cui_item_recover(
struct xfs_log_item *lip,
struct list_head *capture_list)
{
+ struct xfs_trans_res resv;
struct xfs_cui_log_item *cuip = CUI_ITEM(lip);
struct xfs_cud_log_item *cudp;
struct xfs_trans *tp;
@@ -514,8 +515,9 @@ xfs_cui_item_recover(
* doesn't fit. We need to reserve enough blocks to handle a
* full btree split on either end of the refcount range.
*/
- error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
- mp->m_refc_maxlevels * 2, 0, XFS_TRANS_RESERVE, &tp);
+ resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
+ error = xfs_trans_alloc(mp, &resv, mp->m_refc_maxlevels * 2, 0,
+ XFS_TRANS_RESERVE, &tp);
if (error)
return error;
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 520c7ebdfed8..0e0e747028da 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -507,6 +507,7 @@ xfs_rui_item_recover(
struct xfs_log_item *lip,
struct list_head *capture_list)
{
+ struct xfs_trans_res resv;
struct xfs_rui_log_item *ruip = RUI_ITEM(lip);
struct xfs_rud_log_item *rudp;
struct xfs_trans *tp;
@@ -530,8 +531,9 @@ xfs_rui_item_recover(
}
}
- error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
- mp->m_rmap_maxlevels, 0, XFS_TRANS_RESERVE, &tp);
+ resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
+ error = xfs_trans_alloc(mp, &resv, mp->m_rmap_maxlevels, 0,
+ XFS_TRANS_RESERVE, &tp);
if (error)
return error;
rudp = xfs_trans_get_rud(tp, ruip);
> > IOWs, I'm operating on an assumption that we have two problems to solve:
> > the runtime acconting bug that you've been chasing, and Wengang's where
> > log recovery asks for more log space than what had been in the log
> > ticket at the time of the crash.
>
> OK, yes, it does seem there are two problems here...
>
> > > I suspect that the cause of this recovery issue is the we require an
> > > overcommit of the log space accounting before we throttle incoming
> > > transaction reservations (i.e. the new reservation has to overrun
> > > before we throttle). I think that the reservation accounting overrun
> > > detection can race to the first item being placed on the wait list,
> >
> > Yeah, I was wondering that myself when I was looking at the logic
> > between list_empty_careful and the second xlog_grant_head_wait and
> > wondering if that "careful" construction actually worked.
>
> Well, it's not the careful check that is the problem; the race is
> much simpler than that. We just have to have two concurrent
> reservations that will both fit in the remaining log space
> individually but not together. i.e. this is the overcommit race
> window:
>
> P1 P2
> --------------------------- --------------------------
> xfs_log_reserve(10000 bytes)
> xlog_grant_head_check()
> xlog_space_left()
> <sees 12000 bytes>
> OK, doesn't wait
> xfs_log_reserve(10000 bytes)
> xlog_grant_head_check()
> xlog_space_left()
> <sees 12000 bytes>
> OK, doesn't wait
> xlog_grant_add_space(10k)
> xlog_grant_add_space(10k)
>
> And now we've overcommitted the log by 8000 bytes....
>
> The current byte based grant head patch set does not address this;
> the attempt I made to fix it didn't work properly, so I split the
> rework of the throttling out of the byte based accounting patchset
> (which seems to work correctly without reworking
> xlog_grant_head_check()) and now I'm trying to address this race
> condition as a separate patchset...
Oh. Heh. Yeah.
--D
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
next prev parent reply other threads:[~2023-08-27 16:05 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-07-18 22:57 Question: reserve log space at IO time for recover Wengang Wang
2023-07-19 0:11 ` Dave Chinner
2023-07-19 1:44 ` Darrick J. Wong
2023-07-19 6:25 ` Dave Chinner
2023-07-21 19:36 ` Wengang Wang
2023-07-24 0:57 ` Dave Chinner
2023-07-24 18:03 ` Wengang Wang
2023-07-26 4:08 ` Dave Chinner
2023-07-26 15:23 ` Darrick J. Wong
2023-07-27 1:05 ` Dave Chinner
2023-07-28 17:56 ` Wengang Wang
2023-08-18 3:25 ` Wengang Wang
2023-08-21 22:06 ` Wengang Wang
2023-08-24 5:05 ` Darrick J. Wong
2023-08-24 22:55 ` Wengang Wang
2023-08-24 4:52 ` Darrick J. Wong
2023-08-24 7:28 ` Dave Chinner
2023-08-24 22:01 ` Darrick J. Wong
2023-08-26 3:37 ` Dave Chinner
2023-08-27 16:04 ` Darrick J. Wong [this message]
2023-08-24 23:53 ` Wengang Wang
2023-07-19 1:46 ` Wengang Wang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230827160430.GA28186@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
--cc=srikanth.c.s@oracle.com \
--cc=wen.gang.wang@oracle.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.