From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, bfoster@redhat.com
Subject: Re: [PATCH 3/3] xfs: periodically relog deferred intent items
Date: Sun, 27 Sep 2020 16:30:25 -0700
Message-ID: <20200927233025.GA49547@magnolia>
In-Reply-To: <20200927230823.GA14422@dread.disaster.area>

On Mon, Sep 28, 2020 at 09:08:23AM +1000, Dave Chinner wrote:
> On Tue, Sep 22, 2020 at 10:33:19PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > There's a subtle design flaw in the deferred log item code that can lead
> > to pinning the log tail. Taking up the defer ops chain examples from
> > the previous commit, we can get trapped in sequences like this:
> >
> > Caller hands us a transaction t0 with D0-D3 attached. The defer ops
> > chain will look like the following if the transaction rolls succeed:
> >
> > t1: D0(t0), D1(t0), D2(t0), D3(t0)
> > t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
> > t3: d5(t1), D1(t0), D2(t0), D3(t0)
> > ...
> > t9: d9(t7), D3(t0)
> > t10: D3(t0)
> > t11: d10(t10), d11(t10)
> > t12: d11(t10)
> >
> > In transaction 9, we finish d9 and try to roll to t10 while holding onto
> > an intent item for D3 that we logged in t0.
> >
> > The previous commit changed the order in which we place new defer ops in
> > the defer ops processing chain to reduce the maximum chain length. Now
> > make xfs_defer_finish_noroll capable of relogging the entire chain
> > periodically so that we can always move the log tail forward. Most
> > chains will never get relogged, except for operations that generate very
> > long chains (large extents containing many blocks with different sharing
> > levels) or are on filesystems with small logs and a lot of ongoing
> > metadata updates.
> >
> > Callers are now required to ensure that the transaction reservation is
> > large enough to handle logging done items and new intent items for the
> > maximum possible chain length. Most callers are careful to keep the
> > chain lengths low, so the overhead should be minimal.
> >
> > The decision to relog an intent item is made based on whether or not the
> > intent was added to the current checkpoint. If so, the checkpoint is
> > still open and there's no point in relogging. Otherwise, the old
> > checkpoint is closed and we relog the intent to add it to the current
> > one.
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> > fs/xfs/libxfs/xfs_defer.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
> > fs/xfs/xfs_bmap_item.c | 27 +++++++++++++++++++++++
> > fs/xfs/xfs_extfree_item.c | 29 +++++++++++++++++++++++++
> > fs/xfs/xfs_refcount_item.c | 27 +++++++++++++++++++++++
> > fs/xfs/xfs_rmap_item.c | 27 +++++++++++++++++++++++
> > fs/xfs/xfs_trace.h | 1 +
> > fs/xfs/xfs_trans.h | 10 ++++++++
> > 7 files changed, 173 insertions(+)
> >
> >
> > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > index 84a70edd0da1..c601cc2af254 100644
> > --- a/fs/xfs/libxfs/xfs_defer.c
> > +++ b/fs/xfs/libxfs/xfs_defer.c
> > @@ -17,6 +17,7 @@
> > #include "xfs_inode_item.h"
> > #include "xfs_trace.h"
> > #include "xfs_icache.h"
> > +#include "xfs_log.h"
> >
> > /*
> > * Deferred Operations in XFS
> > @@ -361,6 +362,52 @@ xfs_defer_cancel_list(
> > }
> > }
> >
> > +/*
> > + * Prevent a log intent item from pinning the tail of the log by logging a
> > + * done item to release the intent item; and then log a new intent item.
> > + * The caller should provide a fresh transaction and roll it after we're done.
> > + */
> > +static int
> > +xfs_defer_relog(
> > +	struct xfs_trans		**tpp,
> > +	struct list_head		*dfops)
> > +{
> > +	struct xfs_defer_pending	*dfp;
> > +	xfs_lsn_t			threshold_lsn;
> > +
> > +	ASSERT((*tpp)->t_flags & XFS_TRANS_PERM_LOG_RES);
> > +
> > +	/*
> > +	 * Figure out where we need the tail to be in order to maintain the
> > +	 * minimum required free space in the log.
> > +	 */
> > +	threshold_lsn = xlog_grant_push_threshold((*tpp)->t_mountp->m_log, 0);
> > +	if (threshold_lsn == NULLCOMMITLSN)
> > +		return 0;
>
> This smells of premature optimisation.
>
> When we are in a tail-pushing scenario (i.e. any sort of
> sustained metadata workload) this will always return true, and so we
> will relog every intent that isn't in the current checkpoint every
> time this is called. Under light load, we don't care if we add a
> little bit of relogging overhead as the CIL slowly flushes/pushes -
> it will have negligible impact on performance because there is little
> load on the journal.
>
> However, when we are under heavy load the code will now be reading
> the grant head and log position accounting variables during every
> commit, hence greatly increasing the number and temporal
> distribution of accesses to the hottest cachelines in the log. We
> currently never access these cache lines during commit unless the
> unit reservation has run out and we have to regrant physical log
> space for the transaction to continue (i.e. we are into slow path
> commit code). IOWs, this is like causing far more than double the
> number of accesses to the grant head, the log tail, the
> last_sync_lsn, etc, all of which is unnecessary exactly when we care
> about minimising contention on the log space accounting variables...
>
> Given that it is a redundant check under heavy journal load
> when access to the log grant/head/tail are already contended,
> I think we should just be checking the "in current checkpoint" logic
> and not making it conditional on the log being near full.

<nod> FWIW I broke this patch up again into a first part that only
relogs if the checkpoints don't match, and a second part that adds the
LSN push target check, to see if I could observe any difference.

Across a ~4h fstests run I noticed that there were ~20% fewer relogs,
but OTOH the total runtime didn't change noticeably. I kind of wondered
if the increased cacheline contention would at least slow down the
frontend a bit to give the log a chance to push things out, but haven't
had time to dig any further than "ran fstests, recorded runtimes and
grep | wc -l'd the ftrace log".
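
For illustration only, the difference between the two parts can be
sketched as a small standalone userspace C model. This is not the
kernel code: struct model_intent, MODEL_NO_THRESHOLD and both functions
are hypothetical names invented for this sketch. The sentinel merely
stands in for the NULLCOMMITLSN early return quoted above, and the
model covers only that visible gate.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for NULLCOMMITLSN: "no tail push is needed". */
#define MODEL_NO_THRESHOLD	((int64_t)-1)

/* Hypothetical model of a pending intent item; not a kernel type. */
struct model_intent {
	uint64_t	ckpt_seq;	/* checkpoint it was last logged in */
};

/* Part 1: relog every intent that is not in the current checkpoint. */
static size_t
model_relog_count(
	const struct model_intent	*items,
	size_t				nr,
	uint64_t			cur_seq)
{
	size_t				i, relogged = 0;

	for (i = 0; i < nr; i++) {
		/* Still in the open checkpoint: relogging gains nothing. */
		if (items[i].ckpt_seq == cur_seq)
			continue;
		relogged++;
	}
	return relogged;
}

/* Part 2: same walk, but skipped entirely when no push threshold exists. */
static size_t
model_relog_count_gated(
	const struct model_intent	*items,
	size_t				nr,
	uint64_t			cur_seq,
	int64_t				threshold_lsn)
{
	if (threshold_lsn == MODEL_NO_THRESHOLD)
		return 0;
	return model_relog_count(items, nr, cur_seq);
}
```

With a chain whose intents were logged in checkpoints 1, 1, 3 and 3 and
a current checkpoint of 3, the first function counts two relogs. The
gated one counts zero when passed MODEL_NO_THRESHOLD and two otherwise.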

Anyway, I was about to resend with all these patches rebased against
something resembling the 5.10 branch, so expect to see this broken out a
bit.

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com