From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 8/8] xfs: Fix CIL throttle hang when CIL space used going backwards
Date: Wed, 24 Feb 2021 13:18:10 -0800 [thread overview]
Message-ID: <20210224211810.GX7272@magnolia> (raw)
In-Reply-To: <20210223033442.3267258-9-david@fromorbit.com>
On Tue, Feb 23, 2021 at 02:34:42PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> A hang with tasks stuck on the CIL hard throttle was reported and
> largely diagnosed by Donald Buczek, who discovered that it was a
> result of the CIL context space usage decrementing in committed
> transactions once the hard throttle limit had been hit and processes
> were already blocked. This resulted in the CIL push not waking up
> those waiters because the CIL context was no longer over the hard
> throttle limit.
>
> The surprising aspect of this was the CIL space usage going
> backwards regularly enough to trigger this situation. Assumptions
> had been made in design that the relogging process would only
> increase the size of the objects in the CIL, and so that space would
> only increase.
>
> This change and commit message fixes the issue and documents the
> result of an audit of the triggers that can cause the CIL space to
> go backwards, how large the backwards steps tend to be, the
> frequency with which they occur, and what the impact on the CIL
> accounting code is.
>
> Even though the CIL ctx->space_used can go backwards, it will only
> do so if the log item is already logged to the CIL and contains a
> space reservation for its entire logged state. This is tracked by
> the shadow buffer state on the log item. If the item is not
> previously logged in the CIL it has no shadow buffer nor log vector,
> and hence the entire size of the logged item copied to the log
> vector is accounted to the CIL space usage. i.e. it will always go
> up in this case.
>
> If the item has a log vector (i.e. already in the CIL) and the size
> decreases, then the existing log vector will be overwritten and the
> space usage will go down. This is the only condition where the space
> usage reduces, and it can only occur when an item is already tracked
> in the CIL. Hence we are safe from CIL space usage underruns as a
> result of log items decreasing in size when they are relogged.
>
> Typically this reduction in CIL usage occurs from metadta blocks
"metadata"...
> being freed, such as when a btree block merge
> occurs or a directory entry/xattr entry is removed and the da-tree
> is reduced in size. This generally results in a reduction in size of
> around a single block in the CIL, but also tends to increase the
> number of log vectors because the parent and sibling nodes in the
> tree need to be updated when a btree block is removed. If a
> multi-level merge occurs, then we see a reduction in size of 2+
> blocks, but again the log vector count goes up.
>
> The other vector is inode fork size changes, which only log the
> current size of the fork and ignore the previously logged size when
> the fork is relogged. Hence if we are removing items from the inode
> fork (dir/xattr removal in shortform, extent record removal in
> extent form, etc) the relogged size of the inode fork can decrease.
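The distinction between the two relogging behaviours, as a toy model (invented names, not XFS code):

```c
#include <stddef.h>

/*
 * Most log items track dirty regions cumulatively: relogging a smaller
 * region still keeps the previously logged regions, so the logged size
 * never shrinks. Modelled here as a simple max().
 */
static size_t relog_size_region_tracking(size_t prev_logged, size_t new_logged)
{
	return new_logged > prev_logged ? new_logged : prev_logged;
}

/*
 * Inode forks log only the current fork size and ignore the previously
 * logged size, so the relogged size can go down.
 */
static size_t relog_size_inode_fork(size_t prev_logged, size_t cur_fork_size)
{
	(void)prev_logged;	/* previously logged size is ignored */
	return cur_fork_size;
}
```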
>
> No other log items can decrease in size either because they are a
> fixed size (e.g. dquots) or they cannot be relogged (e.g. relogging
> an intent actually creates a new intent log item and doesn't relog
> the old item at all.) Hence the only two vectors for CIL context
> size reduction are relogging inode forks and marking buffers active
> in the CIL as stale.
>
> Long story short: the majority of the code does the right thing and
> handles the reduction in log item size correctly, and only the CIL
> hard throttle implementation is problematic and needs fixing. This
> patch makes that fix, as well as adds comments in the log item code
> that result in items shrinking in size when they are relogged as a
> clear reminder that this can and does happen frequently.
>
> The throttle fix is based upon the change Donald proposed, though it
> goes further to ensure that once the throttle is activated, it
> captures all tasks until the CIL push issues a wakeup, regardless of
> whether the CIL space used has gone back under the throttle
> threshold.
>
> This ensures that we prevent tasks from reducing the CIL usage
> slightly below the throttle threshold and then making more changes
> that push it
> well over the throttle limit. This is achieved by checking if the
> throttle wait queue is already active as a condition of throttling.
> Hence once we start throttling, we continue to apply the throttle
> until the CIL context push wakes everything on the wait queue.
>
> We can use waitqueue_active() for the waitqueue manipulations and
> checks as they are all done under the ctx->xc_push_lock. Hence the
> waitqueue has external serialisation and we can safely peek inside
> the wait queue without holding the internal waitqueue locks.
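In other words, the fixed throttle decision looks like this (a userspace sketch with invented names; waiters_pending stands in for waitqueue_active(&cil->xc_push_wait), which is only safe to check locklessly because callers hold xc_push_lock):

```c
#include <stdbool.h>

/* Toy model of the CIL hard-throttle decision after the fix. */
struct toy_cil {
	unsigned long	space_used;
	unsigned long	blocking_limit;	/* XLOG_CIL_BLOCKING_SPACE_LIMIT */
	bool		waiters_pending;	/* waitqueue_active() stand-in */
};

/*
 * Block the committing task if we are over the hard limit, or if the
 * throttle has already caught other tasks -- even when committed
 * transactions have since dipped space_used back under the limit.
 * Only the CIL push wakeup releases the throttle.
 */
static bool toy_should_throttle(const struct toy_cil *cil)
{
	return cil->space_used >= cil->blocking_limit ||
	       cil->waiters_pending;
}
```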
>
> Many thanks to Donald for his diagnostic and analysis work to
> isolate the cause of this hang.
>
> Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Does this whole series fix Donald's problem?
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Looks ok to me,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
> fs/xfs/xfs_buf_item.c | 37 ++++++++++++++++++-------------------
> fs/xfs/xfs_inode_item.c | 14 ++++++++++++++
> fs/xfs/xfs_log_cil.c | 22 +++++++++++++++++-----
> 3 files changed, 49 insertions(+), 24 deletions(-)
>
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index dc0be2a639cc..17960b1ce5ef 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -56,14 +56,12 @@ xfs_buf_log_format_size(
> }
>
> /*
> - * This returns the number of log iovecs needed to log the
> - * given buf log item.
> + * Return the number of log iovecs and space needed to log the given buf log
> + * item segment.
> *
> - * It calculates this as 1 iovec for the buf log format structure
> - * and 1 for each stretch of non-contiguous chunks to be logged.
> - * Contiguous chunks are logged in a single iovec.
> - *
> - * If the XFS_BLI_STALE flag has been set, then log nothing.
> + * It calculates this as 1 iovec for the buf log format structure and 1 for each
> + * stretch of non-contiguous chunks to be logged. Contiguous chunks are logged
> + * in a single iovec.
> */
> STATIC void
> xfs_buf_item_size_segment(
> @@ -119,11 +117,8 @@ xfs_buf_item_size_segment(
> }
>
> /*
> - * This returns the number of log iovecs needed to log the given buf log item.
> - *
> - * It calculates this as 1 iovec for the buf log format structure and 1 for each
> - * stretch of non-contiguous chunks to be logged. Contiguous chunks are logged
> - * in a single iovec.
> + * Return the number of log iovecs and space needed to log the given buf log
> + * item.
> *
> * Discontiguous buffers need a format structure per region that is being
> * logged. This makes the changes in the buffer appear to log recovery as though
> @@ -133,7 +128,11 @@ xfs_buf_item_size_segment(
> * what ends up on disk.
> *
> * If the XFS_BLI_STALE flag has been set, then log nothing but the buf log
> - * format structures.
> + * format structures. If the item has previously been logged and has dirty
> + * regions, we do not relog them in stale buffers. This has the effect of
> + * reducing the size of the relogged item by the amount of dirty data tracked
> + * by the log item. This can result in the committing transaction reducing the
> + * amount of space being consumed by the CIL.
> */
> STATIC void
> xfs_buf_item_size(
> @@ -147,9 +146,9 @@ xfs_buf_item_size(
> ASSERT(atomic_read(&bip->bli_refcount) > 0);
> if (bip->bli_flags & XFS_BLI_STALE) {
> /*
> - * The buffer is stale, so all we need to log
> - * is the buf log format structure with the
> - * cancel flag in it.
> + * The buffer is stale, so all we need to log is the buf log
> + * format structure with the cancel flag in it as we are never
> + * going to replay the changes tracked in the log item.
> */
> trace_xfs_buf_item_size_stale(bip);
> ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
> @@ -164,9 +163,9 @@ xfs_buf_item_size(
>
> if (bip->bli_flags & XFS_BLI_ORDERED) {
> /*
> - * The buffer has been logged just to order it.
> - * It is not being included in the transaction
> - * commit, so no vectors are used at all.
> + * The buffer has been logged just to order it. It is not being
> + * included in the transaction commit, so no vectors are used at
> + * all.
> */
> trace_xfs_buf_item_size_ordered(bip);
> *nvecs = XFS_LOG_VEC_ORDERED;
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 17e20a6d8b4e..6ff91e5bf3cd 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -28,6 +28,20 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
> return container_of(lip, struct xfs_inode_log_item, ili_item);
> }
>
> +/*
> + * The logged size of an inode fork is always the current size of the inode
> + * fork. This means that when an inode fork is relogged, the size of the logged
> + * region is determined by the current state, not the combination of the
> + * previously logged state + the current state. This is different relogging
> + * behaviour to most other log items which will retain the size of the
> + * previously logged changes when smaller regions are relogged.
> + *
> + * Hence operations that remove data from the inode fork (e.g. shortform
> + * dir/attr remove, extent form extent removal, etc), the size of the relogged
> + * inode gets -smaller- rather than stays the same size as the previously logged
> + * size and this can result in the committing transaction reducing the amount of
> + * space being consumed by the CIL.
> + */
> STATIC void
> xfs_inode_item_data_fork_size(
> struct xfs_inode_log_item *iip,
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 370da7c2bfc8..0a00c3c9610c 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -669,9 +669,14 @@ xlog_cil_push_work(
> ASSERT(push_seq <= ctx->sequence);
>
> /*
> - * Wake up any background push waiters now this context is being pushed.
> + * As we are about to switch to a new, empty CIL context, we no longer
> + * need to throttle tasks on CIL space overruns. Wake any waiters that
> + * the hard push throttle may have caught so they can start committing
> + * to the new context. The ctx->xc_push_lock provides the serialisation
> + * necessary for safely using the lockless waitqueue_active() check in
> + * this context.
> */
> - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> + if (waitqueue_active(&cil->xc_push_wait))
> wake_up_all(&cil->xc_push_wait);
>
> /*
> @@ -941,7 +946,7 @@ xlog_cil_push_background(
> ASSERT(!list_empty(&cil->xc_cil));
>
> /*
> - * don't do a background push if we haven't used up all the
> + * Don't do a background push if we haven't used up all the
> * space available yet.
> */
> if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
> @@ -965,9 +970,16 @@ xlog_cil_push_background(
>
> /*
> * If we are well over the space limit, throttle the work that is being
> - * done until the push work on this context has begun.
> + * done until the push work on this context has begun. Enforce the hard
> + * throttle on all transaction commits once it has been activated, even
> + * if the committing transactions have resulted in the space usage
> + * dipping back down under the hard limit.
> + *
> + * The ctx->xc_push_lock provides the serialisation necessary for safely
> + * using the lockless waitqueue_active() check in this context.
> */
> - if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> + if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) ||
> + waitqueue_active(&cil->xc_push_wait)) {
> trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
> ASSERT(cil->xc_ctx->space_used < log->l_logsize);
> xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
> --
> 2.28.0
>