From: Eric Sandeen <sandeen@sandeen.net>
To: Dave Chinner <david@fromorbit.com>
Cc: xfs@oss.sgi.com
Subject: Re: [PATCH 03/19] xfs: prevent deadlock trying to cover an active log
Date: Thu, 17 Oct 2013 10:54:29 -0500 [thread overview]
Message-ID: <52600835.9010802@sandeen.net> (raw)
In-Reply-To: <1381789085-21923-4-git-send-email-david@fromorbit.com>
On 10/14/13 5:17 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Recent analysis of a deadlocked XFS filesystem from a kernel
> crash dump indicated that the filesystem was stuck waiting for log
> space. The short story of the hang on the RHEL6 kernel is this:
>
> - the tail of the log is pinned by an inode
> - the inode has been pushed by the xfsaild
> - the inode has been flushed to it's backing buffer and is
> currently flush locked and hence waiting for backing
> buffer IO to complete and remove it from the AIL
> - the backing buffer is marked for write - it is on the
> delayed write queue
> - the inode buffer has been modified directly and logged
> recently due to unlinked inode list modification
> - the backing buffer is pinned in memory as it is in the
> active CIL context.
> - the xfsbufd won't start buffer writeback because it is
> pinned
> - xfssyncd won't force the log because it sees the log as
> needing to be covered and hence wants to issue a dummy
> transaction to move the log covering state machine along.
>
> Hence there is no trigger to force the CIL to the log and hence
> unpin the inode buffer and therefore complete the inode IO, remove
> it from the AIL and hence move the tail of the log along, allowing
> transactions to start again.
>
> Mainline kernels also have the same deadlock, though the signature
> is slightly different - the inode buffer never reaches the delayed
> write lists because xfs_buf_item_push() sees that it is pinned and
> hence never adds it to the delayed write list that the xfsaild
> flushes.
>
> There are two possible solutions here. The first is to simply force
> the log before trying to cover the log and so ensure that the CIL is
> emptied before we try to reserve space for the dummy transaction in
> the xfs_log_worker(). While this might work most of the time, it is
> still racy and is no guarantee that we don't get stuck in
> xfs_trans_reserve waiting for log space to come free. Hence it's not
> the best way to solve the problem.
>
> The second solution is to modify xfs_log_need_covered() to be aware
> of the CIL. We only should be attempting to cover the log if there
> is no current activity in the log - covering the log is the process
> of ensuring that the head and tail in the log on disk are identical
> (i.e. the log is clean and at idle). Hence, by definition, if there
> are items in the CIL then the log is not at idle and so we don't
> need to attempt to cover it.
>
> When we don't need to cover the log because it is active or idle, we
> issue a log force from xfs_log_worker() - if the log is idle, then
> this does nothing. However, if the log is active due to there being
> items in the CIL, it will force the items in the CIL to the log and
> unpin them.
>
> In the case of the above deadlock scenario, instead of
> xfs_log_worker() getting stuck in xfs_trans_reserve() attempting to
> cover the log, it will instead force the log, thereby unpinning the
> inode buffer, allowing IO to be issued and complete and hence
> removing the inode that was pinning the tail of the log from the
> AIL. At that point, everything will start moving along again. i.e.
> the xfs_log_worker turns back into a watchdog that can alleviate
> deadlocks based around pinned items that prevent the tail of the log
> from being moved...
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
> ---
> fs/xfs/xfs_log.c | 48 +++++++++++++++++++++++++++++-------------------
> fs/xfs/xfs_log_cil.c | 14 ++++++++++++++
> fs/xfs/xfs_log_priv.h | 10 ++++------
> 3 files changed, 47 insertions(+), 25 deletions(-)
>
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index a2dea108..613ed94 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1000,27 +1000,34 @@ xfs_log_space_wake(
> }
>
> /*
> - * Determine if we have a transaction that has gone to disk
> - * that needs to be covered. To begin the transition to the idle state
> - * firstly the log needs to be idle (no AIL and nothing in the iclogs).
> - * If we are then in a state where covering is needed, the caller is informed
> - * that dummy transactions are required to move the log into the idle state.
> + * Determine if we have a transaction that has gone to disk that needs to be
> + * covered. To begin the transition to the idle state firstly the log needs to
> + * be idle. That means the CIL, the AIL and the iclogs needs to be empty before
> + * we start attempting to cover the log.
> *
> - * Because this is called as part of the sync process, we should also indicate
> - * that dummy transactions should be issued in anything but the covered or
> - * idle states. This ensures that the log tail is accurately reflected in
> - * the log at the end of the sync, hence if a crash occurrs avoids replay
> - * of transactions where the metadata is already on disk.
> + * Only if we are then in a state where covering is needed, the caller is
> + * informed that dummy transactions are required to move the log into the idle
> + * state.
> + *
> + * If there are any items in the AIl or CIL, then we do not want to attempt to
> + * cover the log as we may be in a situation where there isn't log space
> + * available to run a dummy transaction and this can lead to deadlocks when the
> + * tail of the log is pinned by an item that is modified in the CIL. Hence
> + * there's no point in running a dummy transaction at this point because we
> + * can't start trying to idle the log until both the CIL and AIL are empty.
> */
> int
> xfs_log_need_covered(xfs_mount_t *mp)
> {
> - int needed = 0;
> struct xlog *log = mp->m_log;
> + int needed = 0;
>
> if (!xfs_fs_writable(mp))
> return 0;
>
> + if (!xlog_cil_empty(log))
> + return 0;
> +
> spin_lock(&log->l_icloglock);
> switch (log->l_covered_state) {
> case XLOG_STATE_COVER_DONE:
> @@ -1029,14 +1036,17 @@ xfs_log_need_covered(xfs_mount_t *mp)
> break;
> case XLOG_STATE_COVER_NEED:
> case XLOG_STATE_COVER_NEED2:
> - if (!xfs_ail_min_lsn(log->l_ailp) &&
> - xlog_iclogs_empty(log)) {
> - if (log->l_covered_state == XLOG_STATE_COVER_NEED)
> - log->l_covered_state = XLOG_STATE_COVER_DONE;
> - else
> - log->l_covered_state = XLOG_STATE_COVER_DONE2;
> - }
> - /* FALLTHRU */
> + if (xfs_ail_min_lsn(log->l_ailp))
> + break;
> + if (!xlog_iclogs_empty(log))
> + break;
> +
> + needed = 1;
> + if (log->l_covered_state == XLOG_STATE_COVER_NEED)
> + log->l_covered_state = XLOG_STATE_COVER_DONE;
> + else
> + log->l_covered_state = XLOG_STATE_COVER_DONE2;
> + break;
> default:
> needed = 1;
> break;
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index cfe9797..da8524e77 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -711,6 +711,20 @@ xlog_cil_push_foreground(
> xlog_cil_push(log);
> }
>
> +bool
> +xlog_cil_empty(
> + struct xlog *log)
> +{
> + struct xfs_cil *cil = log->l_cilp;
> + bool empty = false;
> +
> + spin_lock(&cil->xc_push_lock);
> + if (list_empty(&cil->xc_cil))
> + empty = true;
> + spin_unlock(&cil->xc_push_lock);
> + return empty;
> +}
> +
> /*
> * Commit a transaction with the given vector to the Committed Item List.
> *
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 136654b..f80cff2 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -514,12 +514,10 @@ xlog_assign_grant_head(atomic64_t *head, int cycle, int space)
> /*
> * Committed Item List interfaces
> */
> -int
> -xlog_cil_init(struct xlog *log);
> -void
> -xlog_cil_init_post_recovery(struct xlog *log);
> -void
> -xlog_cil_destroy(struct xlog *log);
> +int xlog_cil_init(struct xlog *log);
> +void xlog_cil_init_post_recovery(struct xlog *log);
> +void xlog_cil_destroy(struct xlog *log);
> +bool xlog_cil_empty(struct xlog *log);
>
> /*
> * CIL force routines
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
next prev parent reply other threads:[~2013-10-17 15:54 UTC|newest]
Thread overview: 54+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-10-14 22:17 [PATCH 00/19 V2] xfs: patches for 3.13 Dave Chinner
2013-10-14 22:17 ` [PATCH 01/19] xfs: xfs_remove deadlocks due to inverted AGF vs AGI lock ordering Dave Chinner
2013-10-14 22:17 ` [PATCH 02/19] xfs: open code inc_inode_iversion when logging an inode Dave Chinner
2013-10-14 22:17 ` [PATCH 03/19] xfs: prevent deadlock trying to cover an active log Dave Chinner
2013-10-17 15:54 ` Eric Sandeen [this message]
2013-10-17 18:51 ` Ben Myers
2013-10-14 22:17 ` [PATCH 04/19] xfs: create a shared header file for format-related information Dave Chinner
2013-10-22 23:12 ` Ben Myers
2013-10-22 23:36 ` [PATCH 04/19, V2] " Dave Chinner
2013-10-14 22:17 ` [PATCH 05/19] xfs: unify directory/attribute format definitions Dave Chinner
2013-10-21 23:11 ` Ben Myers
2013-10-21 23:33 ` Dave Chinner
2013-10-22 14:57 ` Christoph Hellwig
2013-10-14 22:17 ` [PATCH 06/19] xfs: split dquot buffer operations out Dave Chinner
2013-10-14 22:17 ` [PATCH 07/19] xfs: remove unused transaction callback variables Dave Chinner
2013-10-22 14:25 ` Ben Myers
2013-10-14 22:17 ` [PATCH 08/19] xfs: decouple log and transaction headers Dave Chinner
2013-10-22 17:16 ` Ben Myers
2013-10-22 23:50 ` [PATCH 08/19, V2] " Dave Chinner
2013-10-14 22:17 ` [PATCH 09/19] xfs: decouple inode and bmap btree header files Dave Chinner
2013-10-22 18:29 ` Ben Myers
2013-10-22 23:40 ` [PATCH 09/19, V2] " Dave Chinner
2013-10-22 23:48 ` Dave Chinner
2013-10-22 23:51 ` [PATCH 09/19, V3] " Dave Chinner
2013-10-14 22:17 ` [PATCH 10/19] xfs: split xfs_rtalloc.c for userspace sanity Dave Chinner
2013-10-14 22:17 ` [PATCH 11/19] xfs: abstract the differences in dir2/dir3 via an ops vector Dave Chinner
2013-10-14 22:17 ` [PATCH 12/19] xfs: vectorise remaining shortform dir2 ops Dave Chinner
2013-10-14 22:17 ` [PATCH 13/19] xfs: vectorise directory data operations Dave Chinner
2013-10-24 18:39 ` Ben Myers
2013-10-24 21:31 ` Dave Chinner
2013-10-24 21:41 ` Ben Myers
2013-10-24 22:08 ` Dave Chinner
2013-10-24 22:28 ` Ben Myers
2013-10-27 23:01 ` Dave Chinner
2013-10-14 22:18 ` [PATCH 14/19] xfs: vectorise directory data operations part 2 Dave Chinner
2013-10-14 22:18 ` [PATCH 15/19] xfs: vectorise directory leaf operations Dave Chinner
2013-10-20 23:25 ` [PATCH 15/19, V2] " Dave Chinner
2013-10-14 22:18 ` [PATCH 16/19] xfs: vectorise DA btree operations Dave Chinner
2013-10-25 17:17 ` Ben Myers
2013-10-14 22:18 ` [PATCH 17/19] xfs: vectorise encoding/decoding directory headers Dave Chinner
2013-10-25 18:46 ` Ben Myers
2013-10-25 19:20 ` Ben Myers
2013-10-14 22:18 ` [PATCH 18/19] xfs: vectorise directory leaf operations Dave Chinner
2013-10-25 20:18 ` Ben Myers
2013-10-25 21:51 ` Ben Myers
2013-10-14 22:18 ` [PATCH 19/19] xfs: convert directory vector functions to constants Dave Chinner
2013-10-16 21:16 ` Ben Myers
2013-10-16 22:23 ` Dave Chinner
2013-10-16 22:52 ` Ben Myers
2013-10-16 23:06 ` Ben Myers
2013-10-17 0:17 ` Dave Chinner
2013-10-23 22:23 ` Ben Myers
2013-10-25 21:47 ` Ben Myers
2013-10-23 22:23 ` [PATCH 00/19 V2] xfs: patches for 3.13 Ben Myers
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=52600835.9010802@sandeen.net \
--to=sandeen@sandeen.net \
--cc=david@fromorbit.com \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.