From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 04/11] xfs: fix ordering violation between cache flushes and tail updates
Date: Tue, 27 Jul 2021 11:50:25 -0700 [thread overview]
Message-ID: <20210727185025.GF559212@magnolia> (raw)
In-Reply-To: <20210727071012.3358033-5-david@fromorbit.com>
On Tue, Jul 27, 2021 at 05:10:05PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> There is a race between the new CIL async data device metadata IO
> completion cache flush and the log tail in the iclog the flush
> covers being updated. This can be seen by repeating generic/482 in a
> loop and eventually log recovery fails with a failures such as this:
Looks good, can't remember why I didn't do this last night:
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> XFS (dm-3): Starting recovery (logdev: internal)
> XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0)
> XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify
> XFS (dm-3): Unmount and run xfs_repair
> XFS (dm-3): First 128 bytes of corrupted metadata buffer:
> 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117
>
> Analysis of the logwrite replay shows that there were no writes to
> the data device between the FUA @ write 124 and the FUA at write @
> 125, but log recovery @ 125 failed. The difference was the one log
> write @ 125 moved the tail of the log forwards from (1,8) to (1,32)
> and so the inode create intent in (1,8) was not replayed and so the
> inode cluster was zero on disk when replay of the first inode item
> in (1,32) was attempted.
>
> What this meant was that the journal write that occurred at @ 125
> did not ensure that metadata completed before the iclog was written
> was correctly on stable storage. The tail of the log moved forward,
> so IO must have been completed between the two iclog writes. This
> means that there is a race condition between the unconditional async
> cache flush in the CIL push work and the tail LSN that is written to
> the iclog. This happens like so:
>
> CIL push work AIL push work
> ------------- -------------
> Add to committing list
> start async data dev cache flush
> .....
> <flush completes>
> <all writes to old tail lsn are stable>
> xlog_write
> .... push inode create buffer
> <start IO>
> .....
> xlog_write(commit record)
> .... <IO completes>
> log tail moves
> xlog_assign_tail_lsn()
> start_lsn == commit_lsn
> <no iclog preflush!>
> xlog_state_release_iclog
> __xlog_state_release_iclog()
> <writes *new* tail_lsn into iclog>
> xlog_sync()
> ....
> submit_bio()
> <tail in log moves forward without flushing written metadata>
>
> Essentially, this can only occur if the commit iclog is issued
> without a cache flush. If the iclog bio is submitted with
> REQ_PREFLUSH, then it will guarantee that all the completed IO is
> one stable storage before the iclog bio with the new tail LSN in it
> is written to the log.
>
> IOWs, the tail lsn that is written to the iclog needs to be sampled
> *before* we issue the cache flush that guarantees all IO up to that
> LSN has been completed.
>
> To fix this without giving up the performance advantage of the
> flush/FUA optimisations (e.g. g/482 runtime halves with 5.14-rc1
> compared to 5.13), we need to ensure that we always issue a cache
> flush if the tail LSN changes between the initial async flush and
> the commit record being written. THis requires sampling the tail_lsn
> before we start the flush, and then passing the sampled tail LSN to
> xlog_state_release_iclog() so it can determine if the the tail LSN
> has changed while writing the checkpoint. If the tail LSN has
> changed, then it needs to set the NEED_FLUSH flag on the iclog and
> we'll issue another cache flush before writing the iclog.
>
> Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/xfs/xfs_log.c | 36 ++++++++++++++++++++++++++----------
> fs/xfs/xfs_log_cil.c | 13 +++++++++++--
> fs/xfs/xfs_log_priv.h | 3 ++-
> 3 files changed, 39 insertions(+), 13 deletions(-)
>
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 82f5996d3889..e8c6c96d4f7c 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -489,12 +489,17 @@ xfs_log_reserve(
>
> /*
> * Flush iclog to disk if this is the last reference to the given iclog and the
> - * it is in the WANT_SYNC state.
> + * it is in the WANT_SYNC state. If the caller passes in a non-zero
> + * @old_tail_lsn and the current log tail does not match, there may be metadata
> + * on disk that must be persisted before this iclog is written. To satisfy that
> + * requirement, set the XLOG_ICL_NEED_FLUSH flag as a condition for writing this
> + * iclog with the new log tail value.
> */
> int
> xlog_state_release_iclog(
> struct xlog *log,
> - struct xlog_in_core *iclog)
> + struct xlog_in_core *iclog,
> + xfs_lsn_t old_tail_lsn)
> {
> xfs_lsn_t tail_lsn;
> lockdep_assert_held(&log->l_icloglock);
> @@ -503,6 +508,19 @@ xlog_state_release_iclog(
> if (iclog->ic_state == XLOG_STATE_IOERROR)
> return -EIO;
>
> + /*
> + * Grabbing the current log tail needs to be atomic w.r.t. the writing
> + * of the tail LSN into the iclog so we guarantee that the log tail does
> + * not move between deciding if a cache flush is required and writing
> + * the LSN into the iclog below.
> + */
> + if (old_tail_lsn || iclog->ic_state == XLOG_STATE_WANT_SYNC) {
> + tail_lsn = xlog_assign_tail_lsn(log->l_mp);
> +
> + if (old_tail_lsn && tail_lsn != old_tail_lsn)
> + iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
> + }
> +
> if (!atomic_dec_and_test(&iclog->ic_refcnt))
> return 0;
>
> @@ -511,8 +529,6 @@ xlog_state_release_iclog(
> return 0;
> }
>
> - /* update tail before writing to iclog */
> - tail_lsn = xlog_assign_tail_lsn(log->l_mp);
> iclog->ic_state = XLOG_STATE_SYNCING;
> iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
> xlog_verify_tail_lsn(log, iclog, tail_lsn);
> @@ -858,7 +874,7 @@ xlog_unmount_write(
> * iclog containing the unmount record is written.
> */
> iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
> - error = xlog_state_release_iclog(log, iclog);
> + error = xlog_state_release_iclog(log, iclog, 0);
> xlog_wait_on_iclog(iclog);
>
> if (tic) {
> @@ -2302,7 +2318,7 @@ xlog_write_copy_finish(
> return 0;
>
> release_iclog:
> - error = xlog_state_release_iclog(log, iclog);
> + error = xlog_state_release_iclog(log, iclog, 0);
> spin_unlock(&log->l_icloglock);
> return error;
> }
> @@ -2521,7 +2537,7 @@ xlog_write(
> ASSERT(optype & XLOG_COMMIT_TRANS);
> *commit_iclog = iclog;
> } else {
> - error = xlog_state_release_iclog(log, iclog);
> + error = xlog_state_release_iclog(log, iclog, 0);
> }
> spin_unlock(&log->l_icloglock);
>
> @@ -2959,7 +2975,7 @@ xlog_state_get_iclog_space(
> * reference to the iclog.
> */
> if (!atomic_add_unless(&iclog->ic_refcnt, -1, 1))
> - error = xlog_state_release_iclog(log, iclog);
> + error = xlog_state_release_iclog(log, iclog, 0);
> spin_unlock(&log->l_icloglock);
> if (error)
> return error;
> @@ -3195,7 +3211,7 @@ xfs_log_force(
> atomic_inc(&iclog->ic_refcnt);
> lsn = be64_to_cpu(iclog->ic_header.h_lsn);
> xlog_state_switch_iclogs(log, iclog, 0);
> - if (xlog_state_release_iclog(log, iclog))
> + if (xlog_state_release_iclog(log, iclog, 0))
> goto out_error;
>
> if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn)
> @@ -3275,7 +3291,7 @@ xlog_force_lsn(
> }
> atomic_inc(&iclog->ic_refcnt);
> xlog_state_switch_iclogs(log, iclog, 0);
> - if (xlog_state_release_iclog(log, iclog))
> + if (xlog_state_release_iclog(log, iclog, 0))
> goto out_error;
> if (log_flushed)
> *log_flushed = 1;
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b128aaa9b870..4c44bc3786c0 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -654,8 +654,9 @@ xlog_cil_push_work(
> struct xfs_trans_header thdr;
> struct xfs_log_iovec lhdr;
> struct xfs_log_vec lvhdr = { NULL };
> + xfs_lsn_t preflush_tail_lsn;
> xfs_lsn_t commit_lsn;
> - xfs_lsn_t push_seq;
> + xfs_csn_t push_seq;
> struct bio bio;
> DECLARE_COMPLETION_ONSTACK(bdev_flush);
>
> @@ -730,7 +731,15 @@ xlog_cil_push_work(
> * because we hold the flush lock exclusively. Hence we can now issue
> * a cache flush to ensure all the completed metadata in the journal we
> * are about to overwrite is on stable storage.
> + *
> + * Because we are issuing this cache flush before we've written the
> + * tail lsn to the iclog, we can have metadata IO completions move the
> + * tail forwards between the completion of this flush and the iclog
> + * being written. In this case, we need to re-issue the cache flush
> + * before the iclog write. To detect whether the log tail moves, sample
> + * the tail LSN *before* we issue the flush.
> */
> + preflush_tail_lsn = atomic64_read(&log->l_tail_lsn);
> xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
> &bdev_flush);
>
> @@ -941,7 +950,7 @@ xlog_cil_push_work(
> * storage.
> */
> commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA;
> - xlog_state_release_iclog(log, commit_iclog);
> + xlog_state_release_iclog(log, commit_iclog, preflush_tail_lsn);
> spin_unlock(&log->l_icloglock);
> return;
>
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 4c41bbfa33b0..7cbde0b4f990 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -497,7 +497,8 @@ int xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
> void xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
> void xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
>
> -int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog);
> +int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog,
> + xfs_lsn_t log_tail_lsn);
>
> /*
> * When we crack an atomic LSN, we sample it first so that the value will not
> --
> 2.31.1
>
next prev parent reply other threads:[~2021-07-27 18:50 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-07-27 7:10 [PATCH 0/11 v3] xfs: fix log cache flush regressions and bugs Dave Chinner
2021-07-27 7:10 ` [PATCH 01/11] xfs: flush data dev on external log write Dave Chinner
2021-07-27 7:10 ` [PATCH 02/11] xfs: external logs need to flush data device Dave Chinner
2021-07-27 7:10 ` [PATCH 03/11] xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog Dave Chinner
2021-07-27 7:10 ` [PATCH 04/11] xfs: fix ordering violation between cache flushes and tail updates Dave Chinner
2021-07-27 18:50 ` Darrick J. Wong [this message]
2021-07-27 7:10 ` [PATCH 05/11] xfs: factor out forced iclog flushes Dave Chinner
2021-07-27 7:10 ` [PATCH 06/11] xfs: log forces imply data device cache flushes Dave Chinner
2021-07-27 7:10 ` [PATCH 07/11] xfs: avoid unnecessary waits in xfs_log_force_lsn() Dave Chinner
2021-07-27 7:10 ` [PATCH 08/11] xfs: logging the on disk inode LSN can make it go backwards Dave Chinner
2021-07-27 19:00 ` Darrick J. Wong
2023-06-13 6:20 ` Chandan Babu R
2023-06-14 9:13 ` Dave Chinner
2021-07-27 7:10 ` [PATCH 09/11] xfs: Enforce attr3 buffer recovery order Dave Chinner
2021-07-27 7:10 ` [PATCH 10/11] xfs: need to see iclog flags in tracing Dave Chinner
2021-07-27 7:10 ` [PATCH 11/11] xfs: limit iclog tail updates Dave Chinner
2021-07-27 21:25 ` Darrick J. Wong
2021-07-27 22:13 ` Dave Chinner
2021-07-27 23:44 ` [PATCH 11/11 v2] " Dave Chinner
2021-07-29 16:28 ` Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210727185025.GF559212@magnolia \
--to=djwong@kernel.org \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox