From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 11/11 v2] xfs: limit iclog tail updates
Date: Thu, 29 Jul 2021 09:28:47 -0700 [thread overview]
Message-ID: <20210729162847.GK3601443@magnolia> (raw)
In-Reply-To: <20210727234410.GY664593@dread.disaster.area>
On Wed, Jul 28, 2021 at 09:44:10AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> From the department of "generic/482 keeps on giving", we bring you
> another tail update race condition:
>
> iclog:
> S1 C1
> +-----------------------+-----------------------+
> S2 EOIC
>
> Two checkpoints in a single iclog. One is complete, the other just
> contains the start record and overruns into a new iclog.
>
> Timeline:
>
> Before S1: Cache flush, log tail = X
> At S1: Metadata stable, write start record and checkpoint
> At C1: Write commit record, set NEED_FUA
> Single iclog checkpoint, so no need for NEED_FLUSH
> Log tail still = X, so no need for NEED_FLUSH
>
> After C1,
> Before S2: Cache flush, log tail = X
> At S2: Metadata stable, write start record and checkpoint
> After S2: Log tail moves to X+1
> At EOIC: End of iclog, more journal data to write
> Releases iclog
> Not a commit iclog, so no need for NEED_FLUSH
> Writes log tail X+1 into iclog.
>
> At this point, the iclog has tail X+1 and NEED_FUA set. There has
> been no cache flush for the metadata between X and X+1, and the
> iclog writes the new tail permanently to the log. THis is sufficient
> to violate on disk metadata/journal ordering.
>
> We have two options here. The first is to detect this case in some
> manner and ensure that the partial checkpoint write sets NEED_FLUSH
> when the iclog is already marked NEED_FUA and the log tail changes.
> This seems somewhat fragile and quite complex to get right, and it
> doesn't actually make it obvious what underlying problem it is
> actually addressing from reading the code.
>
> The second option seems much cleaner to me, because it is derived
> directly from the requirements of the C1 commit record in the iclog.
> That is, when we write this commit record to the iclog, we've
> guaranteed that the metadata/data ordering is correct for tail
> update purposes. Hence if we only write the log tail into the iclog
> for the *first* commit record rather than the log tail at the last
> release, we guarantee that the log tail does not move past where the
> the first commit record in the log expects it to be.
>
> IOWs, taking the first option means that replay of C1 becomes
> dependent on future operations doing the right thing, not just the
> C1 checkpoint itself doing the right thing. This makes log recovery
> almost impossible to reason about because now we have to take into
> account what might or might not have happened in the future when
> looking at checkpoints in the log rather than just having to
> reconstruct the past...
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Looks good now,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
> Version 2:
> - fix xlog_verify_tail_lsn() debug code to look at the iclog tail
> lsn rather than the current tail lsn that we might not have
> written into the iclog.
>
> fs/xfs/xfs_log.c | 50 +++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 37 insertions(+), 13 deletions(-)
>
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 1c328efdca66..60ac5fd63f1e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -78,13 +78,12 @@ xlog_verify_iclog(
> STATIC void
> xlog_verify_tail_lsn(
> struct xlog *log,
> - struct xlog_in_core *iclog,
> - xfs_lsn_t tail_lsn);
> + struct xlog_in_core *iclog);
> #else
> #define xlog_verify_dest_ptr(a,b)
> #define xlog_verify_grant_tail(a)
> #define xlog_verify_iclog(a,b,c)
> -#define xlog_verify_tail_lsn(a,b,c)
> +#define xlog_verify_tail_lsn(a,b)
> #endif
>
> STATIC int
> @@ -489,12 +488,31 @@ xfs_log_reserve(
>
> /*
> * Flush iclog to disk if this is the last reference to the given iclog and the
> - * it is in the WANT_SYNC state. If the caller passes in a non-zero
> - * @old_tail_lsn and the current log tail does not match, there may be metadata
> - * on disk that must be persisted before this iclog is written. To satisfy that
> - * requirement, set the XLOG_ICL_NEED_FLUSH flag as a condition for writing this
> - * iclog with the new log tail value.
> + * it is in the WANT_SYNC state.
> + *
> + * If the caller passes in a non-zero @old_tail_lsn and the current log tail
> + * does not match, there may be metadata on disk that must be persisted before
> + * this iclog is written. To satisfy that requirement, set the
> + * XLOG_ICL_NEED_FLUSH flag as a condition for writing this iclog with the new
> + * log tail value.
> + *
> + * If XLOG_ICL_NEED_FUA is already set on the iclog, we need to ensure that the
> + * log tail is updated correctly. NEED_FUA indicates that the iclog will be
> + * written to stable storage, and implies that a commit record is contained
> + * within the iclog. We need to ensure that the log tail does not move beyond
> + * the tail that the first commit record in the iclog ordered against, otherwise
> + * correct recovery of that checkpoint becomes dependent on future operations
> + * performed on this iclog.
> + *
> + * Hence if NEED_FUA is set and the current iclog tail lsn is empty, write the
> + * current tail into iclog. Once the iclog tail is set, future operations must
> + * not modify it, otherwise they potentially violate ordering constraints for
> + * the checkpoint commit that wrote the initial tail lsn value. The tail lsn in
> + * the iclog will get zeroed on activation of the iclog after sync, so we
> + * always capture the tail lsn on the iclog on the first NEED_FUA release
> + * regardless of the number of active reference counts on this iclog.
> */
> +
> int
> xlog_state_release_iclog(
> struct xlog *log,
> @@ -519,6 +537,10 @@ xlog_state_release_iclog(
>
> if (old_tail_lsn && tail_lsn != old_tail_lsn)
> iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
> +
> + if ((iclog->ic_flags & XLOG_ICL_NEED_FUA) &&
> + !iclog->ic_header.h_tail_lsn)
> + iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
> }
>
> if (!atomic_dec_and_test(&iclog->ic_refcnt))
> @@ -530,8 +552,9 @@ xlog_state_release_iclog(
> }
>
> iclog->ic_state = XLOG_STATE_SYNCING;
> - iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
> - xlog_verify_tail_lsn(log, iclog, tail_lsn);
> + if (!iclog->ic_header.h_tail_lsn)
> + iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
> + xlog_verify_tail_lsn(log, iclog);
> trace_xlog_iclog_syncing(iclog, _RET_IP_);
>
> spin_unlock(&log->l_icloglock);
> @@ -2579,6 +2602,7 @@ xlog_state_activate_iclog(
> memset(iclog->ic_header.h_cycle_data, 0,
> sizeof(iclog->ic_header.h_cycle_data));
> iclog->ic_header.h_lsn = 0;
> + iclog->ic_header.h_tail_lsn = 0;
> }
>
> /*
> @@ -3614,10 +3638,10 @@ xlog_verify_grant_tail(
> STATIC void
> xlog_verify_tail_lsn(
> struct xlog *log,
> - struct xlog_in_core *iclog,
> - xfs_lsn_t tail_lsn)
> + struct xlog_in_core *iclog)
> {
> - int blocks;
> + xfs_lsn_t tail_lsn = be64_to_cpu(iclog->ic_header.h_tail_lsn);
> + int blocks;
>
> if (CYCLE_LSN(tail_lsn) == log->l_prev_cycle) {
> blocks =
prev parent reply other threads:[~2021-07-29 16:28 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-07-27 7:10 [PATCH 0/11 v3] xfs: fix log cache flush regressions and bugs Dave Chinner
2021-07-27 7:10 ` [PATCH 01/11] xfs: flush data dev on external log write Dave Chinner
2021-07-27 7:10 ` [PATCH 02/11] xfs: external logs need to flush data device Dave Chinner
2021-07-27 7:10 ` [PATCH 03/11] xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog Dave Chinner
2021-07-27 7:10 ` [PATCH 04/11] xfs: fix ordering violation between cache flushes and tail updates Dave Chinner
2021-07-27 18:50 ` Darrick J. Wong
2021-07-27 7:10 ` [PATCH 05/11] xfs: factor out forced iclog flushes Dave Chinner
2021-07-27 7:10 ` [PATCH 06/11] xfs: log forces imply data device cache flushes Dave Chinner
2021-07-27 7:10 ` [PATCH 07/11] xfs: avoid unnecessary waits in xfs_log_force_lsn() Dave Chinner
2021-07-27 7:10 ` [PATCH 08/11] xfs: logging the on disk inode LSN can make it go backwards Dave Chinner
2021-07-27 19:00 ` Darrick J. Wong
2023-06-13 6:20 ` Chandan Babu R
2023-06-14 9:13 ` Dave Chinner
2021-07-27 7:10 ` [PATCH 09/11] xfs: Enforce attr3 buffer recovery order Dave Chinner
2021-07-27 7:10 ` [PATCH 10/11] xfs: need to see iclog flags in tracing Dave Chinner
2021-07-27 7:10 ` [PATCH 11/11] xfs: limit iclog tail updates Dave Chinner
2021-07-27 21:25 ` Darrick J. Wong
2021-07-27 22:13 ` Dave Chinner
2021-07-27 23:44 ` [PATCH 11/11 v2] " Dave Chinner
2021-07-29 16:28 ` Darrick J. Wong [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210729162847.GK3601443@magnolia \
--to=djwong@kernel.org \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox