public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 11/11 v2] xfs: limit iclog tail updates
Date: Thu, 29 Jul 2021 09:28:47 -0700	[thread overview]
Message-ID: <20210729162847.GK3601443@magnolia> (raw)
In-Reply-To: <20210727234410.GY664593@dread.disaster.area>

On Wed, Jul 28, 2021 at 09:44:10AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> From the department of "generic/482 keeps on giving", we bring you
> another tail update race condition:
> 
> iclog:
> 	S1			C1
> 	+-----------------------+-----------------------+
> 				 S2			EOIC
> 
> Two checkpoints in a single iclog. One is complete, the other just
> contains the start record and overruns into a new iclog.
> 
> Timeline:
> 
> Before S1:	Cache flush, log tail = X
> At S1:		Metadata stable, write start record and checkpoint
> At C1:		Write commit record, set NEED_FUA
> 		Single iclog checkpoint, so no need for NEED_FLUSH
> 		Log tail still = X, so no need for NEED_FLUSH
> 
> After C1,
> Before S2:	Cache flush, log tail = X
> At S2:		Metadata stable, write start record and checkpoint
> After S2:	Log tail moves to X+1
> At EOIC:	End of iclog, more journal data to write
> 		Releases iclog
> 		Not a commit iclog, so no need for NEED_FLUSH
> 		Writes log tail X+1 into iclog.
> 
> At this point, the iclog has tail X+1 and NEED_FUA set. There has
> been no cache flush for the metadata between X and X+1, and the
> iclog writes the new tail permanently to the log. THis is sufficient
> to violate on disk metadata/journal ordering.
> 
> We have two options here. The first is to detect this case in some
> manner and ensure that the partial checkpoint write sets NEED_FLUSH
> when the iclog is already marked NEED_FUA and the log tail changes.
> This seems somewhat fragile and quite complex to get right, and it
> doesn't actually make it obvious what underlying problem it is
> actually addressing from reading the code.
> 
> The second option seems much cleaner to me, because it is derived
> directly from the requirements of the C1 commit record in the iclog.
> That is, when we write this commit record to the iclog, we've
> guaranteed that the metadata/data ordering is correct for tail
> update purposes. Hence if we only write the log tail into the iclog
> for the *first* commit record rather than the log tail at the last
> release, we guarantee that the log tail does not move past where the
> the first commit record in the log expects it to be.
> 
> IOWs, taking the first option means that replay of C1 becomes
> dependent on future operations doing the right thing, not just the
> C1 checkpoint itself doing the right thing. This makes log recovery
> almost impossible to reason about because now we have to take into
> account what might or might not have happened in the future when
> looking at checkpoints in the log rather than just having to
> reconstruct the past...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good now,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
> Version 2:
> - fix xlog_verify_tail_lsn() debug code to look at the iclog tail
>   lsn rather than the current tail lsn that we might not have
>   written into the iclog.
> 
>  fs/xfs/xfs_log.c | 50 +++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 37 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 1c328efdca66..60ac5fd63f1e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -78,13 +78,12 @@ xlog_verify_iclog(
>  STATIC void
>  xlog_verify_tail_lsn(
>  	struct xlog		*log,
> -	struct xlog_in_core	*iclog,
> -	xfs_lsn_t		tail_lsn);
> +	struct xlog_in_core	*iclog);
>  #else
>  #define xlog_verify_dest_ptr(a,b)
>  #define xlog_verify_grant_tail(a)
>  #define xlog_verify_iclog(a,b,c)
> -#define xlog_verify_tail_lsn(a,b,c)
> +#define xlog_verify_tail_lsn(a,b)
>  #endif
>  
>  STATIC int
> @@ -489,12 +488,31 @@ xfs_log_reserve(
>  
>  /*
>   * Flush iclog to disk if this is the last reference to the given iclog and the
> - * it is in the WANT_SYNC state.  If the caller passes in a non-zero
> - * @old_tail_lsn and the current log tail does not match, there may be metadata
> - * on disk that must be persisted before this iclog is written.  To satisfy that
> - * requirement, set the XLOG_ICL_NEED_FLUSH flag as a condition for writing this
> - * iclog with the new log tail value.
> + * it is in the WANT_SYNC state.
> + *
> + * If the caller passes in a non-zero @old_tail_lsn and the current log tail
> + * does not match, there may be metadata on disk that must be persisted before
> + * this iclog is written.  To satisfy that requirement, set the
> + * XLOG_ICL_NEED_FLUSH flag as a condition for writing this iclog with the new
> + * log tail value.
> + *
> + * If XLOG_ICL_NEED_FUA is already set on the iclog, we need to ensure that the
> + * log tail is updated correctly. NEED_FUA indicates that the iclog will be
> + * written to stable storage, and implies that a commit record is contained
> + * within the iclog. We need to ensure that the log tail does not move beyond
> + * the tail that the first commit record in the iclog ordered against, otherwise
> + * correct recovery of that checkpoint becomes dependent on future operations
> + * performed on this iclog.
> + *
> + * Hence if NEED_FUA is set and the current iclog tail lsn is empty, write the
> + * current tail into iclog. Once the iclog tail is set, future operations must
> + * not modify it, otherwise they potentially violate ordering constraints for
> + * the checkpoint commit that wrote the initial tail lsn value. The tail lsn in
> + * the iclog will get zeroed on activation of the iclog after sync, so we
> + * always capture the tail lsn on the iclog on the first NEED_FUA release
> + * regardless of the number of active reference counts on this iclog.
>   */
> +
>  int
>  xlog_state_release_iclog(
>  	struct xlog		*log,
> @@ -519,6 +537,10 @@ xlog_state_release_iclog(
>  
>  		if (old_tail_lsn && tail_lsn != old_tail_lsn)
>  			iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
> +
> +		if ((iclog->ic_flags & XLOG_ICL_NEED_FUA) &&
> +		    !iclog->ic_header.h_tail_lsn)
> +			iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
>  	}
>  
>  	if (!atomic_dec_and_test(&iclog->ic_refcnt))
> @@ -530,8 +552,9 @@ xlog_state_release_iclog(
>  	}
>  
>  	iclog->ic_state = XLOG_STATE_SYNCING;
> -	iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
> -	xlog_verify_tail_lsn(log, iclog, tail_lsn);
> +	if (!iclog->ic_header.h_tail_lsn)
> +		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
> +	xlog_verify_tail_lsn(log, iclog);
>  	trace_xlog_iclog_syncing(iclog, _RET_IP_);
>  
>  	spin_unlock(&log->l_icloglock);
> @@ -2579,6 +2602,7 @@ xlog_state_activate_iclog(
>  	memset(iclog->ic_header.h_cycle_data, 0,
>  		sizeof(iclog->ic_header.h_cycle_data));
>  	iclog->ic_header.h_lsn = 0;
> +	iclog->ic_header.h_tail_lsn = 0;
>  }
>  
>  /*
> @@ -3614,10 +3638,10 @@ xlog_verify_grant_tail(
>  STATIC void
>  xlog_verify_tail_lsn(
>  	struct xlog		*log,
> -	struct xlog_in_core	*iclog,
> -	xfs_lsn_t		tail_lsn)
> +	struct xlog_in_core	*iclog)
>  {
> -    int blocks;
> +	xfs_lsn_t	tail_lsn = be64_to_cpu(iclog->ic_header.h_tail_lsn);
> +	int		blocks;
>  
>      if (CYCLE_LSN(tail_lsn) == log->l_prev_cycle) {
>  	blocks =

      reply	other threads:[~2021-07-29 16:28 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-27  7:10 [PATCH 0/11 v3] xfs: fix log cache flush regressions and bugs Dave Chinner
2021-07-27  7:10 ` [PATCH 01/11] xfs: flush data dev on external log write Dave Chinner
2021-07-27  7:10 ` [PATCH 02/11] xfs: external logs need to flush data device Dave Chinner
2021-07-27  7:10 ` [PATCH 03/11] xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog Dave Chinner
2021-07-27  7:10 ` [PATCH 04/11] xfs: fix ordering violation between cache flushes and tail updates Dave Chinner
2021-07-27 18:50   ` Darrick J. Wong
2021-07-27  7:10 ` [PATCH 05/11] xfs: factor out forced iclog flushes Dave Chinner
2021-07-27  7:10 ` [PATCH 06/11] xfs: log forces imply data device cache flushes Dave Chinner
2021-07-27  7:10 ` [PATCH 07/11] xfs: avoid unnecessary waits in xfs_log_force_lsn() Dave Chinner
2021-07-27  7:10 ` [PATCH 08/11] xfs: logging the on disk inode LSN can make it go backwards Dave Chinner
2021-07-27 19:00   ` Darrick J. Wong
2023-06-13  6:20   ` Chandan Babu R
2023-06-14  9:13     ` Dave Chinner
2021-07-27  7:10 ` [PATCH 09/11] xfs: Enforce attr3 buffer recovery order Dave Chinner
2021-07-27  7:10 ` [PATCH 10/11] xfs: need to see iclog flags in tracing Dave Chinner
2021-07-27  7:10 ` [PATCH 11/11] xfs: limit iclog tail updates Dave Chinner
2021-07-27 21:25   ` Darrick J. Wong
2021-07-27 22:13     ` Dave Chinner
2021-07-27 23:44   ` [PATCH 11/11 v2] " Dave Chinner
2021-07-29 16:28     ` Darrick J. Wong [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210729162847.GK3601443@magnolia \
    --to=djwong@kernel.org \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox