public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 10/10] xfs: intent item whiteouts
Date: Tue, 3 May 2022 15:50:09 -0700	[thread overview]
Message-ID: <20220503225009.GE8265@magnolia> (raw)
In-Reply-To: <20220503221728.185449-11-david@fromorbit.com>

On Wed, May 04, 2022 at 08:17:28AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we log modifications based on intents, we add both intent
> and intent done items to the modification being made. These get
> written to the log to ensure that the operation is re-run if the
> intent done is not found in the log.
> 
> However, for operations that complete wholly within a single
> checkpoint, the change in the checkpoint is atomic and will never
> need replay. In this case, we don't need to actually write the
> intent and intent done items to the journal because log recovery
> will never need to manually restart this modification.
> 
> Log recovery currently handles intent/intent done matching by
> inserting the intent into the AIL, then removing it when a matching
> intent done item is found. Hence for all the intent-based operations
> that complete within a checkpoint, we spend all that time parsing
> the intent/intent done items just to cancel them and do nothing with
> them.
> 
> Hence it follows that the only time we actually need intents in the
> log is when the modification crosses checkpoint boundaries in the
> log and so may only be partially complete in the journal. Hence if
> we commit and intent done item to the CIL and the intent item is in
> the same checkpoint, we don't actually have to write them to the
> journal because log recovery will always cancel the intents.
> 
> We've never really worried about the overhead of logging intents
> unnecessarily like this because the intents we log are generally
> very much smaller than the change being made. e.g. freeing an extent
> involves modifying at lease two freespace btree blocks and the AGF,
> so the EFI/EFD overhead is only a small increase in space and
> processing time compared to the overall cost of freeing an extent.
> 
> However, delayed attributes change this cost equation dramatically,
> especially for inline attributes. In the case of adding an inline
> attribute, we only log the inode core and attribute fork at present.
> With delayed attributes, we now log the attr intent which includes
> the name and value, the inode core adn attr fork, and finally the
> attr intent done item. We increase the number of items we log from 1
> to 3, and the number of log vectors (regions) goes up from 3 to 7.
> Hence we tripple the number of objects that the CIL has to process,
> and more than double the number of log vectors that need to be
> written to the journal.
> 
> At scale, this means delayed attributes cause a non-pipelined CIL to
> become CPU bound processing all the extra items, resulting in a > 40%
> performance degradation on 16-way file+xattr create worklaods.
> Pipelining the CIL (as per 5.15) reduces the performance degradation
> to 20%, but now the limitation is the rate at which the log items
> can be written to the iclogs and iclogs be dispatched for IO and
> completed.
> 
> Even log IO completion is slowed down by these intents, because it
> now has to process 3x the number of items in the checkpoint.
> Processing completed intents is especially inefficient here, because
> we first insert the intent into the AIL, then remove it from the AIL
> when the intent done is processed. IOWs, we are also doing expensive
> operations in log IO completion we could completely avoid if we
> didn't log completed intent/intent done pairs.
> 
> Enter log item whiteouts.
> 
> When an intent done is committed, we can check to see if the
> associated intent is in the same checkpoint as we are currently
> committing the intent done to. If so, we can mark the intent log
> item with a whiteout and immediately free the intent done item
> rather than committing it to the CIL. We can basically skip the
> entire formatting and CIL insertion steps for the intent done item.
> 
> However, we cannot remove the intent item from the CIL at this point
> because the unlocked per-cpu CIL item lists do not permit removal
> without holding the CIL context lock exclusively. Transaction commit
> only holds the context lock shared, hence the best we can do is mark
> the intent item with a whiteout so that the CIL push can release it
> rather than writing it to the log.
> 
> This means we never write the intent to the log if the intent done
> has also been committed to the same checkpoint, but we'll always
> write the intent if the intent done has not been committed or has
> been committed to a different checkpoint. This will result in
> correct log recovery behaviour in all cases, without the overhead of
> logging unnecessary intents.
> 
> This intent whiteout concept is generic - we can apply it to all
> intent/intent done pairs that have a direct 1:1 relationship. The
> way deferred ops iterate and relog intents mean that all intents
> currently have a 1:1 relationship with their done intent, and hence
> we can apply this cancellation to all existing intent/intent done
> implementations.
> 
> For delayed attributes with a 16-way 64kB xattr create workload,
> whiteouts reduce the amount of journalled metadata from ~2.5GB/s
> down to ~600MB/s and improve the creation rate from 9000/s to
> 14000/s.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_log_cil.c | 78 ++++++++++++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_trace.h   |  3 ++
>  fs/xfs/xfs_trans.h   |  6 ++--
>  3 files changed, 82 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 0d8d092447ad..fecd2ea3e935 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -476,7 +476,8 @@ xlog_cil_insert_format_items(
>  static void
>  xlog_cil_insert_items(
>  	struct xlog		*log,
> -	struct xfs_trans	*tp)
> +	struct xfs_trans	*tp,
> +	uint32_t		released_space)
>  {
>  	struct xfs_cil		*cil = log->l_cilp;
>  	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
> @@ -525,7 +526,9 @@ xlog_cil_insert_items(
>  		ASSERT(tp->t_ticket->t_curr_res >= len);
>  	}
>  	tp->t_ticket->t_curr_res -= len;
> +	tp->t_ticket->t_curr_res += released_space;
>  	ctx->space_used += len;
> +	ctx->space_used -= released_space;
>  
>  	/*
>  	 * If we've overrun the reservation, dump the tx details before we move
> @@ -970,11 +973,16 @@ xlog_cil_build_trans_hdr(
>   * Pull all the log vectors off the items in the CIL, and remove the items from
>   * the CIL. We don't need the CIL lock here because it's only needed on the
>   * transaction commit side which is currently locked out by the flush lock.
> + *
> + * If a log item is marked with a whiteout, we do not need to write it to the
> + * journal and so we just move them to the whiteout list for the caller to
> + * dispose of appropriately.
>   */
>  static void
>  xlog_cil_build_lv_chain(
>  	struct xfs_cil		*cil,
>  	struct xfs_cil_ctx	*ctx,
> +	struct list_head	*whiteouts,
>  	uint32_t		*num_iovecs,
>  	uint32_t		*num_bytes)
>  {
> @@ -985,6 +993,13 @@ xlog_cil_build_lv_chain(
>  
>  		item = list_first_entry(&cil->xc_cil,
>  					struct xfs_log_item, li_cil);
> +
> +		if (test_bit(XFS_LI_WHITEOUT, &item->li_flags)) {
> +			list_move(&item->li_cil, whiteouts);
> +			trace_xfs_cil_whiteout_skip(item);
> +			continue;
> +		}
> +
>  		list_del_init(&item->li_cil);
>  		if (!ctx->lv_chain)
>  			ctx->lv_chain = item->li_lv;
> @@ -1000,6 +1015,19 @@ xlog_cil_build_lv_chain(
>  	}
>  }
>  
> +static void
> +xlog_cil_push_cleanup_whiteouts(

Pushing cleanup whiteouts?

Oh, clean up whiteouts as part of pushing CIL.

I almost want to ask for a comment here:

/* Remove log items from the CIL that have been elided from the checkpoint. */
static void
xlog_cil_push_cleanup_whiteouts(

But fmeh, aside from my own momentary confusion this isn't that big of a
deal.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> +	struct list_head	*whiteouts)
> +{
> +	while (!list_empty(whiteouts)) {
> +		struct xfs_log_item *item = list_first_entry(whiteouts,
> +						struct xfs_log_item, li_cil);
> +		list_del_init(&item->li_cil);
> +		trace_xfs_cil_whiteout_unpin(item);
> +		item->li_ops->iop_unpin(item, 1);
> +	}
> +}
> +
>  /*
>   * Push the Committed Item List to the log.
>   *
> @@ -1030,6 +1058,7 @@ xlog_cil_push_work(
>  	struct xfs_log_vec	lvhdr = { NULL };
>  	xfs_csn_t		push_seq;
>  	bool			push_commit_stable;
> +	LIST_HEAD		(whiteouts);
>  
>  	new_ctx = xlog_cil_ctx_alloc();
>  	new_ctx->ticket = xlog_cil_ticket_alloc(log);
> @@ -1098,7 +1127,7 @@ xlog_cil_push_work(
>  	list_add(&ctx->committing, &cil->xc_committing);
>  	spin_unlock(&cil->xc_push_lock);
>  
> -	xlog_cil_build_lv_chain(cil, ctx, &num_iovecs, &num_bytes);
> +	xlog_cil_build_lv_chain(cil, ctx, &whiteouts, &num_iovecs, &num_bytes);
>  
>  	/*
>  	 * Switch the contexts so we can drop the context lock and move out
> @@ -1201,6 +1230,7 @@ xlog_cil_push_work(
>  	/* Not safe to reference ctx now! */
>  
>  	spin_unlock(&log->l_icloglock);
> +	xlog_cil_push_cleanup_whiteouts(&whiteouts);
>  	return;
>  
>  out_skip:
> @@ -1212,6 +1242,7 @@ xlog_cil_push_work(
>  out_abort_free_ticket:
>  	xfs_log_ticket_ungrant(log, ctx->ticket);
>  	ASSERT(xlog_is_shutdown(log));
> +	xlog_cil_push_cleanup_whiteouts(&whiteouts);
>  	if (!ctx->commit_iclog) {
>  		xlog_cil_committed(ctx);
>  		return;
> @@ -1360,6 +1391,43 @@ xlog_cil_empty(
>  	return empty;
>  }
>  
> +/*
> + * If there are intent done items in this transaction and the related intent was
> + * committed in the current (same) CIL checkpoint, we don't need to write either
> + * the intent or intent done item to the journal as the change will be
> + * journalled atomically within this checkpoint. As we cannot remove items from
> + * the CIL here, mark the related intent with a whiteout so that the CIL push
> + * can remove it rather than writing it to the journal. Then remove the intent
> + * done item from the current transaction and release it so it doesn't get put
> + * into the CIL at all.
> + */
> +static uint32_t
> +xlog_cil_process_intents(
> +	struct xfs_cil		*cil,
> +	struct xfs_trans	*tp)
> +{
> +	struct xfs_log_item	*lip, *ilip, *next;
> +	uint32_t		len = 0;
> +
> +	list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
> +		if (!(lip->li_ops->flags & XFS_ITEM_INTENT_DONE))
> +			continue;
> +
> +		ilip = lip->li_ops->iop_intent(lip);
> +		if (!ilip || !xlog_item_in_current_chkpt(cil, ilip))
> +			continue;
> +		set_bit(XFS_LI_WHITEOUT, &ilip->li_flags);
> +		trace_xfs_cil_whiteout_mark(ilip);
> +		len += ilip->li_lv->lv_bytes;
> +		kmem_free(ilip->li_lv);
> +		ilip->li_lv = NULL;
> +
> +		xfs_trans_del_item(lip);
> +		lip->li_ops->iop_release(lip);
> +	}
> +	return len;
> +}
> +
>  /*
>   * Commit a transaction with the given vector to the Committed Item List.
>   *
> @@ -1382,6 +1450,7 @@ xlog_cil_commit(
>  {
>  	struct xfs_cil		*cil = log->l_cilp;
>  	struct xfs_log_item	*lip, *next;
> +	uint32_t		released_space = 0;
>  
>  	/*
>  	 * Do all necessary memory allocation before we lock the CIL.
> @@ -1393,7 +1462,10 @@ xlog_cil_commit(
>  	/* lock out background commit */
>  	down_read(&cil->xc_ctx_lock);
>  
> -	xlog_cil_insert_items(log, tp);
> +	if (tp->t_flags & XFS_TRANS_HAS_INTENT_DONE)
> +		released_space = xlog_cil_process_intents(cil, tp);
> +
> +	xlog_cil_insert_items(log, tp, released_space);
>  
>  	if (regrant && !xlog_is_shutdown(log))
>  		xfs_log_ticket_regrant(log, tp->t_ticket);
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index e1197f9ad97e..75934e3c3f55 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1332,6 +1332,9 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_mark);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_skip);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_unpin);
>  
>  DECLARE_EVENT_CLASS(xfs_ail_class,
>  	TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t new_lsn),
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index d72a5995d33e..9561f193e7e1 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -55,13 +55,15 @@ struct xfs_log_item {
>  #define	XFS_LI_IN_AIL	0
>  #define	XFS_LI_ABORTED	1
>  #define	XFS_LI_FAILED	2
> -#define	XFS_LI_DIRTY	3	/* log item dirty in transaction */
> +#define	XFS_LI_DIRTY	3
> +#define	XFS_LI_WHITEOUT	4
>  
>  #define XFS_LI_FLAGS \
>  	{ (1u << XFS_LI_IN_AIL),	"IN_AIL" }, \
>  	{ (1u << XFS_LI_ABORTED),	"ABORTED" }, \
>  	{ (1u << XFS_LI_FAILED),	"FAILED" }, \
> -	{ (1u << XFS_LI_DIRTY),		"DIRTY" }
> +	{ (1u << XFS_LI_DIRTY),		"DIRTY" }, \
> +	{ (1u << XFS_LI_WHITEOUT),	"WHITEOUT" }
>  
>  struct xfs_item_ops {
>  	unsigned flags;
> -- 
> 2.35.1
> 

  parent reply	other threads:[~2022-05-03 22:50 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-05-03 22:17 [PATCH 00/10 v6] xfs: intent whiteouts Dave Chinner
2022-05-03 22:17 ` [PATCH 01/10] xfs: zero inode fork buffer at allocation Dave Chinner
2022-05-03 22:41   ` Darrick J. Wong
2022-05-03 22:42   ` Alli
2022-05-10 12:47   ` Christoph Hellwig
2022-05-03 22:17 ` [PATCH 02/10] xfs: fix potential log item leak Dave Chinner
2022-05-03 22:42   ` Alli
2022-05-03 22:44   ` Darrick J. Wong
2022-05-10 12:48   ` Christoph Hellwig
2022-05-03 22:17 ` [PATCH 03/10] xfs: hide log iovec alignment constraints Dave Chinner
2022-05-03 22:45   ` Darrick J. Wong
2022-05-03 23:07     ` Dave Chinner
2022-05-03 22:17 ` [PATCH 04/10] xfs: don't commit the first deferred transaction without intents Dave Chinner
2022-05-03 22:17 ` [PATCH 05/10] xfs: add log item flags to indicate intents Dave Chinner
2022-05-03 22:17 ` [PATCH 06/10] xfs: tag transactions that contain intent done items Dave Chinner
2022-05-03 22:17 ` [PATCH 07/10] xfs: factor and move some code in xfs_log_cil.c Dave Chinner
2022-05-03 22:17 ` [PATCH 08/10] xfs: add log item method to return related intents Dave Chinner
2022-05-03 22:17 ` [PATCH 09/10] xfs: whiteouts release intents that are not in the AIL Dave Chinner
2022-05-10 12:49   ` Christoph Hellwig
2022-05-03 22:17 ` [PATCH 10/10] xfs: intent item whiteouts Dave Chinner
2022-05-03 22:42   ` Alli
2022-05-03 22:50   ` Darrick J. Wong [this message]
2022-05-04  1:49     ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220503225009.GE8265@magnolia \
    --to=djwong@kernel.org \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox