From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 10/10] xfs: intent item whiteouts
Date: Tue, 3 May 2022 15:50:09 -0700
Message-ID: <20220503225009.GE8265@magnolia>
In-Reply-To: <20220503221728.185449-11-david@fromorbit.com>

On Wed, May 04, 2022 at 08:17:28AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> When we log modifications based on intents, we add both intent
> and intent done items to the modification being made. These get
> written to the log to ensure that the operation is re-run if the
> intent done is not found in the log.
>
> However, for operations that complete wholly within a single
> checkpoint, the change in the checkpoint is atomic and will never
> need replay. In this case, we don't need to actually write the
> intent and intent done items to the journal because log recovery
> will never need to manually restart this modification.
>
> Log recovery currently handles intent/intent done matching by
> inserting the intent into the AIL, then removing it when a matching
> intent done item is found. Hence for all the intent-based operations
> that complete within a checkpoint, we spend all that time parsing
> the intent/intent done items just to cancel them and do nothing with
> them.
>
> Hence it follows that the only time we actually need intents in the
> log is when the modification crosses checkpoint boundaries in the
> log and so may only be partially complete in the journal. Hence if
> we commit an intent done item to the CIL and the intent item is in
> the same checkpoint, we don't actually have to write them to the
> journal because log recovery will always cancel the intents.
>
> We've never really worried about the overhead of logging intents
> unnecessarily like this because the intents we log are generally
> very much smaller than the change being made. e.g. freeing an extent
> involves modifying at least two freespace btree blocks and the AGF,
> so the EFI/EFD overhead is only a small increase in space and
> processing time compared to the overall cost of freeing an extent.
>
> However, delayed attributes change this cost equation dramatically,
> especially for inline attributes. In the case of adding an inline
> attribute, we only log the inode core and attribute fork at present.
> With delayed attributes, we now log the attr intent which includes
> the name and value, the inode core and attr fork, and finally the
> attr intent done item. We increase the number of items we log from 1
> to 3, and the number of log vectors (regions) goes up from 3 to 7.
> Hence we triple the number of objects that the CIL has to process,
> and more than double the number of log vectors that need to be
> written to the journal.
>
> At scale, this means delayed attributes cause a non-pipelined CIL to
> become CPU bound processing all the extra items, resulting in a > 40%
> performance degradation on 16-way file+xattr create workloads.
> Pipelining the CIL (as per 5.15) reduces the performance degradation
> to 20%, but now the limitation is the rate at which the log items
> can be written to the iclogs and iclogs be dispatched for IO and
> completed.
>
> Even log IO completion is slowed down by these intents, because it
> now has to process 3x the number of items in the checkpoint.
> Processing completed intents is especially inefficient here, because
> we first insert the intent into the AIL, then remove it from the AIL
> when the intent done is processed. IOWs, we are also doing expensive
> operations in log IO completion we could completely avoid if we
> didn't log completed intent/intent done pairs.
>
> Enter log item whiteouts.
>
> When an intent done is committed, we can check to see if the
> associated intent is in the same checkpoint as we are currently
> committing the intent done to. If so, we can mark the intent log
> item with a whiteout and immediately free the intent done item
> rather than committing it to the CIL. We can basically skip the
> entire formatting and CIL insertion steps for the intent done item.
>
> However, we cannot remove the intent item from the CIL at this point
> because the unlocked per-cpu CIL item lists do not permit removal
> without holding the CIL context lock exclusively. Transaction commit
> only holds the context lock shared, hence the best we can do is mark
> the intent item with a whiteout so that the CIL push can release it
> rather than writing it to the log.
>
> This means we never write the intent to the log if the intent done
> has also been committed to the same checkpoint, but we'll always
> write the intent if the intent done has not been committed or has
> been committed to a different checkpoint. This will result in
> correct log recovery behaviour in all cases, without the overhead of
> logging unnecessary intents.
>
> This intent whiteout concept is generic - we can apply it to all
> intent/intent done pairs that have a direct 1:1 relationship. The
> way deferred ops iterate and relog intents means that all intents
> currently have a 1:1 relationship with their done intent, and hence
> we can apply this cancellation to all existing intent/intent done
> implementations.
>
> For delayed attributes with a 16-way 64kB xattr create workload,
> whiteouts reduce the amount of journalled metadata from ~2.5GB/s
> down to ~600MB/s and improve the creation rate from 9000/s to
> 14000/s.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/xfs/xfs_log_cil.c | 78 ++++++++++++++++++++++++++++++++++++++++++--
> fs/xfs/xfs_trace.h | 3 ++
> fs/xfs/xfs_trans.h | 6 ++--
> 3 files changed, 82 insertions(+), 5 deletions(-)
>
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 0d8d092447ad..fecd2ea3e935 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -476,7 +476,8 @@ xlog_cil_insert_format_items(
> static void
> xlog_cil_insert_items(
> struct xlog *log,
> - struct xfs_trans *tp)
> + struct xfs_trans *tp,
> + uint32_t released_space)
> {
> struct xfs_cil *cil = log->l_cilp;
> struct xfs_cil_ctx *ctx = cil->xc_ctx;
> @@ -525,7 +526,9 @@ xlog_cil_insert_items(
> ASSERT(tp->t_ticket->t_curr_res >= len);
> }
> tp->t_ticket->t_curr_res -= len;
> + tp->t_ticket->t_curr_res += released_space;
> ctx->space_used += len;
> + ctx->space_used -= released_space;
>
> /*
> * If we've overrun the reservation, dump the tx details before we move
> @@ -970,11 +973,16 @@ xlog_cil_build_trans_hdr(
> * Pull all the log vectors off the items in the CIL, and remove the items from
> * the CIL. We don't need the CIL lock here because it's only needed on the
> * transaction commit side which is currently locked out by the flush lock.
> + *
> + * If a log item is marked with a whiteout, we do not need to write it to the
> + * journal and so we just move them to the whiteout list for the caller to
> + * dispose of appropriately.
> */
> static void
> xlog_cil_build_lv_chain(
> struct xfs_cil *cil,
> struct xfs_cil_ctx *ctx,
> + struct list_head *whiteouts,
> uint32_t *num_iovecs,
> uint32_t *num_bytes)
> {
> @@ -985,6 +993,13 @@ xlog_cil_build_lv_chain(
>
> item = list_first_entry(&cil->xc_cil,
> struct xfs_log_item, li_cil);
> +
> + if (test_bit(XFS_LI_WHITEOUT, &item->li_flags)) {
> + list_move(&item->li_cil, whiteouts);
> + trace_xfs_cil_whiteout_skip(item);
> + continue;
> + }
> +
> list_del_init(&item->li_cil);
> if (!ctx->lv_chain)
> ctx->lv_chain = item->li_lv;
> @@ -1000,6 +1015,19 @@ xlog_cil_build_lv_chain(
> }
> }
>
> +static void
> +xlog_cil_push_cleanup_whiteouts(

Pushing cleanup whiteouts?

Oh, clean up whiteouts as part of pushing CIL.

I almost want to ask for a comment here:

/* Remove log items from the CIL that have been elided from the checkpoint. */
static void
xlog_cil_push_cleanup_whiteouts(

But fmeh, aside from my own momentary confusion this isn't that big of a
deal.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> + struct list_head *whiteouts)
> +{
> + while (!list_empty(whiteouts)) {
> + struct xfs_log_item *item = list_first_entry(whiteouts,
> + struct xfs_log_item, li_cil);
> + list_del_init(&item->li_cil);
> + trace_xfs_cil_whiteout_unpin(item);
> + item->li_ops->iop_unpin(item, 1);
> + }
> +}
> +
> /*
> * Push the Committed Item List to the log.
> *
> @@ -1030,6 +1058,7 @@ xlog_cil_push_work(
> struct xfs_log_vec lvhdr = { NULL };
> xfs_csn_t push_seq;
> bool push_commit_stable;
> + LIST_HEAD (whiteouts);
>
> new_ctx = xlog_cil_ctx_alloc();
> new_ctx->ticket = xlog_cil_ticket_alloc(log);
> @@ -1098,7 +1127,7 @@ xlog_cil_push_work(
> list_add(&ctx->committing, &cil->xc_committing);
> spin_unlock(&cil->xc_push_lock);
>
> - xlog_cil_build_lv_chain(cil, ctx, &num_iovecs, &num_bytes);
> + xlog_cil_build_lv_chain(cil, ctx, &whiteouts, &num_iovecs, &num_bytes);
>
> /*
> * Switch the contexts so we can drop the context lock and move out
> @@ -1201,6 +1230,7 @@ xlog_cil_push_work(
> /* Not safe to reference ctx now! */
>
> spin_unlock(&log->l_icloglock);
> + xlog_cil_push_cleanup_whiteouts(&whiteouts);
> return;
>
> out_skip:
> @@ -1212,6 +1242,7 @@ xlog_cil_push_work(
> out_abort_free_ticket:
> xfs_log_ticket_ungrant(log, ctx->ticket);
> ASSERT(xlog_is_shutdown(log));
> + xlog_cil_push_cleanup_whiteouts(&whiteouts);
> if (!ctx->commit_iclog) {
> xlog_cil_committed(ctx);
> return;
> @@ -1360,6 +1391,43 @@ xlog_cil_empty(
> return empty;
> }
>
> +/*
> + * If there are intent done items in this transaction and the related intent was
> + * committed in the current (same) CIL checkpoint, we don't need to write either
> + * the intent or intent done item to the journal as the change will be
> + * journalled atomically within this checkpoint. As we cannot remove items from
> + * the CIL here, mark the related intent with a whiteout so that the CIL push
> + * can remove it rather than writing it to the journal. Then remove the intent
> + * done item from the current transaction and release it so it doesn't get put
> + * into the CIL at all.
> + */
> +static uint32_t
> +xlog_cil_process_intents(
> + struct xfs_cil *cil,
> + struct xfs_trans *tp)
> +{
> + struct xfs_log_item *lip, *ilip, *next;
> + uint32_t len = 0;
> +
> + list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
> + if (!(lip->li_ops->flags & XFS_ITEM_INTENT_DONE))
> + continue;
> +
> + ilip = lip->li_ops->iop_intent(lip);
> + if (!ilip || !xlog_item_in_current_chkpt(cil, ilip))
> + continue;
> + set_bit(XFS_LI_WHITEOUT, &ilip->li_flags);
> + trace_xfs_cil_whiteout_mark(ilip);
> + len += ilip->li_lv->lv_bytes;
> + kmem_free(ilip->li_lv);
> + ilip->li_lv = NULL;
> +
> + xfs_trans_del_item(lip);
> + lip->li_ops->iop_release(lip);
> + }
> + return len;
> +}
> +
> /*
> * Commit a transaction with the given vector to the Committed Item List.
> *
> @@ -1382,6 +1450,7 @@ xlog_cil_commit(
> {
> struct xfs_cil *cil = log->l_cilp;
> struct xfs_log_item *lip, *next;
> + uint32_t released_space = 0;
>
> /*
> * Do all necessary memory allocation before we lock the CIL.
> @@ -1393,7 +1462,10 @@ xlog_cil_commit(
> /* lock out background commit */
> down_read(&cil->xc_ctx_lock);
>
> - xlog_cil_insert_items(log, tp);
> + if (tp->t_flags & XFS_TRANS_HAS_INTENT_DONE)
> + released_space = xlog_cil_process_intents(cil, tp);
> +
> + xlog_cil_insert_items(log, tp, released_space);
>
> if (regrant && !xlog_is_shutdown(log))
> xfs_log_ticket_regrant(log, tp->t_ticket);
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index e1197f9ad97e..75934e3c3f55 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1332,6 +1332,9 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
> DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
> DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
> DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_mark);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_skip);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_unpin);
>
> DECLARE_EVENT_CLASS(xfs_ail_class,
> TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t new_lsn),
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index d72a5995d33e..9561f193e7e1 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -55,13 +55,15 @@ struct xfs_log_item {
> #define XFS_LI_IN_AIL 0
> #define XFS_LI_ABORTED 1
> #define XFS_LI_FAILED 2
> -#define XFS_LI_DIRTY 3 /* log item dirty in transaction */
> +#define XFS_LI_DIRTY 3
> +#define XFS_LI_WHITEOUT 4
>
> #define XFS_LI_FLAGS \
> { (1u << XFS_LI_IN_AIL), "IN_AIL" }, \
> { (1u << XFS_LI_ABORTED), "ABORTED" }, \
> { (1u << XFS_LI_FAILED), "FAILED" }, \
> - { (1u << XFS_LI_DIRTY), "DIRTY" }
> + { (1u << XFS_LI_DIRTY), "DIRTY" }, \
> + { (1u << XFS_LI_WHITEOUT), "WHITEOUT" }
>
> struct xfs_item_ops {
> unsigned flags;
> --
> 2.35.1
>
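
For anyone following the thread, here is a minimal userspace sketch of the
whiteout decision described in the commit message above. It is not the
kernel code: struct model_item, struct model_cil, model_commit() and
model_push() are hypothetical names invented for illustration, the CIL is
modelled as a small fixed array, and locking, log vectors and unpinning are
left out entirely. It only shows the rule "intent done in the same
checkpoint as its intent means neither gets written".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define MODEL_MAX_ITEMS 16

enum item_type { ITEM_INTENT, ITEM_INTENT_DONE, ITEM_OTHER };

struct model_item {
	enum item_type		type;
	uint32_t		bytes;		/* formatted size of the item */
	uint64_t		ckpt_seq;	/* checkpoint it was committed to */
	bool			whiteout;	/* skip this item at push time */
	struct model_item	*intent;	/* intent done only: related intent */
};

struct model_cil {
	uint64_t		cur_seq;	/* current checkpoint sequence */
	uint32_t		space_used;	/* bytes accounted to the checkpoint */
	struct model_item	*items[MODEL_MAX_ITEMS];
	int			nr_items;
};

/*
 * Commit one item. Returns true if the item was inserted into the CIL,
 * false if it was an intent done item that was elided by a whiteout.
 */
static bool
model_commit(struct model_cil *cil, struct model_item *it)
{
	if (it->type == ITEM_INTENT_DONE && it->intent &&
	    it->intent->ckpt_seq == cil->cur_seq) {
		/*
		 * The intent is in the current checkpoint: mark it with a
		 * whiteout, give back the space it consumed, and drop the
		 * intent done item without inserting it.
		 */
		it->intent->whiteout = true;
		cil->space_used -= it->intent->bytes;
		return false;
	}

	assert(cil->nr_items < MODEL_MAX_ITEMS);
	it->ckpt_seq = cil->cur_seq;
	cil->items[cil->nr_items++] = it;
	cil->space_used += it->bytes;
	return true;
}

/*
 * Push the checkpoint. Whiteout items are skipped rather than written.
 * Returns the number of items written and stores their total size in
 * *bytes_written, then starts a new checkpoint.
 */
static int
model_push(struct model_cil *cil, uint32_t *bytes_written)
{
	int written = 0;

	*bytes_written = 0;
	for (int i = 0; i < cil->nr_items; i++) {
		struct model_item *it = cil->items[i];

		if (it->whiteout)
			continue;
		written++;
		*bytes_written += it->bytes;
	}
	cil->nr_items = 0;
	cil->space_used = 0;
	cil->cur_seq++;
	return written;
}
```

In this model, committing an intent, an unrelated item and the matching
intent done to the same checkpoint leaves only the unrelated item for
model_push() to write. If a push happens between the intent and the intent
done, both are written, each in its own checkpoint, which is the case where
log recovery still needs them.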