From: Brian Foster <bfoster@redhat.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org, david@fromorbit.com, hch@lst.de
Subject: Re: [PATCH 3/5] xfs: proper replay of deferred ops queued during log recovery
Date: Thu, 1 Oct 2020 13:31:42 -0400 [thread overview]
Message-ID: <20201001173142.GE112884@bfoster> (raw)
In-Reply-To: <160140141157.830233.8230232141442784711.stgit@magnolia>
On Tue, Sep 29, 2020 at 10:43:31AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> When we replay unfinished intent items that have been recovered from the
> log, it's possible that the replay will cause the creation of more
> deferred work items. As outlined in commit 509955823cc9c ("xfs: log
> recovery should replay deferred ops in order"), later work items have an
> implicit ordering dependency on earlier work items. Therefore, recovery
> must replay the items (both recovered and created) in the same order
> that they would have been during normal operation.
>
> For log recovery, we enforce this ordering by using an empty transaction
> to collect deferred ops that get created in the process of recovering a
> log intent item to prevent them from being committed before the rest of
> the recovered intent items. After we finish committing all the
> recovered log items, we allocate a transaction with an enormous block
> reservation, splice our huge list of created deferred ops into that
> transaction, and commit it, thereby finishing all those ops.
>
> This is /really/ hokey -- it's the one place in XFS where we allow
> nested transactions; the splicing of the defer ops list is is inelegant
> and has to be done twice per recovery function; and the broken way we
> handle inode pointers and block reservations cause subtle use-after-free
> and allocator problems that will be fixed by this patch and the two
> patches after it.
>
> Therefore, replace the hokey empty transaction with a structure designed
> to capture each chain of deferred ops that are created as part of
> recovering a single unfinished log intent. Finally, refactor the loop
> that replays those chains to do so using one transaction per chain.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> fs/xfs/libxfs/xfs_defer.c | 89 +++++++++++++++++++++++++++++++--
> fs/xfs/libxfs/xfs_defer.h | 19 +++++++
> fs/xfs/xfs_bmap_item.c | 16 +-----
> fs/xfs/xfs_extfree_item.c | 7 +--
> fs/xfs/xfs_log_recover.c | 118 +++++++++++++++++++++++++-------------------
> fs/xfs/xfs_refcount_item.c | 16 +-----
> fs/xfs/xfs_rmap_item.c | 7 +--
> fs/xfs/xfs_trans.h | 3 +
> 8 files changed, 184 insertions(+), 91 deletions(-)
>
>
...
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 84f876c6d498..550d0fa8057a 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2434,37 +2434,62 @@ xlog_recover_process_data(
> /* Take all the collected deferred ops and finish them in order. */
> static int
> xlog_finish_defer_ops(
> - struct xfs_trans *parent_tp)
> + struct xfs_mount *mp,
> + struct list_head *capture_list)
> {
> - struct xfs_mount *mp = parent_tp->t_mountp;
> + struct xfs_defer_capture *dfc, *next;
> struct xfs_trans *tp;
> int64_t freeblks;
> - uint resblks;
> - int error;
> + uint64_t resblks;
> + int error = 0;
>
> - /*
> - * We're finishing the defer_ops that accumulated as a result of
> - * recovering unfinished intent items during log recovery. We
> - * reserve an itruncate transaction because it is the largest
> - * permanent transaction type. Since we're the only user of the fs
> - * right now, take 93% (15/16) of the available free blocks. Use
> - * weird math to avoid a 64-bit division.
> - */
> - freeblks = percpu_counter_sum(&mp->m_fdblocks);
> - if (freeblks <= 0)
> - return -ENOSPC;
> - resblks = min_t(int64_t, UINT_MAX, freeblks);
> - resblks = (resblks * 15) >> 4;
> - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> - 0, XFS_TRANS_RESERVE, &tp);
> - if (error)
> - return error;
> - /* transfer all collected dfops to this transaction */
> - xfs_defer_move(tp, parent_tp);
> + list_for_each_entry_safe(dfc, next, capture_list, dfc_list) {
> + /*
> + * We're finishing the defer_ops that accumulated as a result
> + * of recovering unfinished intent items during log recovery.
> + * We reserve an itruncate transaction because it is the
> + * largest permanent transaction type. Since we're the only
> + * user of the fs right now, take 93% (15/16) of the available
> + * free blocks. Use weird math to avoid a 64-bit division.
> + */
> + freeblks = percpu_counter_sum(&mp->m_fdblocks);
> + if (freeblks <= 0)
> + return -ENOSPC;
>
> - return xfs_trans_commit(tp);
> + resblks = min_t(uint64_t, UINT_MAX, freeblks);
> + resblks = (resblks * 15) >> 4;
> + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> + 0, XFS_TRANS_RESERVE, &tp);
> + if (error)
> + return error;
> +
> + /* Transfer all collected dfops to this transaction. */
This old comment is a little misleading now that we have per-intent
capture lists. That nit aside, this looks good to me and the updated
factoring is easier to follow:
Reviewed-by: Brian Foster <bfoster@redhat.com>
> + list_del_init(&dfc->dfc_list);
> + xfs_defer_ops_continue(dfc, tp);
> +
> + error = xfs_trans_commit(tp);
> + if (error)
> + return error;
> + }
> +
> + ASSERT(list_empty(capture_list));
> + return 0;
> }
>
> +/* Release all the captured defer ops and capture structures in this list. */
> +static void
> +xlog_abort_defer_ops(
> + struct xfs_mount *mp,
> + struct list_head *capture_list)
> +{
> + struct xfs_defer_capture *dfc;
> + struct xfs_defer_capture *next;
> +
> + list_for_each_entry_safe(dfc, next, capture_list, dfc_list) {
> + list_del_init(&dfc->dfc_list);
> + xfs_defer_ops_release(mp, dfc);
> + }
> +}
> /*
> * When this is called, all of the log intent items which did not have
> * corresponding log done items should be in the AIL. What we do now
> @@ -2485,35 +2510,23 @@ STATIC int
> xlog_recover_process_intents(
> struct xlog *log)
> {
> - struct xfs_trans *parent_tp;
> + LIST_HEAD(capture_list);
> struct xfs_ail_cursor cur;
> struct xfs_log_item *lip;
> struct xfs_ail *ailp;
> - int error;
> + int error = 0;
> #if defined(DEBUG) || defined(XFS_WARN)
> xfs_lsn_t last_lsn;
> #endif
>
> - /*
> - * The intent recovery handlers commit transactions to complete recovery
> - * for individual intents, but any new deferred operations that are
> - * queued during that process are held off until the very end. The
> - * purpose of this transaction is to serve as a container for deferred
> - * operations. Each intent recovery handler must transfer dfops here
> - * before its local transaction commits, and we'll finish the entire
> - * list below.
> - */
> - error = xfs_trans_alloc_empty(log->l_mp, &parent_tp);
> - if (error)
> - return error;
> -
> ailp = log->l_ailp;
> spin_lock(&ailp->ail_lock);
> - lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> #if defined(DEBUG) || defined(XFS_WARN)
> last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
> #endif
> - while (lip != NULL) {
> + for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> + lip != NULL;
> + lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
> /*
> * We're done when we see something other than an intent.
> * There should be no intents left in the AIL now.
> @@ -2535,24 +2548,29 @@ xlog_recover_process_intents(
>
> /*
> * NOTE: If your intent processing routine can create more
> - * deferred ops, you /must/ attach them to the transaction in
> - * this routine or else those subsequent intents will get
> + * deferred ops, you /must/ attach them to the capture list in
> + * the recover routine or else those subsequent intents will be
> * replayed in the wrong order!
> */
> spin_unlock(&ailp->ail_lock);
> - error = lip->li_ops->iop_recover(lip, parent_tp);
> + error = lip->li_ops->iop_recover(lip, &capture_list);
> spin_lock(&ailp->ail_lock);
> if (error)
> - goto out;
> - lip = xfs_trans_ail_cursor_next(ailp, &cur);
> + break;
> }
> -out:
> +
> xfs_trans_ail_cursor_done(&cur);
> spin_unlock(&ailp->ail_lock);
> - if (!error)
> - error = xlog_finish_defer_ops(parent_tp);
> - xfs_trans_cancel(parent_tp);
> + if (error)
> + goto err;
>
> + error = xlog_finish_defer_ops(log->l_mp, &capture_list);
> + if (error)
> + goto err;
> +
> + return 0;
> +err:
> + xlog_abort_defer_ops(log->l_mp, &capture_list);
> return error;
> }
>
> diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
> index 3e34b7662361..0478374add64 100644
> --- a/fs/xfs/xfs_refcount_item.c
> +++ b/fs/xfs/xfs_refcount_item.c
> @@ -424,7 +424,7 @@ const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
> STATIC int
> xfs_cui_item_recover(
> struct xfs_log_item *lip,
> - struct xfs_trans *parent_tp)
> + struct list_head *capture_list)
> {
> struct xfs_bmbt_irec irec;
> struct xfs_cui_log_item *cuip = CUI_ITEM(lip);
> @@ -432,7 +432,7 @@ xfs_cui_item_recover(
> struct xfs_cud_log_item *cudp;
> struct xfs_trans *tp;
> struct xfs_btree_cur *rcur = NULL;
> - struct xfs_mount *mp = parent_tp->t_mountp;
> + struct xfs_mount *mp = lip->li_mountp;
> xfs_fsblock_t startblock_fsb;
> xfs_fsblock_t new_fsb;
> xfs_extlen_t new_len;
> @@ -487,12 +487,7 @@ xfs_cui_item_recover(
> mp->m_refc_maxlevels * 2, 0, XFS_TRANS_RESERVE, &tp);
> if (error)
> return error;
> - /*
> - * Recovery stashes all deferred ops during intent processing and
> - * finishes them on completion. Transfer current dfops state to this
> - * transaction and transfer the result back before we return.
> - */
> - xfs_defer_move(tp, parent_tp);
> +
> cudp = xfs_trans_get_cud(tp, cuip);
>
> for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
> @@ -549,13 +544,10 @@ xfs_cui_item_recover(
> }
>
> xfs_refcount_finish_one_cleanup(tp, rcur, error);
> - xfs_defer_capture(parent_tp, tp);
> - error = xfs_trans_commit(tp);
> - return error;
> + return xfs_defer_ops_capture_and_commit(tp, capture_list);
>
> abort_error:
> xfs_refcount_finish_one_cleanup(tp, rcur, error);
> - xfs_defer_move(parent_tp, tp);
> xfs_trans_cancel(tp);
> return error;
> }
> diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
> index e38ec5d736be..0d8fa707f079 100644
> --- a/fs/xfs/xfs_rmap_item.c
> +++ b/fs/xfs/xfs_rmap_item.c
> @@ -467,14 +467,14 @@ const struct xfs_defer_op_type xfs_rmap_update_defer_type = {
> STATIC int
> xfs_rui_item_recover(
> struct xfs_log_item *lip,
> - struct xfs_trans *parent_tp)
> + struct list_head *capture_list)
> {
> struct xfs_rui_log_item *ruip = RUI_ITEM(lip);
> struct xfs_map_extent *rmap;
> struct xfs_rud_log_item *rudp;
> struct xfs_trans *tp;
> struct xfs_btree_cur *rcur = NULL;
> - struct xfs_mount *mp = parent_tp->t_mountp;
> + struct xfs_mount *mp = lip->li_mountp;
> xfs_fsblock_t startblock_fsb;
> enum xfs_rmap_intent_type type;
> xfs_exntst_t state;
> @@ -567,8 +567,7 @@ xfs_rui_item_recover(
> }
>
> xfs_rmap_finish_one_cleanup(tp, rcur, error);
> - error = xfs_trans_commit(tp);
> - return error;
> + return xfs_defer_ops_capture_and_commit(tp, capture_list);
>
> abort_error:
> xfs_rmap_finish_one_cleanup(tp, rcur, error);
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index ced62a35a62b..186e77d08cc1 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -72,7 +72,8 @@ struct xfs_item_ops {
> void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
> void (*iop_release)(struct xfs_log_item *);
> xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
> - int (*iop_recover)(struct xfs_log_item *lip, struct xfs_trans *tp);
> + int (*iop_recover)(struct xfs_log_item *lip,
> + struct list_head *capture_list);
> bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
> };
>
>
next prev parent reply other threads:[~2020-10-01 17:31 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-09-29 17:43 [PATCH v4 0/5] xfs: fix how we deal with new intents during recovery Darrick J. Wong
2020-09-29 17:43 ` [PATCH 1/5] xfs: remove xfs_defer_reset Darrick J. Wong
2020-10-01 17:30 ` Brian Foster
2020-09-29 17:43 ` [PATCH 2/5] xfs: remove XFS_LI_RECOVERED Darrick J. Wong
2020-10-01 17:30 ` Brian Foster
2020-10-02 7:16 ` Christoph Hellwig
2020-09-29 17:43 ` [PATCH 3/5] xfs: proper replay of deferred ops queued during log recovery Darrick J. Wong
2020-10-01 17:31 ` Brian Foster [this message]
2020-10-01 17:41 ` Darrick J. Wong
2020-10-02 7:18 ` Christoph Hellwig
2020-09-29 17:43 ` [PATCH 4/5] xfs: xfs_defer_capture should absorb remaining block reservation Darrick J. Wong
2020-10-01 17:32 ` Brian Foster
2020-10-01 17:46 ` Darrick J. Wong
2020-10-02 4:20 ` [PATCH v4.2 " Darrick J. Wong
2020-10-02 7:22 ` Christoph Hellwig
2020-10-02 10:39 ` Brian Foster
2020-09-29 17:43 ` [PATCH 5/5] xfs: xfs_defer_capture should absorb remaining transaction reservation Darrick J. Wong
2020-10-01 17:32 ` Brian Foster
2020-10-01 17:52 ` Darrick J. Wong
2020-10-02 4:21 ` [PATCH v4.2 " Darrick J. Wong
2020-10-02 7:24 ` Christoph Hellwig
2020-10-02 7:32 ` Christoph Hellwig
2020-10-02 10:39 ` Brian Foster
-- strict thread matches above, loose matches on Subject: below --
2020-10-05 18:19 [PATCH v5 0/5] xfs: fix how we deal with new intents during recovery Darrick J. Wong
2020-10-05 18:20 ` [PATCH 3/5] xfs: proper replay of deferred ops queued during log recovery Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20201001173142.GE112884@bfoster \
--to=bfoster@redhat.com \
--cc=darrick.wong@oracle.com \
--cc=david@fromorbit.com \
--cc=hch@lst.de \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.