From: Filipe Manana <fdmanana@kernel.org>
To: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] btrfs: do not block starts waiting on previous transaction commit
Date: Fri, 25 Aug 2023 12:51:57 +0100 [thread overview]
Message-ID: <ZOiV3TvaMnpFQE2f@debian0.Home> (raw)
In-Reply-To: <042d69d13a10b90a29b5e096db59b9669fac68d2.1692910751.git.josef@toxicpanda.com>
On Thu, Aug 24, 2023 at 04:59:22PM -0400, Josef Bacik wrote:
> Internally I got a report of very long stalls on normal operations like
> creating a new file when auto relocation was running. The reporter used
> the bpf offcputime tracer to show that we would get stuck in
> start_transaction for 5 to 30 seconds, and were always being woken up by
> the transaction commit.
>
> Using my timing-everything script, which times how long a function takes
> and what percentage of that total time is taken up by its children, I
> saw several traces like this
>
> 1083 took 32812902424 ns
> 29929002926 ns 91.2110% wait_for_commit_duration
> 25568 ns 7.7920e-05% commit_fs_roots_duration
> 1007751 ns 0.00307% commit_cowonly_roots_duration
> 446855602 ns 1.36182% btrfs_run_delayed_refs_duration
> 271980 ns 0.00082% btrfs_run_delayed_items_duration
> 2008 ns 6.1195e-06% btrfs_apply_pending_changes_duration
> 9656 ns 2.9427e-05% switch_commit_roots_duration
> 1598 ns 4.8700e-06% btrfs_commit_device_sizes_duration
> 4314 ns 1.3147e-05% btrfs_free_log_root_tree_duration
>
> Here I was only tracing functions that happen where we are between
> START_COMMIT and UNBLOCKED in order to see what would be keeping us
> blocked for so long. The wait_for_commit() we do is where we wait for a
> previous transaction that hasn't completed it's commit. This can
> include all of the unpin work and other cleanups, which tends to be the
> longest part of our transaction commit.
>
> There is no reason we should be blocking new things from entering the
> transaction at this point, it just adds to random latency spikes for no
> reason.
>
> Fix this by adding a PREP stage. This allows us to properly deal with
> multiple committers coming in at the same time, we retain the behavior
> that the winner waits on the previous transaction and the losers all
> wait for this transaction commit to occur. Nothing else is blocked
> during the PREP stage, and then once the wait is complete we switch to
> COMMIT_START and all of the same behavior as before is maintained.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Looks good, thanks.
> ---
> fs/btrfs/disk-io.c | 8 ++++----
> fs/btrfs/locking.h | 2 +-
> fs/btrfs/transaction.c | 39 ++++++++++++++++++++++++---------------
> fs/btrfs/transaction.h | 1 +
> 4 files changed, 30 insertions(+), 20 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 0a96ea8c1d3a..639f1e308e4c 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1547,7 +1547,7 @@ static int transaction_kthread(void *arg)
>
> delta = ktime_get_seconds() - cur->start_time;
> if (!test_and_clear_bit(BTRFS_FS_COMMIT_TRANS, &fs_info->flags) &&
> - cur->state < TRANS_STATE_COMMIT_START &&
> + cur->state < TRANS_STATE_COMMIT_PREP &&
> delta < fs_info->commit_interval) {
> spin_unlock(&fs_info->trans_lock);
> delay -= msecs_to_jiffies((delta - 1) * 1000);
> @@ -2682,8 +2682,8 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
> btrfs_lockdep_init_map(fs_info, btrfs_trans_num_extwriters);
> btrfs_lockdep_init_map(fs_info, btrfs_trans_pending_ordered);
> btrfs_lockdep_init_map(fs_info, btrfs_ordered_extent);
> - btrfs_state_lockdep_init_map(fs_info, btrfs_trans_commit_start,
> - BTRFS_LOCKDEP_TRANS_COMMIT_START);
> + btrfs_state_lockdep_init_map(fs_info, btrfs_trans_commit_prep,
> + BTRFS_LOCKDEP_TRANS_COMMIT_PREP);
> btrfs_state_lockdep_init_map(fs_info, btrfs_trans_unblocked,
> BTRFS_LOCKDEP_TRANS_UNBLOCKED);
> btrfs_state_lockdep_init_map(fs_info, btrfs_trans_super_committed,
> @@ -4870,7 +4870,7 @@ static int btrfs_cleanup_transaction(struct btrfs_fs_info *fs_info)
> while (!list_empty(&fs_info->trans_list)) {
> t = list_first_entry(&fs_info->trans_list,
> struct btrfs_transaction, list);
> - if (t->state >= TRANS_STATE_COMMIT_START) {
> + if (t->state >= TRANS_STATE_COMMIT_PREP) {
> refcount_inc(&t->use_count);
> spin_unlock(&fs_info->trans_lock);
> btrfs_wait_for_commit(fs_info, t->transid);
> diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
> index edb9b4a0dba1..7d6ee1e609bf 100644
> --- a/fs/btrfs/locking.h
> +++ b/fs/btrfs/locking.h
> @@ -79,7 +79,7 @@ enum btrfs_lock_nesting {
> };
>
> enum btrfs_lockdep_trans_states {
> - BTRFS_LOCKDEP_TRANS_COMMIT_START,
> + BTRFS_LOCKDEP_TRANS_COMMIT_PREP,
> BTRFS_LOCKDEP_TRANS_UNBLOCKED,
> BTRFS_LOCKDEP_TRANS_SUPER_COMMITTED,
> BTRFS_LOCKDEP_TRANS_COMPLETED,
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index ab09542f2170..341363beaf10 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -56,12 +56,17 @@ static struct kmem_cache *btrfs_trans_handle_cachep;
> * | Call btrfs_commit_transaction() on any trans handle attached to
> * | transaction N
> * V
> + * Transaction N [[TRANS_STATE_COMMIT_PREP]]
> + * |
> + * | If there are simultaneous calls to btrfs_commit_transaction() one will win
> + * | the race and the rest will wait for the winner to commit the transaction.
> + * |
> + * | The winner will wait for previous running transaction to completely finish
> + * | if there is one.
> + * |
> * Transaction N [[TRANS_STATE_COMMIT_START]]
> * |
> - * | Will wait for previous running transaction to completely finish if there
> - * | is one
> - * |
> - * | Then one of the following happes:
> + * | Then one of the following happens:
> * | - Wait for all other trans handle holders to release.
> * | The btrfs_commit_transaction() caller will do the commit work.
> * | - Wait for current transaction to be committed by others.
> @@ -112,6 +117,7 @@ static struct kmem_cache *btrfs_trans_handle_cachep;
> */
> static const unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
> [TRANS_STATE_RUNNING] = 0U,
> + [TRANS_STATE_COMMIT_PREP] = 0U,
> [TRANS_STATE_COMMIT_START] = (__TRANS_START | __TRANS_ATTACH),
> [TRANS_STATE_COMMIT_DOING] = (__TRANS_START |
> __TRANS_ATTACH |
> @@ -1983,7 +1989,7 @@ void btrfs_commit_transaction_async(struct btrfs_trans_handle *trans)
> * Wait for the current transaction commit to start and block
> * subsequent transaction joins
> */
> - btrfs_might_wait_for_state(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_START);
> + btrfs_might_wait_for_state(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_PREP);
> wait_event(fs_info->transaction_blocked_wait,
> cur_trans->state >= TRANS_STATE_COMMIT_START ||
> TRANS_ABORTED(cur_trans));
> @@ -2130,7 +2136,7 @@ static void add_pending_snapshot(struct btrfs_trans_handle *trans)
> return;
>
> lockdep_assert_held(&trans->fs_info->trans_lock);
> - ASSERT(cur_trans->state >= TRANS_STATE_COMMIT_START);
> + ASSERT(cur_trans->state >= TRANS_STATE_COMMIT_PREP);
>
> list_add(&trans->pending_snapshot->list, &cur_trans->pending_snapshots);
> }
> @@ -2154,7 +2160,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> ktime_t interval;
>
> ASSERT(refcount_read(&trans->use_count) == 1);
> - btrfs_trans_state_lockdep_acquire(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_START);
> + btrfs_trans_state_lockdep_acquire(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_PREP);
>
> clear_bit(BTRFS_FS_NEED_TRANS_COMMIT, &fs_info->flags);
>
> @@ -2214,7 +2220,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> }
>
> spin_lock(&fs_info->trans_lock);
> - if (cur_trans->state >= TRANS_STATE_COMMIT_START) {
> + if (cur_trans->state >= TRANS_STATE_COMMIT_PREP) {
> enum btrfs_trans_state want_state = TRANS_STATE_COMPLETED;
>
> add_pending_snapshot(trans);
> @@ -2226,7 +2232,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> want_state = TRANS_STATE_SUPER_COMMITTED;
>
> btrfs_trans_state_lockdep_release(fs_info,
> - BTRFS_LOCKDEP_TRANS_COMMIT_START);
> + BTRFS_LOCKDEP_TRANS_COMMIT_PREP);
> ret = btrfs_end_transaction(trans);
> wait_for_commit(cur_trans, want_state);
>
> @@ -2238,9 +2244,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> return ret;
> }
>
> - cur_trans->state = TRANS_STATE_COMMIT_START;
> + cur_trans->state = TRANS_STATE_COMMIT_PREP;
> wake_up(&fs_info->transaction_blocked_wait);
> - btrfs_trans_state_lockdep_release(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_START);
> + btrfs_trans_state_lockdep_release(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_PREP);
>
> if (cur_trans->list.prev != &fs_info->trans_list) {
> enum btrfs_trans_state want_state = TRANS_STATE_COMPLETED;
> @@ -2261,11 +2267,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> btrfs_put_transaction(prev_trans);
> if (ret)
> goto lockdep_release;
> - } else {
> - spin_unlock(&fs_info->trans_lock);
> + spin_lock(&fs_info->trans_lock);
> }
> } else {
> - spin_unlock(&fs_info->trans_lock);
> /*
> * The previous transaction was aborted and was already removed
> * from the list of transactions at fs_info->trans_list. So we
> @@ -2273,11 +2277,16 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> * corrupt state (pointing to trees with unwritten nodes/leafs).
> */
> if (BTRFS_FS_ERROR(fs_info)) {
> + spin_unlock(&fs_info->trans_lock);
> ret = -EROFS;
> goto lockdep_release;
> }
> }
>
> + cur_trans->state = TRANS_STATE_COMMIT_START;
> + wake_up(&fs_info->transaction_blocked_wait);
> + spin_unlock(&fs_info->trans_lock);
> +
> /*
> * Get the time spent on the work done by the commit thread and not
> * the time spent waiting on a previous commit
> @@ -2587,7 +2596,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> goto cleanup_transaction;
>
> lockdep_trans_commit_start_release:
> - btrfs_trans_state_lockdep_release(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_START);
> + btrfs_trans_state_lockdep_release(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_PREP);
> btrfs_end_transaction(trans);
> return ret;
> }
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 8e9fa23bd7fe..6b309f8a99a8 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -14,6 +14,7 @@
>
> enum btrfs_trans_state {
> TRANS_STATE_RUNNING,
> + TRANS_STATE_COMMIT_PREP,
> TRANS_STATE_COMMIT_START,
> TRANS_STATE_COMMIT_DOING,
> TRANS_STATE_UNBLOCKED,
> --
> 2.26.3
>
next prev parent reply other threads:[~2023-08-25 11:53 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-08-24 20:59 [PATCH] btrfs: do not block starts waiting on previous transaction commit Josef Bacik
2023-08-25 11:51 ` Filipe Manana [this message]
2023-08-25 11:59 ` Filipe Manana
2023-08-25 18:38 ` Josef Bacik
2023-09-04 22:32 ` David Sterba
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZOiV3TvaMnpFQE2f@debian0.Home \
--to=fdmanana@kernel.org \
--cc=josef@toxicpanda.com \
--cc=kernel-team@fb.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox