From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 24642C3DA66 for ; Fri, 25 Aug 2023 11:53:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S244420AbjHYLwf (ORCPT ); Fri, 25 Aug 2023 07:52:35 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39304 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237896AbjHYLwD (ORCPT ); Fri, 25 Aug 2023 07:52:03 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4530BE7F for ; Fri, 25 Aug 2023 04:52:01 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id D8976653FF for ; Fri, 25 Aug 2023 11:52:00 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id E5F85C433C7; Fri, 25 Aug 2023 11:51:59 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1692964320; bh=pK6UECNO3rCoWwt2z7+nT6zYY09tA5KVtMeYw/bYwfc=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=sjBkyHha0PVbdw/mMo3P0xuo9EHmvHOlwzbS6Wl5JrbbyT7r//JIDDUv8OE75iQcI kWNxBwbsfXyps8CrmNwifnF5XMU2/mGVI1f5NFe2wuBQMbjiq45VJLG0yvq6IHG/rj k6fwB2FwN4xI35d4faG7pM5LZ1tUsFo4unwA2OVN+lBSL8dbqEnp0p+V8mUXvCHyb3 VkEsXUgy4kLd+oWKZt4QxAExdt05CglZp7rOw4LjwpeYFCIFsne3xa5uEF83oCqW/h WOI4h0bgj/35McT2YsArofjhsKeYZ/2kATXcP6BvPu0ofcOl8ZIcJKMkigXF2hmpvP niiMGAAaKhKfA== Date: Fri, 25 Aug 2023 12:51:57 +0100 From: Filipe Manana To: Josef Bacik Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com Subject: Re: [PATCH] btrfs: do not block starts waiting on previous transaction commit Message-ID: References: <042d69d13a10b90a29b5e096db59b9669fac68d2.1692910751.git.josef@toxicpanda.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <042d69d13a10b90a29b5e096db59b9669fac68d2.1692910751.git.josef@toxicpanda.com> Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org On Thu, Aug 24, 2023 at 04:59:22PM -0400, Josef Bacik wrote: > Internally I got a report of very long stalls on normal operations like > creating a new file when auto relocation was running. The reporter used > the bpf offcputime tracer to show that we would get stuck in > start_transaction for 5 to 30 seconds, and were always being woken up by > the transaction commit. > > Using my timing-everything script, which times how long a function takes > and what percentage of that total time is taken up by its children, I > saw several traces like this > > 1083 took 32812902424 ns > 29929002926 ns 91.2110% wait_for_commit_duration > 25568 ns 7.7920e-05% commit_fs_roots_duration > 1007751 ns 0.00307% commit_cowonly_roots_duration > 446855602 ns 1.36182% btrfs_run_delayed_refs_duration > 271980 ns 0.00082% btrfs_run_delayed_items_duration > 2008 ns 6.1195e-06% btrfs_apply_pending_changes_duration > 9656 ns 2.9427e-05% switch_commit_roots_duration > 1598 ns 4.8700e-06% btrfs_commit_device_sizes_duration > 4314 ns 1.3147e-05% btrfs_free_log_root_tree_duration > > Here I was only tracing functions that happen where we are between > START_COMMIT and UNBLOCKED in order to see what would be keeping us > blocked for so long. The wait_for_commit() we do is where we wait for a > previous transaction that hasn't completed it's commit. This can > include all of the unpin work and other cleanups, which tends to be the > longest part of our transaction commit. > > There is no reason we should be blocking new things from entering the > transaction at this point, it just adds to random latency spikes for no > reason. > > Fix this by adding a PREP stage. This allows us to properly deal with > multiple committers coming in at the same time, we retain the behavior > that the winner waits on the previous transaction and the losers all > wait for this transaction commit to occur. Nothing else is blocked > during the PREP stage, and then once the wait is complete we switch to > COMMIT_START and all of the same behavior as before is maintained. > > Signed-off-by: Josef Bacik Reviewed-by: Filipe Manana Looks good, thanks. > --- > fs/btrfs/disk-io.c | 8 ++++---- > fs/btrfs/locking.h | 2 +- > fs/btrfs/transaction.c | 39 ++++++++++++++++++++++++--------------- > fs/btrfs/transaction.h | 1 + > 4 files changed, 30 insertions(+), 20 deletions(-) > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index 0a96ea8c1d3a..639f1e308e4c 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -1547,7 +1547,7 @@ static int transaction_kthread(void *arg) > > delta = ktime_get_seconds() - cur->start_time; > if (!test_and_clear_bit(BTRFS_FS_COMMIT_TRANS, &fs_info->flags) && > - cur->state < TRANS_STATE_COMMIT_START && > + cur->state < TRANS_STATE_COMMIT_PREP && > delta < fs_info->commit_interval) { > spin_unlock(&fs_info->trans_lock); > delay -= msecs_to_jiffies((delta - 1) * 1000); > @@ -2682,8 +2682,8 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info) > btrfs_lockdep_init_map(fs_info, btrfs_trans_num_extwriters); > btrfs_lockdep_init_map(fs_info, btrfs_trans_pending_ordered); > btrfs_lockdep_init_map(fs_info, btrfs_ordered_extent); > - btrfs_state_lockdep_init_map(fs_info, btrfs_trans_commit_start, > - BTRFS_LOCKDEP_TRANS_COMMIT_START); > + btrfs_state_lockdep_init_map(fs_info, btrfs_trans_commit_prep, > + BTRFS_LOCKDEP_TRANS_COMMIT_PREP); > btrfs_state_lockdep_init_map(fs_info, btrfs_trans_unblocked, > BTRFS_LOCKDEP_TRANS_UNBLOCKED); > btrfs_state_lockdep_init_map(fs_info, btrfs_trans_super_committed, > @@ -4870,7 +4870,7 @@ static int btrfs_cleanup_transaction(struct btrfs_fs_info *fs_info) > while (!list_empty(&fs_info->trans_list)) { > t = list_first_entry(&fs_info->trans_list, > struct btrfs_transaction, list); > - if (t->state >= TRANS_STATE_COMMIT_START) { > + if (t->state >= TRANS_STATE_COMMIT_PREP) { > refcount_inc(&t->use_count); > spin_unlock(&fs_info->trans_lock); > btrfs_wait_for_commit(fs_info, t->transid); > diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h > index edb9b4a0dba1..7d6ee1e609bf 100644 > --- a/fs/btrfs/locking.h > +++ b/fs/btrfs/locking.h > @@ -79,7 +79,7 @@ enum btrfs_lock_nesting { > }; > > enum btrfs_lockdep_trans_states { > - BTRFS_LOCKDEP_TRANS_COMMIT_START, > + BTRFS_LOCKDEP_TRANS_COMMIT_PREP, > BTRFS_LOCKDEP_TRANS_UNBLOCKED, > BTRFS_LOCKDEP_TRANS_SUPER_COMMITTED, > BTRFS_LOCKDEP_TRANS_COMPLETED, > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c > index ab09542f2170..341363beaf10 100644 > --- a/fs/btrfs/transaction.c > +++ b/fs/btrfs/transaction.c > @@ -56,12 +56,17 @@ static struct kmem_cache *btrfs_trans_handle_cachep; > * | Call btrfs_commit_transaction() on any trans handle attached to > * | transaction N > * V > + * Transaction N [[TRANS_STATE_COMMIT_PREP]] > + * | > + * | If there are simultaneous calls to btrfs_commit_transaction() one will win > + * | the race and the rest will wait for the winner to commit the transaction. > + * | > + * | The winner will wait for previous running transaction to completely finish > + * | if there is one. > + * | > * Transaction N [[TRANS_STATE_COMMIT_START]] > * | > - * | Will wait for previous running transaction to completely finish if there > - * | is one > - * | > - * | Then one of the following happes: > + * | Then one of the following happens: > * | - Wait for all other trans handle holders to release. > * | The btrfs_commit_transaction() caller will do the commit work. > * | - Wait for current transaction to be committed by others. > @@ -112,6 +117,7 @@ static struct kmem_cache *btrfs_trans_handle_cachep; > */ > static const unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = { > [TRANS_STATE_RUNNING] = 0U, > + [TRANS_STATE_COMMIT_PREP] = 0U, > [TRANS_STATE_COMMIT_START] = (__TRANS_START | __TRANS_ATTACH), > [TRANS_STATE_COMMIT_DOING] = (__TRANS_START | > __TRANS_ATTACH | > @@ -1983,7 +1989,7 @@ void btrfs_commit_transaction_async(struct btrfs_trans_handle *trans) > * Wait for the current transaction commit to start and block > * subsequent transaction joins > */ > - btrfs_might_wait_for_state(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_START); > + btrfs_might_wait_for_state(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_PREP); > wait_event(fs_info->transaction_blocked_wait, > cur_trans->state >= TRANS_STATE_COMMIT_START || > TRANS_ABORTED(cur_trans)); > @@ -2130,7 +2136,7 @@ static void add_pending_snapshot(struct btrfs_trans_handle *trans) > return; > > lockdep_assert_held(&trans->fs_info->trans_lock); > - ASSERT(cur_trans->state >= TRANS_STATE_COMMIT_START); > + ASSERT(cur_trans->state >= TRANS_STATE_COMMIT_PREP); > > list_add(&trans->pending_snapshot->list, &cur_trans->pending_snapshots); > } > @@ -2154,7 +2160,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) > ktime_t interval; > > ASSERT(refcount_read(&trans->use_count) == 1); > - btrfs_trans_state_lockdep_acquire(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_START); > + btrfs_trans_state_lockdep_acquire(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_PREP); > > clear_bit(BTRFS_FS_NEED_TRANS_COMMIT, &fs_info->flags); > > @@ -2214,7 +2220,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) > } > > spin_lock(&fs_info->trans_lock); > - if (cur_trans->state >= TRANS_STATE_COMMIT_START) { > + if (cur_trans->state >= TRANS_STATE_COMMIT_PREP) { > enum btrfs_trans_state want_state = TRANS_STATE_COMPLETED; > > add_pending_snapshot(trans); > @@ -2226,7 +2232,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) > want_state = TRANS_STATE_SUPER_COMMITTED; > > btrfs_trans_state_lockdep_release(fs_info, > - BTRFS_LOCKDEP_TRANS_COMMIT_START); > + BTRFS_LOCKDEP_TRANS_COMMIT_PREP); > ret = btrfs_end_transaction(trans); > wait_for_commit(cur_trans, want_state); > > @@ -2238,9 +2244,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) > return ret; > } > > - cur_trans->state = TRANS_STATE_COMMIT_START; > + cur_trans->state = TRANS_STATE_COMMIT_PREP; > wake_up(&fs_info->transaction_blocked_wait); > - btrfs_trans_state_lockdep_release(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_START); > + btrfs_trans_state_lockdep_release(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_PREP); > > if (cur_trans->list.prev != &fs_info->trans_list) { > enum btrfs_trans_state want_state = TRANS_STATE_COMPLETED; > @@ -2261,11 +2267,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) > btrfs_put_transaction(prev_trans); > if (ret) > goto lockdep_release; > - } else { > - spin_unlock(&fs_info->trans_lock); > + spin_lock(&fs_info->trans_lock); > } > } else { > - spin_unlock(&fs_info->trans_lock); > /* > * The previous transaction was aborted and was already removed > * from the list of transactions at fs_info->trans_list. So we > @@ -2273,11 +2277,16 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) > * corrupt state (pointing to trees with unwritten nodes/leafs). > */ > if (BTRFS_FS_ERROR(fs_info)) { > + spin_unlock(&fs_info->trans_lock); > ret = -EROFS; > goto lockdep_release; > } > } > > + cur_trans->state = TRANS_STATE_COMMIT_START; > + wake_up(&fs_info->transaction_blocked_wait); > + spin_unlock(&fs_info->trans_lock); > + > /* > * Get the time spent on the work done by the commit thread and not > * the time spent waiting on a previous commit > @@ -2587,7 +2596,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) > goto cleanup_transaction; > > lockdep_trans_commit_start_release: > - btrfs_trans_state_lockdep_release(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_START); > + btrfs_trans_state_lockdep_release(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_PREP); > btrfs_end_transaction(trans); > return ret; > } > diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h > index 8e9fa23bd7fe..6b309f8a99a8 100644 > --- a/fs/btrfs/transaction.h > +++ b/fs/btrfs/transaction.h > @@ -14,6 +14,7 @@ > > enum btrfs_trans_state { > TRANS_STATE_RUNNING, > + TRANS_STATE_COMMIT_PREP, > TRANS_STATE_COMMIT_START, > TRANS_STATE_COMMIT_DOING, > TRANS_STATE_UNBLOCKED, > -- > 2.26.3 >