public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Josef Bacik <josef@toxicpanda.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] btrfs: do not block starts waiting on previous transaction commit
Date: Fri, 25 Aug 2023 14:38:08 -0400	[thread overview]
Message-ID: <20230825183808.GA831059@perftesting> (raw)
In-Reply-To: <ZOiXnk2W0ajqyapa@debian0.Home>

On Fri, Aug 25, 2023 at 12:59:26PM +0100, Filipe Manana wrote:
> On Fri, Aug 25, 2023 at 12:51:57PM +0100, Filipe Manana wrote:
> > On Thu, Aug 24, 2023 at 04:59:22PM -0400, Josef Bacik wrote:
> > > Internally I got a report of very long stalls on normal operations like
> > > creating a new file when auto relocation was running.  The reporter used
> > > the bpf offcputime tracer to show that we would get stuck in
> > > start_transaction for 5 to 30 seconds, and were always being woken up by
> > > the transaction commit.
> > > 
> > > Using my timing-everything script, which times how long a function takes
> > > and what percentage of that total time is taken up by its children, I
> > > saw several traces like this
> > > 
> > > 1083 took 32812902424 ns
> > >         29929002926 ns 91.2110% wait_for_commit_duration
> > >         25568 ns 7.7920e-05% commit_fs_roots_duration
> > >         1007751 ns 0.00307% commit_cowonly_roots_duration
> > >         446855602 ns 1.36182% btrfs_run_delayed_refs_duration
> > >         271980 ns 0.00082% btrfs_run_delayed_items_duration
> > >         2008 ns 6.1195e-06% btrfs_apply_pending_changes_duration
> > >         9656 ns 2.9427e-05% switch_commit_roots_duration
> > >         1598 ns 4.8700e-06% btrfs_commit_device_sizes_duration
> > >         4314 ns 1.3147e-05% btrfs_free_log_root_tree_duration
> > > 
> > > Here I was only tracing functions that happen where we are between
> > > START_COMMIT and UNBLOCKED in order to see what would be keeping us
> > > blocked for so long.  The wait_for_commit() we do is where we wait for a
> > > previous transaction that hasn't completed it's commit.  This can
> > > include all of the unpin work and other cleanups, which tends to be the
> > > longest part of our transaction commit.
> > > 
> > > There is no reason we should be blocking new things from entering the
> > > transaction at this point, it just adds to random latency spikes for no
> > > reason.
> > > 
> > > Fix this by adding a PREP stage.  This allows us to properly deal with
> > > multiple committers coming in at the same time, we retain the behavior
> > > that the winner waits on the previous transaction and the losers all
> > > wait for this transaction commit to occur.  Nothing else is blocked
> > > during the PREP stage, and then once the wait is complete we switch to
> > > COMMIT_START and all of the same behavior as before is maintained.
> > > 
> > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > 
> > Reviewed-by: Filipe Manana <fdmanana@suse.com>
> > 
> > Looks good, thanks.
> > 
> > > ---
> > >  fs/btrfs/disk-io.c     |  8 ++++----
> > >  fs/btrfs/locking.h     |  2 +-
> > >  fs/btrfs/transaction.c | 39 ++++++++++++++++++++++++---------------
> > >  fs/btrfs/transaction.h |  1 +
> > >  4 files changed, 30 insertions(+), 20 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > > index 0a96ea8c1d3a..639f1e308e4c 100644
> > > --- a/fs/btrfs/disk-io.c
> > > +++ b/fs/btrfs/disk-io.c
> > > @@ -1547,7 +1547,7 @@ static int transaction_kthread(void *arg)
> > >  
> > >  		delta = ktime_get_seconds() - cur->start_time;
> > >  		if (!test_and_clear_bit(BTRFS_FS_COMMIT_TRANS, &fs_info->flags) &&
> > > -		    cur->state < TRANS_STATE_COMMIT_START &&
> > > +		    cur->state < TRANS_STATE_COMMIT_PREP &&
> > >  		    delta < fs_info->commit_interval) {
> > >  			spin_unlock(&fs_info->trans_lock);
> > >  			delay -= msecs_to_jiffies((delta - 1) * 1000);
> > > @@ -2682,8 +2682,8 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
> > >  	btrfs_lockdep_init_map(fs_info, btrfs_trans_num_extwriters);
> > >  	btrfs_lockdep_init_map(fs_info, btrfs_trans_pending_ordered);
> > >  	btrfs_lockdep_init_map(fs_info, btrfs_ordered_extent);
> > > -	btrfs_state_lockdep_init_map(fs_info, btrfs_trans_commit_start,
> > > -				     BTRFS_LOCKDEP_TRANS_COMMIT_START);
> > > +	btrfs_state_lockdep_init_map(fs_info, btrfs_trans_commit_prep,
> > > +				     BTRFS_LOCKDEP_TRANS_COMMIT_PREP);
> > >  	btrfs_state_lockdep_init_map(fs_info, btrfs_trans_unblocked,
> > >  				     BTRFS_LOCKDEP_TRANS_UNBLOCKED);
> > >  	btrfs_state_lockdep_init_map(fs_info, btrfs_trans_super_committed,
> > > @@ -4870,7 +4870,7 @@ static int btrfs_cleanup_transaction(struct btrfs_fs_info *fs_info)
> > >  	while (!list_empty(&fs_info->trans_list)) {
> > >  		t = list_first_entry(&fs_info->trans_list,
> > >  				     struct btrfs_transaction, list);
> > > -		if (t->state >= TRANS_STATE_COMMIT_START) {
> > > +		if (t->state >= TRANS_STATE_COMMIT_PREP) {
> > >  			refcount_inc(&t->use_count);
> > >  			spin_unlock(&fs_info->trans_lock);
> > >  			btrfs_wait_for_commit(fs_info, t->transid);
> > > diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
> > > index edb9b4a0dba1..7d6ee1e609bf 100644
> > > --- a/fs/btrfs/locking.h
> > > +++ b/fs/btrfs/locking.h
> > > @@ -79,7 +79,7 @@ enum btrfs_lock_nesting {
> > >  };
> > >  
> > >  enum btrfs_lockdep_trans_states {
> > > -	BTRFS_LOCKDEP_TRANS_COMMIT_START,
> > > +	BTRFS_LOCKDEP_TRANS_COMMIT_PREP,
> > >  	BTRFS_LOCKDEP_TRANS_UNBLOCKED,
> > >  	BTRFS_LOCKDEP_TRANS_SUPER_COMMITTED,
> > >  	BTRFS_LOCKDEP_TRANS_COMPLETED,
> > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > > index ab09542f2170..341363beaf10 100644
> > > --- a/fs/btrfs/transaction.c
> > > +++ b/fs/btrfs/transaction.c
> > > @@ -56,12 +56,17 @@ static struct kmem_cache *btrfs_trans_handle_cachep;
> > >   * |  Call btrfs_commit_transaction() on any trans handle attached to
> > >   * |  transaction N
> > >   * V
> > > + * Transaction N [[TRANS_STATE_COMMIT_PREP]]
> > > + * |
> > > + * | If there are simultaneous calls to btrfs_commit_transaction() one will win
> > > + * | the race and the rest will wait for the winner to commit the transaction.
> > > + * |
> > > + * | The winner will wait for previous running transaction to completely finish
> > > + * | if there is one.
> > > + * |
> > >   * Transaction N [[TRANS_STATE_COMMIT_START]]
> > >   * |
> > > - * | Will wait for previous running transaction to completely finish if there
> > > - * | is one
> > > - * |
> > > - * | Then one of the following happes:
> > > + * | Then one of the following happens:
> > >   * | - Wait for all other trans handle holders to release.
> > >   * |   The btrfs_commit_transaction() caller will do the commit work.
> > >   * | - Wait for current transaction to be committed by others.
> > > @@ -112,6 +117,7 @@ static struct kmem_cache *btrfs_trans_handle_cachep;
> > >   */
> > >  static const unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
> > >  	[TRANS_STATE_RUNNING]		= 0U,
> > > +	[TRANS_STATE_COMMIT_PREP]	= 0U,
> > >  	[TRANS_STATE_COMMIT_START]	= (__TRANS_START | __TRANS_ATTACH),
> > >  	[TRANS_STATE_COMMIT_DOING]	= (__TRANS_START |
> > >  					   __TRANS_ATTACH |
> > > @@ -1983,7 +1989,7 @@ void btrfs_commit_transaction_async(struct btrfs_trans_handle *trans)
> > >  	 * Wait for the current transaction commit to start and block
> > >  	 * subsequent transaction joins
> > >  	 */
> > > -	btrfs_might_wait_for_state(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_START);
> > > +	btrfs_might_wait_for_state(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_PREP);
> > >  	wait_event(fs_info->transaction_blocked_wait,
> > >  		   cur_trans->state >= TRANS_STATE_COMMIT_START ||
> > >  		   TRANS_ABORTED(cur_trans));
> > > @@ -2130,7 +2136,7 @@ static void add_pending_snapshot(struct btrfs_trans_handle *trans)
> > >  		return;
> > >  
> > >  	lockdep_assert_held(&trans->fs_info->trans_lock);
> > > -	ASSERT(cur_trans->state >= TRANS_STATE_COMMIT_START);
> > > +	ASSERT(cur_trans->state >= TRANS_STATE_COMMIT_PREP);
> > >  
> > >  	list_add(&trans->pending_snapshot->list, &cur_trans->pending_snapshots);
> > >  }
> > > @@ -2154,7 +2160,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> > >  	ktime_t interval;
> > >  
> > >  	ASSERT(refcount_read(&trans->use_count) == 1);
> > > -	btrfs_trans_state_lockdep_acquire(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_START);
> > > +	btrfs_trans_state_lockdep_acquire(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_PREP);
> > >  
> > >  	clear_bit(BTRFS_FS_NEED_TRANS_COMMIT, &fs_info->flags);
> > >  
> > > @@ -2214,7 +2220,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> > >  	}
> > >  
> > >  	spin_lock(&fs_info->trans_lock);
> > > -	if (cur_trans->state >= TRANS_STATE_COMMIT_START) {
> > > +	if (cur_trans->state >= TRANS_STATE_COMMIT_PREP) {
> > >  		enum btrfs_trans_state want_state = TRANS_STATE_COMPLETED;
> > >  
> > >  		add_pending_snapshot(trans);
> > > @@ -2226,7 +2232,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> > >  			want_state = TRANS_STATE_SUPER_COMMITTED;
> > >  
> > >  		btrfs_trans_state_lockdep_release(fs_info,
> > > -						  BTRFS_LOCKDEP_TRANS_COMMIT_START);
> > > +						  BTRFS_LOCKDEP_TRANS_COMMIT_PREP);
> > >  		ret = btrfs_end_transaction(trans);
> > >  		wait_for_commit(cur_trans, want_state);
> > >  
> > > @@ -2238,9 +2244,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> > >  		return ret;
> > >  	}
> > >  
> > > -	cur_trans->state = TRANS_STATE_COMMIT_START;
> > > +	cur_trans->state = TRANS_STATE_COMMIT_PREP;
> > >  	wake_up(&fs_info->transaction_blocked_wait);
> > > -	btrfs_trans_state_lockdep_release(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_START);
> > > +	btrfs_trans_state_lockdep_release(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_PREP);
> > >  
> > >  	if (cur_trans->list.prev != &fs_info->trans_list) {
> > >  		enum btrfs_trans_state want_state = TRANS_STATE_COMPLETED;
> > > @@ -2261,11 +2267,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> > >  			btrfs_put_transaction(prev_trans);
> > >  			if (ret)
> > >  				goto lockdep_release;
> > > -		} else {
> > > -			spin_unlock(&fs_info->trans_lock);
> > > +			spin_lock(&fs_info->trans_lock);
> > >  		}
> > >  	} else {
> > > -		spin_unlock(&fs_info->trans_lock);
> > >  		/*
> > >  		 * The previous transaction was aborted and was already removed
> > >  		 * from the list of transactions at fs_info->trans_list. So we
> > > @@ -2273,11 +2277,16 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> > >  		 * corrupt state (pointing to trees with unwritten nodes/leafs).
> > >  		 */
> > >  		if (BTRFS_FS_ERROR(fs_info)) {
> > > +			spin_unlock(&fs_info->trans_lock);
> > >  			ret = -EROFS;
> > >  			goto lockdep_release;
> > >  		}
> > >  	}
> > >  
> > > +	cur_trans->state = TRANS_STATE_COMMIT_START;
> > > +	wake_up(&fs_info->transaction_blocked_wait);
> 
> Wait, isn't this missing here a:
> 
>     btrfs_trans_state_lockdep_release(fs_info, BTRFS_LOCKDEP_TRANS_COMMIT_START);

No I renamed it to TRANS_COMMIT_PREP and tied it to PREP.  Thanks,

Josef

  reply	other threads:[~2023-08-25 18:38 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-24 20:59 [PATCH] btrfs: do not block starts waiting on previous transaction commit Josef Bacik
2023-08-25 11:51 ` Filipe Manana
2023-08-25 11:59   ` Filipe Manana
2023-08-25 18:38     ` Josef Bacik [this message]
2023-09-04 22:32 ` David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230825183808.GA831059@perftesting \
    --to=josef@toxicpanda.com \
    --cc=fdmanana@kernel.org \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox