From: Boris Burkov <boris@bur.io>
To: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH 1/3] btrfs: wait for block groups to finish caching during allocation
Date: Thu, 20 Jul 2023 15:21:01 -0700
Message-ID: <20230720222101.GA545904@zen>
In-Reply-To: <bd295f0e2277e34008b4aa5648527d0394472de1.1689883754.git.josef@toxicpanda.com>

On Thu, Jul 20, 2023 at 04:12:14PM -0400, Josef Bacik wrote:
> Recently we've been having mysterious hangs while running generic/475 on
> the CI system.  This turned out to be something like this
> 
> Task 1
> dmsetup suspend --nolockfs
> -> __dm_suspend
>  -> dm_wait_for_completion
>   -> dm_wait_for_bios_completion
>    -> Unable to complete because of IOs on a plug in Task 2
> 
> Task 2
> wb_workfn
> -> wb_writeback
>  -> blk_start_plug
>   -> writeback_sb_inodes
>    -> Infinite loop unable to make an allocation
> 
> Task 3
> cache_block_group
> -> read_extent_buffer_pages
>  -> Waiting for IO to complete that can't be submitted because Task 1
>     suspended the DM device
> 
> The problem here is that we need Task 2 to actually schedule so that the
> blk plug gets flushed.  Normally this would happen: we wait for the
> block group caching to finish (Task 3), and that schedule would result
> in the block plug flushing.
> 
> However, if there's enough free space available from the current caching
> to satisfy the allocation, we won't actually wait for the caching to
> complete.  That check only verifies that we have enough space, though,
> not that we can make the allocation.  In this particular case we were
> trying to allocate 9MiB, and we had 10MiB of free space, but we didn't
> have 9MiB of contiguous space to allocate, so the allocation failed and
> we looped.
> 
> We specifically don't cycle through the FFE loop until we stop finding
> caching block groups, because we don't want to allocate new block groups
> just because we're still caching, so we short-circuit the normal loop
> once we hit LOOP_CACHING_WAIT and have found a caching block group.
> 
> This is normally fine, except in this particular case where the caching
> thread can't make progress because the dm device has been suspended.
> 
> Fix this by adding another LOOP state that specifically waits for the
> block group to be completely cached before proceeding.  This allows us
> to drop this particular optimization, and will give us the proper
> scheduling needed to finish the plug.
> 
> The alternative here was to simply flush the plug if we need_resched(),
> but that would keep a sort of bad behavior of ours, where we assume that
> if the block group has enough free space to match our allocation we'll
> actually succeed.  It is a good enough check for a quick pass to avoid
> the latency of a full wait, but free space != contiguous free space, so
> waiting is more appropriate.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent-tree.c | 20 +++++++++++++-------
>  1 file changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 04ceb9d25d3e..2850bd411a0e 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3331,6 +3331,7 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref)
>  enum btrfs_loop_type {
>  	LOOP_CACHING_NOWAIT,
>  	LOOP_CACHING_WAIT,
> +	LOOP_CACHING_DONE,
>  	LOOP_UNSET_SIZE_CLASS,
>  	LOOP_ALLOC_CHUNK,
>  	LOOP_WRONG_SIZE_CLASS,
> @@ -3920,9 +3921,6 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info,
>  		return 0;
>  	}
>  
> -	if (ffe_ctl->loop >= LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg)
> -		return 1;
> -

As far as I can tell, this change significantly diminishes the
meaning/value of LOOP_CACHING_WAIT. It goes from a busy loop that keeps
retrying until we stop failing on uncached bgs (success or all bgs
cached) to just a single retry before moving on to LOOP_CACHING_DONE,
which does the real wait_event for the bg to get cached.
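To spell out what I mean, the removed check used to pin us at the same
loop level (simplified from find_free_extent_update_loop(), not the
exact code):

	/* old behavior, before this patch: */
	if (ffe_ctl->loop >= LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg)
		return 1;	/* redo the search at the same loop level */
	...
	ffe_ctl->loop++;	/* only reached once nothing is still caching */

With that check gone, a failed LOOP_CACHING_WAIT pass just cycles the
raid indexes once and then loop++ drops us into LOOP_CACHING_DONE.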

I am not convinced that a single retry does much compared to just going
straight to the wait_event. It should be reasonably easy to answer that
question by creating a big FS, mounting it, allocating right after, and
seeing whether the allocations succeed in LOOP_CACHING_WAIT or in
LOOP_CACHING_DONE.
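(For that experiment, a throwaway debugging hack is probably enough,
e.g. something like this right where find_free_extent() returns
success, not meant for merging:

	trace_printk("ffe: allocation succeeded at loop %d\n", ffe_ctl->loop);

and then checking the trace buffer after the post-mount allocations.)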

Let's suppose that I guessed correctly and they succeed in
LOOP_CACHING_DONE. Let's also assume, based on prior experience, that we
can't always afford to wait for the bg to get fully cached.

Perhaps another option is to add a new event to the mix. It would be
signalled whenever we make progress caching a bg, and LOOP_CACHING_WAIT
could call the non-blocking btrfs_cache_block_group and then wait on
that event. That would have finer granularity than waiting for the whole
block group to finish caching, while still giving us the scheduling
behavior needed to fix this bug.
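Something along these lines is what I have in mind (completely untested
sketch; I'm reusing the existing caching_ctl->wait waitqueue rather than
a brand new event, since IIRC the caching thread already pokes it
periodically as it finds extents, and I'm hand-waving the locking and
where caching_ctl/prev_free come from):

	/* LOOP_CACHING_WAIT: kick caching without blocking on completion */
	ret = btrfs_cache_block_group(block_group, false);
	if (ret < 0)
		goto loop;

	/*
	 * caching_ctl would come from btrfs_get_caching_control(), and
	 * prev_free is a snapshot of the bg's free space from the last
	 * failed attempt.  Sleep until fully cached or more space shows
	 * up; the schedule here is what flushes our plug.
	 */
	wait_event(caching_ctl->wait,
		   btrfs_block_group_done(block_group) ||
		   block_group->free_space_ctl->free_space > prev_free);
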
>  	ffe_ctl->index++;
>  	if (ffe_ctl->index < BTRFS_NR_RAID_TYPES)
>  		return 1;
> @@ -3931,6 +3929,8 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info,
>  	 * LOOP_CACHING_NOWAIT, search partially cached block groups, kicking
>  	 *			caching kthreads as we move along
>  	 * LOOP_CACHING_WAIT, search everything, and wait if our bg is caching
> +	 * LOOP_CACHING_DONE, search everything, wait for the caching to
> +	 *			completely finish
>  	 * LOOP_UNSET_SIZE_CLASS, allow unset size class
>  	 * LOOP_ALLOC_CHUNK, force a chunk allocation and try again
>  	 * LOOP_NO_EMPTY_SIZE, set empty_size and empty_cluster to 0 and try
> @@ -3939,13 +3939,13 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info,
>  	if (ffe_ctl->loop < LOOP_NO_EMPTY_SIZE) {
>  		ffe_ctl->index = 0;
>  		/*
> -		 * We want to skip the LOOP_CACHING_WAIT step if we don't have
> +		 * We want to skip the LOOP_CACHING_* steps if we don't have
>  		 * any uncached bgs and we've already done a full search
>  		 * through.
>  		 */
>  		if (ffe_ctl->loop == LOOP_CACHING_NOWAIT &&
>  		    (!ffe_ctl->orig_have_caching_bg && full_search))
> -			ffe_ctl->loop++;
> +			ffe_ctl->loop = LOOP_CACHING_DONE;
>  		ffe_ctl->loop++;
>  
>  		if (ffe_ctl->loop == LOOP_ALLOC_CHUNK) {
> @@ -4269,8 +4269,11 @@ static noinline int find_free_extent(struct btrfs_root *root,
>  		trace_find_free_extent_have_block_group(root, ffe_ctl, block_group);
>  		ffe_ctl->cached = btrfs_block_group_done(block_group);
>  		if (unlikely(!ffe_ctl->cached)) {
> -			ffe_ctl->have_caching_bg = true;
> -			ret = btrfs_cache_block_group(block_group, false);
> +			bool wait = ffe_ctl->loop == LOOP_CACHING_DONE;

This feels quite unlikely, but I think it's also theoretically possible
that every single call to btrfs_cache_block_group in the loop fails with
-ENOMEM, which we swallow. We would then advance past LOOP_CACHING_DONE
with caching still outstanding that we never wait for in any way.

I think changing the wait check to >= LOOP_CACHING_DONE would fix this.
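i.e. something like:

	bool wait = ffe_ctl->loop >= LOOP_CACHING_DONE;

so that the later loops still do the blocking cache call if they run
into an uncached bg.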

> +
> +			if (!wait)
> +				ffe_ctl->have_caching_bg = true;
> +			ret = btrfs_cache_block_group(block_group, wait);

I think a comment somewhere explaining that the wait_event this triggers
is critical would be helpful.
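Maybe something along the lines of (wording obviously up to you):

	/*
	 * wait == true means btrfs_cache_block_group() ends up in a
	 * wait_event(), and that schedule is load bearing: it flushes
	 * this task's block plug.  Without it, writeback can spin here
	 * holding plugged IO that a dm suspend is waiting on, while the
	 * caching read we depend on is stuck behind that same suspend
	 * (the generic/475 hang).
	 */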

>  
>  			/*
>  			 * If we get ENOMEM here or something else we want to
> @@ -4285,6 +4288,9 @@ static noinline int find_free_extent(struct btrfs_root *root,
>  				ret = 0;
>  				goto loop;
>  			}
> +
> +			if (wait)
> +				ffe_ctl->cached = btrfs_block_group_done(block_group);

should we set have_caching_bg = false too?

>  			ret = 0;
>  		}
>  
> -- 
> 2.41.0
> 

