From: Boris Burkov <boris@bur.io>
To: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH 1/3] btrfs: wait for block groups to finish caching during allocation
Date: Thu, 20 Jul 2023 15:21:01 -0700
Message-ID: <20230720222101.GA545904@zen>
In-Reply-To: <bd295f0e2277e34008b4aa5648527d0394472de1.1689883754.git.josef@toxicpanda.com>
On Thu, Jul 20, 2023 at 04:12:14PM -0400, Josef Bacik wrote:
> Recently we've been having mysterious hangs while running generic/475 on
> the CI system. This turned out to be something like this
>
> Task 1
> dmsetup suspend --nolockfs
> -> __dm_suspend
> -> dm_wait_for_completion
> -> dm_wait_for_bios_completion
> -> Unable to complete because of IO's on a plug in Task 2
>
> Task 2
> wb_workfn
> -> wb_writeback
> -> blk_start_plug
> -> writeback_sb_inodes
> -> Infinite loop unable to make an allocation
>
> Task 3
> cache_block_group
> ->read_extent_buffer_pages
> ->Waiting for IO to complete that can't be submitted because Task 1
> suspended the DM device
>
> The problem here is that we need Task 2 to be scheduled completely for
> the blk plug to flush. Normally this would happen, we normally wait for
> the block group caching to finish (Task 3), and this schedule would
> result in the block plug flushing.
>
> However if there's enough free space available from the current caching
> to satisfy the allocation we won't actually wait for the caching to
> complete. This check however just checks that we have enough space, not
> that we can make the allocation. In this particular case we were trying
> to allocate 9mib, and we had 10mib of free space, but we didn't have
> 9mib of contiguous space to allocate, and thus the allocation failed and
> we looped.
>
> We specifically don't cycle through the FFE loop until we stop finding
> cached block groups because we don't want to allocate new block groups
> just because we're caching, so we short circuit the normal loop once we
> hit LOOP_CACHING_WAIT and we found a caching block group.
>
> This is normally fine, except in this particular case where the caching
> thread can't make progress because the dm device has been suspended.
>
> Fix this by adding another LOOP state that specifically waits for the
> block group to be completely cached before proceeding. This allows us
> to drop this particular optimization, and will give us the proper
> scheduling needed to finish the plug.
>
> The alternative here was to simply flush the plug if we need_resched(),
> but this is actually a sort of bad behavior from us where we assume that
> if the block group has enough free space to match our allocation we'll
> actually be successful. It is a good enough check for a quick pass to
> avoid the latency of a full wait, but free space != contiguous free
> space, so waiting is more appropriate.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/btrfs/extent-tree.c | 20 +++++++++++++-------
> 1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 04ceb9d25d3e..2850bd411a0e 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3331,6 +3331,7 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref)
> enum btrfs_loop_type {
> LOOP_CACHING_NOWAIT,
> LOOP_CACHING_WAIT,
> + LOOP_CACHING_DONE,
> LOOP_UNSET_SIZE_CLASS,
> LOOP_ALLOC_CHUNK,
> LOOP_WRONG_SIZE_CLASS,
> @@ -3920,9 +3921,6 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info,
> return 0;
> }
>
> - if (ffe_ctl->loop >= LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg)
> - return 1;
> -
As far as I can tell, this change significantly diminishes the
meaning/value of LOOP_CACHING_WAIT. It goes from a busy loop that
continues until we stop failing on uncached bgs (success or all bgs
cached) to just a single retry before moving on to LOOP_CACHING_DONE,
which does the real wait_event for the bg to get cached.
I am not convinced that a single retry does much compared to just going
straight to the wait_event. It should be reasonably easy to answer that
question by creating a big FS, mounting it, allocating right away, and
seeing whether the allocations succeed in LOOP_CACHING_WAIT or in
LOOP_CACHING_DONE.
Let's suppose that I guessed correctly and they succeed in
LOOP_CACHING_DONE. Let's also assume, based on prior experience, that we
can't always afford to wait for the bg to get fully cached in.
Perhaps another option is to add a new event to the mix: signal it
whenever we make progress caching a bg, and have LOOP_CACHING_WAIT call
the non-blocking btrfs_cache_block_group() and then wait on that event.
That would have better granularity than waiting for the whole block
group to finish while still giving us the scheduling behavior needed to
fix this bug. Roughly, I'm picturing something like the sketch below.
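
Untested, and I'm hand-waving on names (I'm assuming a waitqueue on the
caching_ctl that the caching thread kicks as it adds free space, and the
helpers/fields I'm using may not line up exactly with what's in the
tree), but something in this spirit:

/*
 * Sketch only: wait until either caching finishes or the caching thread
 * has found enough free space to plausibly satisfy this allocation,
 * instead of waiting for the whole bg to be cached.
 */
#include "block-group.h"
#include "free-space-cache.h"

static void wait_block_group_cache_progress(struct btrfs_block_group *bg,
                                            u64 num_bytes)
{
        struct btrfs_caching_control *caching_ctl;

        caching_ctl = btrfs_get_caching_control(bg);
        if (!caching_ctl)
                return; /* Already fully cached, nothing to wait for. */

        /*
         * Sleeping here gives us a real schedule (which flushes the
         * caller's block plug), and we wake up as soon as the caching
         * thread has made enough progress rather than only at the end.
         */
        wait_event(caching_ctl->wait, btrfs_block_group_done(bg) ||
                   bg->free_space_ctl->free_space >= num_bytes);

        btrfs_put_caching_control(caching_ctl);
}

find_free_extent() in LOOP_CACHING_WAIT could then kick the non-blocking
btrfs_cache_block_group() and call something like the above with
ffe_ctl->num_bytes before retrying the bg.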
> ffe_ctl->index++;
> if (ffe_ctl->index < BTRFS_NR_RAID_TYPES)
> return 1;
> @@ -3931,6 +3929,8 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info,
> * LOOP_CACHING_NOWAIT, search partially cached block groups, kicking
> * caching kthreads as we move along
> * LOOP_CACHING_WAIT, search everything, and wait if our bg is caching
> + * LOOP_CACHING_DONE, search everything, wait for the caching to
> + * completely finish
> * LOOP_UNSET_SIZE_CLASS, allow unset size class
> * LOOP_ALLOC_CHUNK, force a chunk allocation and try again
> * LOOP_NO_EMPTY_SIZE, set empty_size and empty_cluster to 0 and try
> @@ -3939,13 +3939,13 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info,
> if (ffe_ctl->loop < LOOP_NO_EMPTY_SIZE) {
> ffe_ctl->index = 0;
> /*
> - * We want to skip the LOOP_CACHING_WAIT step if we don't have
> + * We want to skip the LOOP_CACHING_* steps if we don't have
> * any uncached bgs and we've already done a full search
> * through.
> */
> if (ffe_ctl->loop == LOOP_CACHING_NOWAIT &&
> (!ffe_ctl->orig_have_caching_bg && full_search))
> - ffe_ctl->loop++;
> + ffe_ctl->loop = LOOP_CACHING_DONE;
> ffe_ctl->loop++;
>
> if (ffe_ctl->loop == LOOP_ALLOC_CHUNK) {
> @@ -4269,8 +4269,11 @@ static noinline int find_free_extent(struct btrfs_root *root,
> trace_find_free_extent_have_block_group(root, ffe_ctl, block_group);
> ffe_ctl->cached = btrfs_block_group_done(block_group);
> if (unlikely(!ffe_ctl->cached)) {
> - ffe_ctl->have_caching_bg = true;
> - ret = btrfs_cache_block_group(block_group, false);
> + bool wait = ffe_ctl->loop == LOOP_CACHING_DONE;
This feels quite unlikely, but I think it's also theoretically possible
that every single call to btrfs_cache_block_group() in the loop fails
with -ENOMEM, which we swallow. We then advance to the next big loop
with caching still outstanding that we no longer wait for in any way.
I think changing the wait check to >= LOOP_CACHING_DONE would fix this.
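I.e. roughly this, just to show the check I mean in the context of the
hunk above (LOOP_CACHING_DONE being the new state from this patch):

        if (unlikely(!ffe_ctl->cached)) {
                /*
                 * Also wait in the loops after LOOP_CACHING_DONE, in
                 * case an -ENOMEM meant we never actually waited there.
                 */
                bool wait = ffe_ctl->loop >= LOOP_CACHING_DONE;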
> +
> + if (!wait)
> + ffe_ctl->have_caching_bg = true;
> + ret = btrfs_cache_block_group(block_group, wait);
I think a comment somewhere explaining that the wait_event this
triggers is critical (it's what gives us the schedule that flushes the
plug) would be helpful.
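Something along these lines, for example (wording is obviously yours to
pick, this is just where I'd hang it):

        /*
         * Waiting here is load-bearing: the wait_event() inside
         * btrfs_cache_block_group(..., wait == true) gives us a real
         * schedule, which flushes our block plug and lets the caching
         * IO we depend on actually complete.
         */
        ret = btrfs_cache_block_group(block_group, wait);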
>
> /*
> * If we get ENOMEM here or something else we want to
> @@ -4285,6 +4288,9 @@ static noinline int find_free_extent(struct btrfs_root *root,
> ret = 0;
> goto loop;
> }
> +
> + if (wait)
> + ffe_ctl->cached = btrfs_block_group_done(block_group);
should we set have_caching_bg = false too?
> ret = 0;
> }
>
> --
> 2.41.0
>