From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1FF3AEB64DA for ; Thu, 20 Jul 2023 22:23:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230060AbjGTWXT (ORCPT ); Thu, 20 Jul 2023 18:23:19 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51154 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230054AbjGTWXS (ORCPT ); Thu, 20 Jul 2023 18:23:18 -0400 Received: from out2-smtp.messagingengine.com (out2-smtp.messagingengine.com [66.111.4.26]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 95C1C30F0 for ; Thu, 20 Jul 2023 15:22:57 -0700 (PDT) Received: from compute6.internal (compute6.nyi.internal [10.202.2.47]) by mailout.nyi.internal (Postfix) with ESMTP id 69BA65C0051; Thu, 20 Jul 2023 18:22:32 -0400 (EDT) Received: from mailfrontend1 ([10.202.2.162]) by compute6.internal (MEProxy); Thu, 20 Jul 2023 18:22:32 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bur.io; h=cc:cc :content-type:content-type:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:sender :subject:subject:to:to; s=fm1; t=1689891752; x=1689978152; bh=r7 wadMdu8i2snwX5mH2PXPXlsihiVbxoD9AcRVRnENo=; b=hxfmHo1NcDqW14SuLP 532fV8uIsa/wvodsbNCxya5WUAW/IuQXQQ1qfZ6dtuPIjM/rfoDHNYixY1mHg9+e 6Q86gi2A/QGBvUqwdnDG9QpP7RzVdKRR7RQM4UVpAZQsqmr17157msY2p2gQ7+w3 TF1vmuHjbO3Vs1ZDT3a1cPQVtfYE5Cvx8kcIZptQpYyTBiiUtShjG5hS1u9CdKcW UHvE69J4YEOm3wkigCZcvvQD/Ih31iI5aSXvhQmqOoNfjz2SviENMa7hCHY0Alj8 DTGm2DlOo24moRyzb6wkc+F4f2f3JArd83waJRFUjrbhzmt/O0fyt1Hv4Bc1TG4H RVIg== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:sender:subject :subject:to:to:x-me-proxy:x-me-proxy:x-me-sender:x-me-sender :x-sasl-enc; s=fm3; t=1689891752; x=1689978152; bh=r7wadMdu8i2sn wX5mH2PXPXlsihiVbxoD9AcRVRnENo=; b=N0eL1HOKIxMWE0TA+iHwJJj37dYqf wCM/zTF/zFcAgwXj2qy6fx5UwNtNWnngReXo5+/LIaoeiqt00F1oc2IrKK2YO/3F 2nLFednyZMiKeLaEdpEYiRPGYhF3tggXjQIuWHRvuRA2ZQkH/XTf/5kBTJvLVoyc k33ef3R9gDY7HIkcRdlSMIxI7O7zfZKvVNmSr7WNM+tuTSThWdQuwfZL5Xf9WVLk X1A0gE9nUUgcPoeIIWhGPexSzEeQcj43ZSRoqwhH6qWmMdUwGdCqgz1P6+i9F/Yh zFEeXow2HWb0rvF5/Ss1n5u5mE1h8XoDN6SBwYgErlaRnEcAf1+DBb+MA== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedviedrhedugddtkecutefuodetggdotefrodftvf curfhrohhfihhlvgemucfhrghsthforghilhdpqfgfvfdpuffrtefokffrpgfnqfghnecu uegrihhlohhuthemuceftddtnecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenuc fjughrpeffhffvvefukfhfgggtuggjsehttdertddttddvnecuhfhrohhmpeeuohhrihhs uceuuhhrkhhovhcuoegsohhrihhssegsuhhrrdhioheqnecuggftrfgrthhtvghrnhepke dvkeffjeellefhveehvdejudfhjedthfdvveeiieeiudfguefgtdejgfefleejnecuvehl uhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilhhfrhhomhepsghorhhishessg hurhdrihho X-ME-Proxy: Feedback-ID: i083147f8:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Thu, 20 Jul 2023 18:22:31 -0400 (EDT) Date: Thu, 20 Jul 2023 15:21:01 -0700 From: Boris Burkov To: Josef Bacik Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com Subject: Re: [PATCH 1/3] btrfs: wait for block groups to finish caching during allocation Message-ID: <20230720222101.GA545904@zen> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org On Thu, Jul 20, 2023 at 04:12:14PM -0400, Josef Bacik wrote: > Recently we've been having mysterious hangs while running generic/475 on > the CI system. This turned out to be something like this > > Task 1 > dmsetup suspend --nolockfs > -> __dm_suspend > -> dm_wait_for_completion > -> dm_wait_for_bios_completion > -> Unable to complete because of IO's on a plug in Task 2 > > Task 2 > wb_workfn > -> wb_writeback > -> blk_start_plug > -> writeback_sb_inodes > -> Infinite loop unable to make an allocation > > Task 3 > cache_block_group > ->read_extent_buffer_pages > ->Waiting for IO to complete that can't be submitted because Task 1 > suspended the DM device > > The problem here is that we need Task 2 to be scheduled completely for > the blk plug to flush. Normally this would happen, we normally wait for > the block group caching to finish (Task 3), and this schedule would > result in the block plug flushing. > > However if there's enough free space available from the current caching > to satisfy the allocation we won't actually wait for the caching to > complete. This check however just checks that we have enough space, not > that we can make the allocation. In this particular case we were trying > to allocate 9mib, and we had 10mib of free space, but we didn't have > 9mib of contiguous space to allocate, and thus the allocation failed and > we looped. > > We specifically don't cycle through the FFE loop until we stop finding > cached block groups because we don't want to allocate new block groups > just because we're caching, so we short circuit the normal loop once we > hit LOOP_CACHING_WAIT and we found a caching block group. > > This is normally fine, except in this particular case where the caching > thread can't make progress because the dm device has been suspended. > > Fix this by adding another LOOP state that specifically waits for the > block group to be completely cached before proceeding. This allows us > to drop this particular optimization, and will give us the proper > scheduling needed to finish the plug. > > The alternative here was to simply flush the plug if we need_resched(), > but this is actually a sort of bad behavior from us where we assume that > if the block group has enough free space to match our allocation we'll > actually be successful. It is a good enough check for a quick pass to > avoid the latency of a full wait, but free space != contiguous free > space, so waiting is more appropriate. > > Signed-off-by: Josef Bacik > --- > fs/btrfs/extent-tree.c | 20 +++++++++++++------- > 1 file changed, 13 insertions(+), 7 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 04ceb9d25d3e..2850bd411a0e 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -3331,6 +3331,7 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref) > enum btrfs_loop_type { > LOOP_CACHING_NOWAIT, > LOOP_CACHING_WAIT, > + LOOP_CACHING_DONE, > LOOP_UNSET_SIZE_CLASS, > LOOP_ALLOC_CHUNK, > LOOP_WRONG_SIZE_CLASS, > @@ -3920,9 +3921,6 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info, > return 0; > } > > - if (ffe_ctl->loop >= LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg) > - return 1; > - As far as I can tell, this change significantly diminishes the meaning/value of LOOP_CACHING_WAIT. It goes from a busy loop that continues until we stop failing on an uncached bgs (success or all bgs cached) to just a single retry before moving on to LOOP_CACHING_DONE which does the real wait_event for the bg to get cached. I am not convinced that a single retry does much compared to just going straight to the wait_event. I think it should be reasonably easy to answer this question by creating a big FS, mounting it and allocating right after and seeing if the allocations succeed in LOOP_CACHING_WAIT or LOOP_CACHING_DONE. Let's suppose that I guessed correctly and they succeed in LOOP_CACHING_DONE. Let's also assume, based on prior experience, that we can't always affort to wait for the bg to get fully cached in. Perhaps another option is to add a new event to the mix. It would be signalled when we make progress caching a bg, and LOOP_CACHING_WAIT could call the non-blocking btrfs_cache_block_group, then wait on that event. This would have better granularity than waiting for the whole block group while still having the desired scheduling behavior needed to fix this bug. > ffe_ctl->index++; > if (ffe_ctl->index < BTRFS_NR_RAID_TYPES) > return 1; > @@ -3931,6 +3929,8 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info, > * LOOP_CACHING_NOWAIT, search partially cached block groups, kicking > * caching kthreads as we move along > * LOOP_CACHING_WAIT, search everything, and wait if our bg is caching > + * LOOP_CACHING_DONE, search everything, wait for the caching to > + * completely finish > * LOOP_UNSET_SIZE_CLASS, allow unset size class > * LOOP_ALLOC_CHUNK, force a chunk allocation and try again > * LOOP_NO_EMPTY_SIZE, set empty_size and empty_cluster to 0 and try > @@ -3939,13 +3939,13 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info, > if (ffe_ctl->loop < LOOP_NO_EMPTY_SIZE) { > ffe_ctl->index = 0; > /* > - * We want to skip the LOOP_CACHING_WAIT step if we don't have > + * We want to skip the LOOP_CACHING_* steps if we don't have > * any uncached bgs and we've already done a full search > * through. > */ > if (ffe_ctl->loop == LOOP_CACHING_NOWAIT && > (!ffe_ctl->orig_have_caching_bg && full_search)) > - ffe_ctl->loop++; > + ffe_ctl->loop = LOOP_CACHING_DONE; > ffe_ctl->loop++; > > if (ffe_ctl->loop == LOOP_ALLOC_CHUNK) { > @@ -4269,8 +4269,11 @@ static noinline int find_free_extent(struct btrfs_root *root, > trace_find_free_extent_have_block_group(root, ffe_ctl, block_group); > ffe_ctl->cached = btrfs_block_group_done(block_group); > if (unlikely(!ffe_ctl->cached)) { > - ffe_ctl->have_caching_bg = true; > - ret = btrfs_cache_block_group(block_group, false); > + bool wait = ffe_ctl->loop == LOOP_CACHING_DONE; This feels quite unlikely, but I think it's also theoretically possible that every single call to btrfs_cache_block_group in the loop will fail with -ENOMEM, which we swallow. We then advance to the next big loop with more caching outstanding that we no longer wait for in any way. I think changing the wait check to >= LOOP_CACHING_DONE would fix this. > + > + if (!wait) > + ffe_ctl->have_caching_bg = true; > + ret = btrfs_cache_block_group(block_group, wait); I think a comment somewhere explaining that the wait_event this triggers is critical would be helpful. > > /* > * If we get ENOMEM here or something else we want to > @@ -4285,6 +4288,9 @@ static noinline int find_free_extent(struct btrfs_root *root, > ret = 0; > goto loop; > } > + > + if (wait) > + ffe_ctl->cached = btrfs_block_group_done(block_group); should we set have_caching_bg = false too? > ret = 0; > } > > -- > 2.41.0 >