From: Qu Wenruo <wqu@suse.com>
To: Teng Liu <27rabbitlt@gmail.com>
Cc: linux-btrfs@vger.kernel.org, dsterba@suse.com, clm@fb.com,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH] btrfs: wait for in-flight readahead BIOs on open_ctree() error
Date: Tue, 31 Mar 2026 08:44:53 +1030 [thread overview]
Message-ID: <05020d93-524f-458d-a44a-765043ddbdb7@suse.com> (raw)
In-Reply-To: <6564fe96-ead0-45d4-9655-cd14f13bdc9a@suse.com>
在 2026/3/31 08:18, Qu Wenruo 写道:
>
>
> 在 2026/3/31 04:30, Teng Liu 写道:
[...]
>>>>
>>>> 3) Use buffer_tree xarray to iterate through all ebs
>>>> Since this is only for error handling of open_ctree(), we're
>>>> fine to
>>>> do the full xarray iteration, and wait for any eb that has
>>>> EXTENT_BUFFER_READING flag.
>>>>
>>>> The problem is, we do not have a dedicated tag like
>>>> PAGECACHE_TAG_(TOWRITE|DIRTY) to easily catch all dirty/writeback
>>>> ebs.
>>>> So the only option is to go through each eb and check their flags.
>>>>
>>>> I think this is the one with minimal impact, but may cause much
>>>> longer runtime during this error handling path.
>>>>
>>>> My personal preference is option 3).
>>>
>>> Or the 4th one, which is only an idea and I haven't yet verified:
>>>
>>> 4) Handle error from invalidate_inode_pages2()
>>> Currently we just call invalidate_inode_pages2() on btree inode and
>>> expect it to return 0.
>>>
>>> But if there is still an eb reading pending, it will make that
>>> function to return -EBUSY, as try_release_extent_buffer() will
>>> find a eb whose refs is not 0, and refuse the release that eb which
>>> belongs to a folio.
>>>
>>> That should be a good indicator of any pending metadata reads.
>>>
>>> So if that invalidate_inode_pages2() returned -EBUSY, we should wait
>>> retry until it returns 0.
>>>
>>>
>>
>> Thanks! Yes, it makes sense, simply waiting on the bio counter doesnt
>> fix the problem here.
>>
>> Among the options, I prefer option 3. Although it may be slower, but it
>> only happens in mount failure path so extra cost seems acceptable.
>
> The problem is not limited to mount failure, but also affects
> close_ctree() too.
>
> As it shares the same root problem, we have nothing to trace nor wait
> for any pending metadata read.
>
>>
>> I am quite new to btrfs codebase so I dont know whether
>> `invalidate_inode_pages2()` would be a reliable solution so maybe I
>> should start with option 3?
>
> Sure. Although iterating through xarray may not be that simple either,
> as you may still need to look into all kinds of extra locks/rcu lock
> etc, and if you apply that to the callsite of close_ctree(), it may be a
> much bigger problem, as we have a lot of more ebs compared to mount time.
>
> You can even mix option 3 and 4, e.g. only after
> invalidate_inode_pages2() failed with -EBUSY then switch to xarray
> iteration.
>
> This should greatly reduce the number of ebs that are still inside the
> xarray, thus makes the iteration much faster.
>
Although option 4 is much easier to implement. I'm already testing with
a testing patch applied, so far the fstests run looks pretty boring.
If you can verify this fix against the original report, it will be
appreciated.
But please note that, this is only a PoC, not perfect.
The biggest problem is the busy loop wait, as I hit a bug of an older
version that invalidate_inode_pages2() is called before freeing the root
pointers, thus invalidate_inode_pages2() will always return -EBUSY, and
take a CPU core to do the busy loop forever.
Even the current version has that problem fixed, it will still cause the
same unnecessary busy loop for very slow storage, which is far from ideal.
So option 3 is still needed to avoid busy loop, and may detect
unexpected dirty ebs better.
I believe the best option is really mixing option 3 and option 4.
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c835141ee384..39420d599822 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3706,7 +3706,11 @@ int __cold open_ctree(struct super_block *sb,
struct btrfs_fs_devices *fs_device
if (fs_info->data_reloc_root)
btrfs_drop_and_free_fs_root(fs_info,
fs_info->data_reloc_root);
free_root_pointers(fs_info, true);
- invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
+ ret = invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
+ while (ret) {
+ cond_resched();
+ ret =
invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
+ }
fail_sb_buffer:
btrfs_stop_all_workers(fs_info);
@@ -4434,19 +4438,23 @@ void __cold close_ctree(struct btrfs_fs_info
*fs_info)
btrfs_put_block_group_cache(fs_info);
- /*
- * we must make sure there is not any read request to
- * submit after we stopping all workers.
- */
- invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
- btrfs_stop_all_workers(fs_info);
-
/* We shouldn't have any transaction open at this point */
warn_about_uncommitted_trans(fs_info);
clear_bit(BTRFS_FS_OPEN, &fs_info->flags);
free_root_pointers(fs_info, true);
btrfs_free_fs_roots(fs_info);
+ /*
+ * we must make sure there is not any read request to
+ * submit after we stopping all workers.
+ */
+ ret = invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
+ while (ret) {
+ cond_resched();
+ ret =
invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
+ }
+ btrfs_stop_all_workers(fs_info);
+
/*
* We must free the block groups after dropping the fs_roots as
we could
--
prev parent reply other threads:[~2026-03-30 22:15 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-29 6:31 [PATCH] btrfs: wait for in-flight readahead BIOs on open_ctree() error Teng Liu
2026-03-29 7:03 ` Qu Wenruo
2026-03-29 17:23 ` Teng Liu
2026-03-29 22:06 ` Qu Wenruo
2026-03-29 22:21 ` Qu Wenruo
2026-03-30 18:00 ` Teng Liu
2026-03-30 21:48 ` Qu Wenruo
2026-03-30 22:14 ` Qu Wenruo [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=05020d93-524f-458d-a44a-765043ddbdb7@suse.com \
--to=wqu@suse.com \
--cc=27rabbitlt@gmail.com \
--cc=clm@fb.com \
--cc=dsterba@suse.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox