* [PATCH] btrfs: wait for in-flight readahead BIOs on open_ctree() error
@ 2026-03-29  6:31 Teng Liu
  2026-03-29  7:03 ` Qu Wenruo
  0 siblings, 1 reply; 8+ messages in thread
From: Teng Liu @ 2026-03-29  6:31 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, clm, linux-kernel, Teng Liu

When open_ctree() fails during btrfs_read_chunk_tree(), readahead BIOs
submitted by readahead_tree_node_children() may still be in flight. The
error path frees fs_info without waiting for these BIOs to complete.
When a readahead BIO later completes, btrfs_simple_end_io() calls
btrfs_bio_counter_dec(), which accesses the already-freed
fs_info->dev_replace.bio_counter, causing a use-after-free.

This can be triggered by connecting a USB drive with a corrupted btrfs
filesystem (e.g. chunk tree destroyed by a partial format), where the
slow USB device keeps readahead BIOs in flight long enough for the
error path to free fs_info before they complete. It can be reproduced
on qemu with a suitably corrupted btrfs image.

BTRFS error (device sda): failed to read chunk tree: -2
BTRFS error (device sda): open_ctree failed: -2
BUG: unable to handle page fault for address: ffff89322ceb3000
RIP: 0010:percpu_counter_add_batch+0xe/0xb0
 btrfs_bio_counter_sub+0x22/0x60
 btrfs_simple_end_io+0x32/0x90
 blk_update_request+0x12b/0x480
 scsi_end_request+0x26/0x1b0
 scsi_io_completion+0x50/0x790

Fix this by waiting for the bio_counter to reach zero in the error path
before stopping workers, so all in-flight BIOs have completed their
callbacks before fs_info is freed. The bio_counter is already
initialized in init_mount_fs_info(), so this wait is safe for all error
paths reaching the fail_sb_buffer label.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=221270
Reported-by: AHN SEOK-YOUNG
Signed-off-by: Teng Liu <27rabbitlt@gmail.com>
---
 fs/btrfs/disk-io.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 01f2dbb69..61e6b8dca 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3723,6 +3723,18 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
 
 fail_sb_buffer:
+	/*
+	 * Wait for in-flight readahead BIOs before stopping workers.
+	 * Readahead BIOs from btrfs_read_chunk_tree() (via
+	 * readahead_tree_node_children) may still be in flight on slow
+	 * devices (e.g. USB). Their completion callbacks
+	 * (btrfs_simple_end_io) access fs_info->dev_replace.bio_counter
+	 * which would be destroyed later, causing a use-after-free.
+	 * The bio_counter was already initialized in init_mount_fs_info()
+	 * so this wait is safe for all error paths reaching this label.
+	 */
+	wait_event(fs_info->dev_replace.replace_wait,
+		   percpu_counter_sum(&fs_info->dev_replace.bio_counter) == 0);
 	btrfs_stop_all_workers(fs_info);
 	btrfs_free_block_groups(fs_info);
 fail_alloc:
-- 
2.53.0

^ permalink raw reply related	[flat|nested] 8+ messages in thread
* Re: [PATCH] btrfs: wait for in-flight readahead BIOs on open_ctree() error
  2026-03-29  6:31 [PATCH] btrfs: wait for in-flight readahead BIOs on open_ctree() error Teng Liu
@ 2026-03-29  7:03 ` Qu Wenruo
  2026-03-29 17:23   ` Teng Liu
  0 siblings, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2026-03-29  7:03 UTC (permalink / raw)
  To: Teng Liu, linux-btrfs; +Cc: dsterba, clm, linux-kernel

On 2026/3/29 17:01, Teng Liu wrote:
> When open_ctree() fails during btrfs_read_chunk_tree(), readahead BIOs
> submitted by readahead_tree_node_children() may still be in flight. The
> error path frees fs_info without waiting for these BIOs to complete.
> When a readahead BIO later completes, btrfs_simple_end_io() calls
> btrfs_bio_counter_dec(), which accesses the already-freed
> fs_info->dev_replace.bio_counter, causing a use-after-free.
>
> This can be triggered by connecting a USB drive with a corrupted btrfs
> filesystem (e.g. chunk tree destroyed by a partial format), where the
> slow USB device keeps readahead BIOs in flight long enough for the
> error path to free fs_info before they complete. It can be reproduced
> on qemu with a suitably corrupted btrfs image.
>
> BTRFS error (device sda): failed to read chunk tree: -2
> BTRFS error (device sda): open_ctree failed: -2
> BUG: unable to handle page fault for address: ffff89322ceb3000
> RIP: 0010:percpu_counter_add_batch+0xe/0xb0
>  btrfs_bio_counter_sub+0x22/0x60
>  btrfs_simple_end_io+0x32/0x90
>  blk_update_request+0x12b/0x480
>  scsi_end_request+0x26/0x1b0
>  scsi_io_completion+0x50/0x790
>
> Fix this by waiting for the bio_counter to reach zero in the error path
> before stopping workers, so all in-flight BIOs have completed their
> callbacks before fs_info is freed. The bio_counter is already
> initialized in init_mount_fs_info(), so this wait is safe for all error
> paths reaching the fail_sb_buffer label.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=221270
> Reported-by: AHN SEOK-YOUNG
> Signed-off-by: Teng Liu <27rabbitlt@gmail.com>
> ---
>  fs/btrfs/disk-io.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 01f2dbb69..61e6b8dca 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3723,6 +3723,18 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
>  	invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
>
>  fail_sb_buffer:
> +	/*
> +	 * Wait for in-flight readahead BIOs before stopping workers.
> +	 * Readahead BIOs from btrfs_read_chunk_tree() (via
> +	 * readahead_tree_node_children) may still be in flight on slow
> +	 * devices (e.g. USB). Their completion callbacks
> +	 * (btrfs_simple_end_io) access fs_info->dev_replace.bio_counter
> +	 * which would be destroyed later, causing a use-after-free.
> +	 * The bio_counter was already initialized in init_mount_fs_info()
> +	 * so this wait is safe for all error paths reaching this label.
> +	 */
> +	wait_event(fs_info->dev_replace.replace_wait,
> +		   percpu_counter_sum(&fs_info->dev_replace.bio_counter) == 0);

This doesn't make any sense to me.

The wait and counter are all for dev-replace, which doesn't match your
description of generic metadata readahead.

If you want to wait for all existing metadata reads, I couldn't find a
good helper, so you will need to go through all extent buffers and wait
for the EXTENT_BUFFER_READING flag.

>  	btrfs_stop_all_workers(fs_info);
>  	btrfs_free_block_groups(fs_info);
>  fail_alloc:

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH] btrfs: wait for in-flight readahead BIOs on open_ctree() error
  2026-03-29  7:03 ` Qu Wenruo
@ 2026-03-29 17:23   ` Teng Liu
  2026-03-29 22:06     ` Qu Wenruo
  0 siblings, 1 reply; 8+ messages in thread
From: Teng Liu @ 2026-03-29 17:23 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: dsterba, clm, linux-kernel

Thanks for your review!

On 2026-03-29 17:33, Qu Wenruo wrote:
>
>
> This doesn't make any sense to me.

It confused me as well when I tried to reproduce the bug. The reporter
claimed that btrfs_bio_counter_sub() triggered a use-after-free, but
this function lives in `dev-replace.c`, which judging by the name should
have nothing to do with this setup.

However, when I checked the function call chain:

open_ctree()
 → btrfs_read_sys_array()       # OK — sys_chunk_array in superblock is intact
 → load_super_root(chunk_root)  # OK — reads root node, passes validation
 → btrfs_read_chunk_tree()
    → btrfs_for_each_slot()
       → readahead_tree_node_children(node)
          → for each child pointer in the internal node:
             btrfs_readahead_node_child()
              → btrfs_readahead_tree_block()
                 → read_extent_buffer_pages_nowait()
                    → btrfs_submit_bbio()
                       → btrfs_submit_chunk()
                          → btrfs_bio_counter_inc_blocked()  ← bio_counter++
                          → btrfs_map_block()
                             → submit_bio()  ← sent to USB drive

After submit_bio() sends the BIO to the USB drive, we continue on to
read_one_dev():

open_ctree()
 → btrfs_read_sys_array()       # OK — sys_chunk_array in superblock is intact
 → load_super_root(chunk_root)  # OK — reads root node, passes validation
 → btrfs_read_chunk_tree()
    → btrfs_for_each_slot()
       → readahead_tree_node_children(node)
          → bio_counter++ and submit_bio() send the BIO to the USB drive
    → read_one_dev()

This read_one_dev() will return an error since the leaf block is
actually corrupted. open_ctree() then enters the error path and tries to
free fs_info.

After the USB device finishes the BIO, it will try to decrement the
counter, but fs_info is already freed.

Any suggestions on this?
>
> The wait and counter are all for dev-replace, which doesn't match your
> description of generic metadata readahead.
>
> If you want to wait for all existing metadata reads, I couldn't find a
> good helper, so you will need to go through all extent buffers and
> wait for the EXTENT_BUFFER_READING flag.
>

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH] btrfs: wait for in-flight readahead BIOs on open_ctree() error
  2026-03-29 17:23 ` Teng Liu
@ 2026-03-29 22:06   ` Qu Wenruo
  2026-03-29 22:21     ` Qu Wenruo
  0 siblings, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2026-03-29 22:06 UTC (permalink / raw)
  To: Teng Liu, linux-btrfs; +Cc: dsterba, clm, linux-kernel

On 2026/3/30 03:53, Teng Liu wrote:
> Thanks for your review!
> On 2026-03-29 17:33, Qu Wenruo wrote:
>>
>>
>> This doesn't make any sense to me.
> It confused me as well when I tried to reproduce the bug. The reporter
> claimed that btrfs_bio_counter_sub() triggered a use-after-free, but
> this function lives in `dev-replace.c`, which judging by the name
> should have nothing to do with this setup.
>
> However, when I checked the function call chain:
>
> open_ctree()
>  → btrfs_read_sys_array()       # OK — sys_chunk_array in superblock is intact
>  → load_super_root(chunk_root)  # OK — reads root node, passes validation
>  → btrfs_read_chunk_tree()
>     → btrfs_for_each_slot()
>        → readahead_tree_node_children(node)
>           → for each child pointer in the internal node:
>              btrfs_readahead_node_child()
>               → btrfs_readahead_tree_block()
>                  → read_extent_buffer_pages_nowait()
>                     → btrfs_submit_bbio()
>                        → btrfs_submit_chunk()
>                           → btrfs_bio_counter_inc_blocked()  ← bio_counter++
>                           → btrfs_map_block()
>                              → submit_bio()  ← sent to USB drive

Even if you wait for all bios, it can still cause problems.

As the bio counter is only for the btrfs bio layer, we still have
btrfs_bio::end_io called after btrfs_bio_counter_dec().

And if the full fs_info has been freed, then at end_bbio_meta_read() we
can still have problems, as btrfs_validate_extent_buffer() will access
the eb (bbio->private) and fs_info (eb->fs_info), which triggers a
use-after-free.

So using that bio counter is not going to solve all problems; it only
reduces the race window, thus masking the problem.
>
> After submit_bio() sends the BIO to the USB drive, we continue on to
> read_one_dev():
>
> open_ctree()
>  → btrfs_read_sys_array()       # OK — sys_chunk_array in superblock is intact
>  → load_super_root(chunk_root)  # OK — reads root node, passes validation
>  → btrfs_read_chunk_tree()
>     → btrfs_for_each_slot()
>        → readahead_tree_node_children(node)
>           → bio_counter++ and submit_bio() send the BIO to the USB drive
>     → read_one_dev()
>
> This read_one_dev() will return an error since the leaf block is
> actually corrupted. open_ctree() then enters the error path and tries
> to free fs_info.
>
> After the USB device finishes the BIO, it will try to decrement the
> counter, but fs_info is already freed.
>
> Any suggestions on this?

The following ideas come to mind, but none seems as simple as your
current one:

1) Introduce a dedicated counter for metadata readahead/reads
   This seems to be the simplest one of all.
   But its only user would be this error handling, so it may not be
   worth it.

2) Disable metadata readahead during open_ctree()
   Which will delay the mount, especially for a large extent tree
   without the bgt feature.

3) Use the buffer_tree xarray to iterate through all ebs
   Since this is only for error handling of open_ctree(), we're fine
   to do the full xarray iteration, and wait for any eb that has the
   EXTENT_BUFFER_READING flag.

   The problem is, we do not have a dedicated tag like
   PAGECACHE_TAG_(TOWRITE|DIRTY) to easily catch all dirty/writeback
   ebs. So the only option is to go through each eb and check its
   flags.

   I think this is the one with minimal impact, but it may cause a
   much longer runtime in this error handling path.

My personal preference is option 3).

>
>
>> The wait and counter are all for dev-replace, which doesn't match
>> your description of generic metadata readahead.
>>
>> If you want to wait for all existing metadata reads, I couldn't find
>> a good helper, so you will need to go through all extent buffers and
>> wait for the EXTENT_BUFFER_READING flag.
>>
>>
>
>

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH] btrfs: wait for in-flight readahead BIOs on open_ctree() error
  2026-03-29 22:06 ` Qu Wenruo
@ 2026-03-29 22:21   ` Qu Wenruo
  2026-03-30 18:00     ` Teng Liu
  0 siblings, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2026-03-29 22:21 UTC (permalink / raw)
  To: Teng Liu, linux-btrfs; +Cc: dsterba, clm, linux-kernel

On 2026/3/30 08:36, Qu Wenruo wrote:
>
>
> On 2026/3/30 03:53, Teng Liu wrote:
>> Thanks for your review!
>> On 2026-03-29 17:33, Qu Wenruo wrote:
>>>
>>>
>>> This doesn't make any sense to me.
>> It confused me as well when I tried to reproduce the bug. The
>> reporter claimed that btrfs_bio_counter_sub() triggered a
>> use-after-free, but this function lives in `dev-replace.c`, which
>> judging by the name should have nothing to do with this setup.
>>
>> However, when I checked the function call chain:
>>
>> open_ctree()
>>  → btrfs_read_sys_array()       # OK — sys_chunk_array in superblock is intact
>>  → load_super_root(chunk_root)  # OK — reads root node, passes validation
>>  → btrfs_read_chunk_tree()
>>     → btrfs_for_each_slot()
>>        → readahead_tree_node_children(node)
>>           → for each child pointer in the internal node:
>>              btrfs_readahead_node_child()
>>               → btrfs_readahead_tree_block()
>>                  → read_extent_buffer_pages_nowait()
>>                     → btrfs_submit_bbio()
>>                        → btrfs_submit_chunk()
>>                           → btrfs_bio_counter_inc_blocked()  ← bio_counter++
>>                           → btrfs_map_block()
>>                              → submit_bio()  ← sent to USB drive
>
> Even if you wait for all bios, it can still cause problems.
>
> As the bio counter is only for the btrfs bio layer, we still have
> btrfs_bio::end_io called after btrfs_bio_counter_dec().
>
> And if the full fs_info has been freed, then at end_bbio_meta_read()
> we can still have problems, as btrfs_validate_extent_buffer() will
> access the eb (bbio->private) and fs_info (eb->fs_info), which
> triggers a use-after-free.
>
> So using that bio counter is not going to solve all problems; it only
> reduces the race window, thus masking the problem.
>
>>
>> After submit_bio() sends the BIO to the USB drive, we continue on to
>> read_one_dev():
>>
>> open_ctree()
>>  → btrfs_read_sys_array()       # OK — sys_chunk_array in superblock is intact
>>  → load_super_root(chunk_root)  # OK — reads root node, passes validation
>>  → btrfs_read_chunk_tree()
>>     → btrfs_for_each_slot()
>>        → readahead_tree_node_children(node)
>>           → bio_counter++ and submit_bio() send the BIO to the USB drive
>>     → read_one_dev()
>>
>> This read_one_dev() will return an error since the leaf block is
>> actually corrupted. open_ctree() then enters the error path and tries
>> to free fs_info.
>>
>> After the USB device finishes the BIO, it will try to decrement the
>> counter, but fs_info is already freed.
>>
>> Any suggestions on this?
>
> The following ideas come to mind, but none seems as simple as your
> current one:
>
> 1) Introduce a dedicated counter for metadata readahead/reads
>    This seems to be the simplest one of all.
>    But its only user would be this error handling, so it may not be
>    worth it.
>
> 2) Disable metadata readahead during open_ctree()
>    Which will delay the mount, especially for a large extent tree
>    without the bgt feature.
>
> 3) Use the buffer_tree xarray to iterate through all ebs
>    Since this is only for error handling of open_ctree(), we're fine
>    to do the full xarray iteration, and wait for any eb that has the
>    EXTENT_BUFFER_READING flag.
>
>    The problem is, we do not have a dedicated tag like
>    PAGECACHE_TAG_(TOWRITE|DIRTY) to easily catch all dirty/writeback
>    ebs. So the only option is to go through each eb and check its
>    flags.
>
>    I think this is the one with minimal impact, but it may cause a
>    much longer runtime in this error handling path.
>
> My personal preference is option 3).

Or a 4th one, which is only an idea I haven't verified yet:

4) Handle errors from invalidate_inode_pages2()
   Currently we just call invalidate_inode_pages2() on the btree inode
   and expect it to return 0.
   But if there is still an eb read pending, it will make that function
   return -EBUSY, as try_release_extent_buffer() will find an eb whose
   refs is not 0, and refuse to release the folio that eb belongs to.

   That should be a good indicator of any pending metadata reads.

   So if invalidate_inode_pages2() returns -EBUSY, we should retry
   until it returns 0.

>
>>
>>
>>>
>>> The wait and counter are all for dev-replace, which doesn't match
>>> your description of generic metadata readahead.
>>>
>>> If you want to wait for all existing metadata reads, I couldn't
>>> find a good helper, so you will need to go through all extent
>>> buffers and wait for the EXTENT_BUFFER_READING flag.
>>>
>>>
>>
>>
>
>

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH] btrfs: wait for in-flight readahead BIOs on open_ctree() error
  2026-03-29 22:21 ` Qu Wenruo
@ 2026-03-30 18:00   ` Teng Liu
  2026-03-30 21:48     ` Qu Wenruo
  0 siblings, 1 reply; 8+ messages in thread
From: Teng Liu @ 2026-03-30 18:00 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, dsterba, clm, linux-kernel

On 2026-03-30 08:51, Qu Wenruo wrote:
>
>
> On 2026/3/30 08:36, Qu Wenruo wrote:
> >
> >
> > Even if you wait for all bios, it can still cause problems.
> >
> > As the bio counter is only for the btrfs bio layer, we still have
> > btrfs_bio::end_io called after btrfs_bio_counter_dec().
> >
> > And if the full fs_info has been freed, then at end_bbio_meta_read()
> > we can still have problems, as btrfs_validate_extent_buffer() will
> > access the eb (bbio->private) and fs_info (eb->fs_info), which
> > triggers a use-after-free.
> >
> > So using that bio counter is not going to solve all problems; it
> > only reduces the race window, thus masking the problem.
> >
> >
> > The following ideas come to mind, but none seems as simple as your
> > current one:
> >
> > 1) Introduce a dedicated counter for metadata readahead/reads
> >    This seems to be the simplest one of all.
> >    But its only user would be this error handling, so it may not be
> >    worth it.
> >
> > 2) Disable metadata readahead during open_ctree()
> >    Which will delay the mount, especially for a large extent tree
> >    without the bgt feature.
> >
> > 3) Use the buffer_tree xarray to iterate through all ebs
> >    Since this is only for error handling of open_ctree(), we're
> >    fine to do the full xarray iteration, and wait for any eb that
> >    has the EXTENT_BUFFER_READING flag.
> >
> >    The problem is, we do not have a dedicated tag like
> >    PAGECACHE_TAG_(TOWRITE|DIRTY) to easily catch all dirty/writeback
> >    ebs. So the only option is to go through each eb and check its
> >    flags.
> >
> >    I think this is the one with minimal impact, but it may cause a
> >    much longer runtime in this error handling path.
> >
> >    My personal preference is option 3).
>
> Or a 4th one, which is only an idea I haven't verified yet:
>
> 4) Handle errors from invalidate_inode_pages2()
>    Currently we just call invalidate_inode_pages2() on the btree
>    inode and expect it to return 0.
>
>    But if there is still an eb read pending, it will make that
>    function return -EBUSY, as try_release_extent_buffer() will find
>    an eb whose refs is not 0, and refuse to release the folio that
>    eb belongs to.
>
>    That should be a good indicator of any pending metadata reads.
>
>    So if invalidate_inode_pages2() returns -EBUSY, we should retry
>    until it returns 0.
>
>

Thanks! Yes, that makes sense: simply waiting on the bio counter
doesn't fix the problem here.

Among the options, I prefer option 3. Although it may be slower, it
only happens in the mount failure path, so the extra cost seems
acceptable.

I am quite new to the btrfs codebase, so I don't know whether
`invalidate_inode_pages2()` would be a reliable solution; maybe I
should start with option 3?

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH] btrfs: wait for in-flight readahead BIOs on open_ctree() error
  2026-03-30 18:00 ` Teng Liu
@ 2026-03-30 21:48   ` Qu Wenruo
  2026-03-30 22:14     ` Qu Wenruo
  0 siblings, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2026-03-30 21:48 UTC (permalink / raw)
  To: Teng Liu; +Cc: linux-btrfs, dsterba, clm, linux-kernel

On 2026/3/31 04:30, Teng Liu wrote:
> On 2026-03-30 08:51, Qu Wenruo wrote:
>>
>>
>> On 2026/3/30 08:36, Qu Wenruo wrote:
>>>
>>>
>>> Even if you wait for all bios, it can still cause problems.
>>>
>>> As the bio counter is only for the btrfs bio layer, we still have
>>> btrfs_bio::end_io called after btrfs_bio_counter_dec().
>>>
>>> And if the full fs_info has been freed, then at end_bbio_meta_read()
>>> we can still have problems, as btrfs_validate_extent_buffer() will
>>> access the eb (bbio->private) and fs_info (eb->fs_info), which
>>> triggers a use-after-free.
>>>
>>> So using that bio counter is not going to solve all problems; it
>>> only reduces the race window, thus masking the problem.
>>>
>>>
>>> The following ideas come to mind, but none seems as simple as your
>>> current one:
>>>
>>> 1) Introduce a dedicated counter for metadata readahead/reads
>>>    This seems to be the simplest one of all.
>>>    But its only user would be this error handling, so it may not be
>>>    worth it.
>>>
>>> 2) Disable metadata readahead during open_ctree()
>>>    Which will delay the mount, especially for a large extent tree
>>>    without the bgt feature.
>>>
>>> 3) Use the buffer_tree xarray to iterate through all ebs
>>>    Since this is only for error handling of open_ctree(), we're
>>>    fine to do the full xarray iteration, and wait for any eb that
>>>    has the EXTENT_BUFFER_READING flag.
>>>
>>>    The problem is, we do not have a dedicated tag like
>>>    PAGECACHE_TAG_(TOWRITE|DIRTY) to easily catch all dirty/writeback
>>>    ebs. So the only option is to go through each eb and check its
>>>    flags.
>>>
>>>    I think this is the one with minimal impact, but it may cause a
>>>    much longer runtime in this error handling path.
>>>
>>>    My personal preference is option 3).
>>
>> Or a 4th one, which is only an idea I haven't verified yet:
>>
>> 4) Handle errors from invalidate_inode_pages2()
>>    Currently we just call invalidate_inode_pages2() on the btree
>>    inode and expect it to return 0.
>>
>>    But if there is still an eb read pending, it will make that
>>    function return -EBUSY, as try_release_extent_buffer() will find
>>    an eb whose refs is not 0, and refuse to release the folio that
>>    eb belongs to.
>>
>>    That should be a good indicator of any pending metadata reads.
>>
>>    So if invalidate_inode_pages2() returns -EBUSY, we should retry
>>    until it returns 0.
>>
>>
>
> Thanks! Yes, that makes sense: simply waiting on the bio counter
> doesn't fix the problem here.
>
> Among the options, I prefer option 3. Although it may be slower, it
> only happens in the mount failure path, so the extra cost seems
> acceptable.

The problem is not limited to mount failure; it affects close_ctree()
too.

It shares the same root problem: we have nothing to trace, nor any way
to wait for a pending metadata read.

>
> I am quite new to the btrfs codebase, so I don't know whether
> `invalidate_inode_pages2()` would be a reliable solution; maybe I
> should start with option 3?

Sure. Although iterating through the xarray may not be that simple
either, as you may still need to look into all kinds of extra
locks/RCU locking etc., and if you apply that to the close_ctree()
call site, it may be a much bigger problem, as we have a lot more ebs
than at mount time.

You can even mix options 3 and 4, e.g. only switch to the xarray
iteration after invalidate_inode_pages2() has failed with -EBUSY.

This should greatly reduce the number of ebs still inside the xarray,
and thus make the iteration much faster.

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH] btrfs: wait for in-flight readahead BIOs on open_ctree() error
  2026-03-30 21:48 ` Qu Wenruo
@ 2026-03-30 22:14   ` Qu Wenruo
  0 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2026-03-30 22:14 UTC (permalink / raw)
  To: Teng Liu; +Cc: linux-btrfs, dsterba, clm, linux-kernel

On 2026/3/31 08:18, Qu Wenruo wrote:
>
>
> On 2026/3/31 04:30, Teng Liu wrote:
[...]
>>>>
>>>> 3) Use the buffer_tree xarray to iterate through all ebs
>>>>    Since this is only for error handling of open_ctree(), we're
>>>>    fine to do the full xarray iteration, and wait for any eb that
>>>>    has the EXTENT_BUFFER_READING flag.
>>>>
>>>>    The problem is, we do not have a dedicated tag like
>>>>    PAGECACHE_TAG_(TOWRITE|DIRTY) to easily catch all
>>>>    dirty/writeback ebs.
>>>>    So the only option is to go through each eb and check its flags.
>>>>
>>>>    I think this is the one with minimal impact, but it may cause a
>>>>    much longer runtime in this error handling path.
>>>>
>>>>    My personal preference is option 3).
>>>
>>> Or a 4th one, which is only an idea I haven't verified yet:
>>>
>>> 4) Handle errors from invalidate_inode_pages2()
>>>    Currently we just call invalidate_inode_pages2() on the btree
>>>    inode and expect it to return 0.
>>>
>>>    But if there is still an eb read pending, it will make that
>>>    function return -EBUSY, as try_release_extent_buffer() will find
>>>    an eb whose refs is not 0, and refuse to release the folio that
>>>    eb belongs to.
>>>
>>>    That should be a good indicator of any pending metadata reads.
>>>
>>>    So if invalidate_inode_pages2() returns -EBUSY, we should retry
>>>    until it returns 0.
>>>
>>>
>>
>> Thanks! Yes, that makes sense: simply waiting on the bio counter
>> doesn't fix the problem here.
>>
>> Among the options, I prefer option 3. Although it may be slower, it
>> only happens in the mount failure path, so the extra cost seems
>> acceptable.
>
> The problem is not limited to mount failure; it affects close_ctree()
> too.
>
> It shares the same root problem: we have nothing to trace, nor any
> way to wait for a pending metadata read.
>
>>
>> I am quite new to the btrfs codebase, so I don't know whether
>> `invalidate_inode_pages2()` would be a reliable solution; maybe I
>> should start with option 3?
>
> Sure. Although iterating through the xarray may not be that simple
> either, as you may still need to look into all kinds of extra
> locks/RCU locking etc., and if you apply that to the close_ctree()
> call site, it may be a much bigger problem, as we have a lot more
> ebs than at mount time.
>
> You can even mix options 3 and 4, e.g. only switch to the xarray
> iteration after invalidate_inode_pages2() has failed with -EBUSY.
>
> This should greatly reduce the number of ebs still inside the
> xarray, and thus make the iteration much faster.
>

On the other hand, option 4 is much easier to implement.

I'm already testing with the patch below applied; so far the fstests
run looks pretty boring.

If you can verify this fix against the original report, it would be
appreciated.

But please note that this is only a PoC, not perfect.

The biggest problem is the busy-loop wait: I hit a bug in an older
version where invalidate_inode_pages2() was called before freeing the
root pointers, so invalidate_inode_pages2() would always return -EBUSY
and take a CPU core to busy-loop forever.

Even though the current version has that problem fixed, it will still
cause the same unnecessary busy loop on very slow storage, which is
far from ideal.

So option 3 is still needed to avoid the busy loop, and it may detect
unexpected dirty ebs better.

I believe the best option is really mixing option 3 and option 4.
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c835141ee384..39420d599822 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3706,7 +3706,11 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	if (fs_info->data_reloc_root)
 		btrfs_drop_and_free_fs_root(fs_info, fs_info->data_reloc_root);
 	free_root_pointers(fs_info, true);
-	invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
+	ret = invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
+	while (ret) {
+		cond_resched();
+		ret = invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
+	}
 
 fail_sb_buffer:
 	btrfs_stop_all_workers(fs_info);
@@ -4434,19 +4438,23 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
 
 	btrfs_put_block_group_cache(fs_info);
 
-	/*
-	 * we must make sure there is not any read request to
-	 * submit after we stopping all workers.
-	 */
-	invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
-	btrfs_stop_all_workers(fs_info);
-
 	/* We shouldn't have any transaction open at this point */
 	warn_about_uncommitted_trans(fs_info);
 
 	clear_bit(BTRFS_FS_OPEN, &fs_info->flags);
 	free_root_pointers(fs_info, true);
 	btrfs_free_fs_roots(fs_info);
 
+	/*
+	 * we must make sure there is not any read request to
+	 * submit after we stopping all workers.
+	 */
+	ret = invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
+	while (ret) {
+		cond_resched();
+		ret = invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
+	}
+	btrfs_stop_all_workers(fs_info);
+
 	/*
 	 * We must free the block groups after dropping the fs_roots as we could
-- 

^ permalink raw reply related	[flat|nested] 8+ messages in thread
end of thread, other threads:[~2026-03-30 22:15 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-29  6:31 [PATCH] btrfs: wait for in-flight readahead BIOs on open_ctree() error Teng Liu
2026-03-29  7:03 ` Qu Wenruo
2026-03-29 17:23   ` Teng Liu
2026-03-29 22:06     ` Qu Wenruo
2026-03-29 22:21       ` Qu Wenruo
2026-03-30 18:00         ` Teng Liu
2026-03-30 21:48           ` Qu Wenruo
2026-03-30 22:14             ` Qu Wenruo