* [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
@ 2022-02-17  9:49 Qu Wenruo
  2022-02-17 11:14 ` Filipe Manana
  2022-04-19 20:34 ` David Sterba
  0 siblings, 2 replies; 9+ messages in thread
From: Qu Wenruo @ 2022-02-17  9:49 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Su Yue

[BUG]
Su reported that, with his VM running on an M1 Mac, there is a high
chance of triggering the following ASSERT() with scrub and balance test
cases:

	ASSERT(cache->start == chunk_offset);
	ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
			  dev_extent_len);

[CAUSE]
The ASSERT() is making sure that the block group cache (@cache) and the
chunk offset from the device extent match.

But the truth is, block group caches and dev extent items are deleted
at different times.

For block group caches and items, after btrfs_remove_block_group() they
will be marked deleted, but the device extent and the chunk item are
still in the dev tree and chunk tree respectively.

The device extents and chunk items are only deleted in
btrfs_remove_chunk(), which is called from either
btrfs_delete_unused_bgs() or btrfs_relocate_chunk().

This timing difference leaves a window where we can have a mismatch
between the block group cache and the device extent tree, triggering
the ASSERT().

[FIX]
- Remove the ASSERT()

- Add a quick exit if our found bg doesn't match @chunk_offset

- Add a comment on the existing checks in scrub_chunk()
  This is the ultimate safety net, as it iterates through all the
  stripes of the found chunk, and only scrubs a stripe if it matches
  both the device and the physical offset.

  So even if somehow we got a dev extent which is no longer owned by
  the target chunk, we will just skip this chunk completely, without
  any consequence.
Reported-by: Su Yue <l@damenly.su>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
v2:
- Add a new quick exit with extra comments

- Add a new comment on the existing safety net in scrub_chunk()
---
 fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 11089568b287..1c3ed4720ddd 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3578,6 +3578,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
 		goto out;
 
 	map = em->map_lookup;
+
+	/*
+	 * Due to the delayed modification of device tree, this chunk
+	 * may not own this dev extent.
+	 *
+	 * So we need to iterate through all stripes, and only scrub
+	 * the stripe which matches both device and physical offset.
+	 */
 	for (i = 0; i < map->num_stripes; ++i) {
 		if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
 		    map->stripes[i].physical == dev_offset) {
@@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		if (!cache)
 			goto skip;
 
+		/*
+		 * Since dev extents modification is delayed, it's possible
+		 * we got a bg which doesn't really own this dev extent.
+		 *
+		 * Here just do a quick skip, scrub_chunk() has a more
+		 * comprehensive check in it.
+		 */
+		if (cache->start != chunk_offset) {
+			btrfs_put_block_group(cache);
+			goto skip;
+		}
+
 		if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
 			spin_lock(&cache->lock);
 			if (!cache->to_copy) {
@@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		dev_replace->item_needs_writeback = 1;
 		up_write(&dev_replace->rwsem);
 
-		ASSERT(cache->start == chunk_offset);
 		ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
 				  dev_extent_len);

-- 
2.35.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread
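[Editor's note: the two layered checks the patch relies on — the quick skip in scrub_enumerate_chunks() and the stripe match in scrub_chunk() — can be modeled with a small standalone sketch. The types below are simplified stand-ins for illustration only, not the real btrfs structures.]

```python
from dataclasses import dataclass

@dataclass
class Stripe:
    dev: str       # stand-in for map->stripes[i].dev->bdev
    physical: int  # stand-in for map->stripes[i].physical

@dataclass
class BlockGroup:
    start: int     # logical start, cache->start in the kernel code

def should_scrub(cache, chunk_offset, stripes, scrub_dev, dev_offset):
    """Model of the two layered checks after the patch: a quick skip
    (scrub_enumerate_chunks), then the stripe match (scrub_chunk)."""
    # Quick exit: the looked-up block group does not start at the
    # chunk offset taken from the device extent item.
    if cache.start != chunk_offset:
        return False
    # Safety net: only scrub if some stripe of the chunk matches both
    # the device and the physical offset of the device extent.
    return any(s.dev == scrub_dev and s.physical == dev_offset
               for s in stripes)

# A stale dev extent whose block group no longer starts at the chunk
# offset is skipped by the first check, regardless of stripes.
stale = BlockGroup(start=1 << 30)
assert not should_scrub(stale, 2 << 30, [], "sda", 0)

# A matching block group is scrubbed only when device and physical
# offset both line up with one of the chunk's stripes.
bg = BlockGroup(start=2 << 30)
stripes = [Stripe("sda", 22020096), Stripe("sdb", 22020096)]
assert should_scrub(bg, 2 << 30, stripes, "sda", 22020096)
assert not should_scrub(bg, 2 << 30, stripes, "sdc", 22020096)
```

Either check alone suffices to skip a mismatched dev extent; the patch keeps both so the cheap one runs first.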
* Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
  2022-02-17  9:49 [PATCH v2] btrfs: remove an ASSERT() which can cause false alert Qu Wenruo
@ 2022-02-17 11:14 ` Filipe Manana
  2022-02-17 11:24   ` Qu Wenruo
  2022-04-19 20:34 ` David Sterba
  1 sibling, 1 reply; 9+ messages in thread
From: Filipe Manana @ 2022-02-17 11:14 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Su Yue

On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
> [BUG]
> Su reported that with his VM running on an M1 mac, it has a high chance
> to trigger the following ASSERT() with scrub and balance test cases:
> 
> ASSERT(cache->start == chunk_offset);
> ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
> 		  dev_extent_len);
> 
> [CAUSE]
> The ASSERT() is making sure that the block group cache (@cache) and the
> chunk offset from the device extent match.
> 
> But the truth is, block group caches and dev extent items are deleted at
> different timing.
> 
> For block group caches and items, after btrfs_remove_block_group() they
> will be marked deleted, but the device extent and the chunk item is
> still in dev tree and chunk tree respectively.

This is not correct, what happens is:

1) btrfs_remove_chunk() will remove the device extent items and the chunk
   item;

2) It then calls btrfs_remove_block_group(). This will not remove the
   extent map if the block group was frozen (due to trim, and scrub
   itself). But it will remove the block group (struct btrfs_block_group)
   from the red black tree of block groups, and mark the block group
   as deleted (set struct btrfs_block_group::removed to 1).

   If the extent map was not removed, meaning the block group is frozen,
   then no one will be able to create a new block group with the same
   logical address before the block group is unfrozen (by someone who is
   holding a reference on it). So a lookup on the red black tree will
   return NULL for the logical address until the block group is unfrozen
   and its logical address is reused for a new block group.

So the ordering of what you are saying is reversed - the device extent
and chunk items are removed before marking the block group as deleted.

> 
> The device extents and chunk items are only deleted in
> btrfs_remove_chunk(), which happens either a btrfs_delete_unused_bgs()
> or btrfs_relocate_chunk().

This is also confusing because it gives a wrong idea of the ordering.
The device extents and chunk items are removed by btrfs_remove_chunk(),
which happens before marking a block group as deleted.

> 
> This timing difference leaves a window where we can have a mismatch
> between block group cache and device extent tree, and triggering the
> ASSERT().

I would like to see more details on this. The sequence of steps between
two tasks that result in this assertion being triggered.

> 
> [FIX]
> 
> - Remove the ASSERT()
> 
> - Add a quick exit if our found bg doesn't match @chunk_offset

This shouldn't happen. I would like to know the sequence of steps
between 2 (or more) tasks that leads to that.

We are getting the block group with btrfs_lookup_block_group() at
scrub_enumerate_chunks(), that calls block_group_cache_tree_search()
with the "contains" argument set to 1, meaning it can return a
block group that contains the given bytenr but does not start at
that bytenr necessarily.

This gives me the idea a small block group was deleted and then a
new one was allocated which starts at a lower logical offset and
includes "chunk_offset" (ends beyond that).

We should probably be using btrfs_lookup_first_block_group() at
scrub_enumerate_chunks(), so it looks for a block group that starts
exactly at the given logical address.

But anyway, as I read this and see the patch's diff, this feels a lot
like a "bail out if something unexpected happens but we don't know
exactly why/how that is possible".

> 
> - Add a comment on the existing checks in scrub_chunk()
> This is the ultimate safenet, as it will iterate through all the stripes
> of the found chunk.
> And only scrub the stripe if it matches both device and physical
> offset.
> 
> So even by some how we got a dev extent which is no longer owned
> by the target chunk, we will just skip this chunk completely, without
> any consequence.
> 
> Reported-by: Su Yue <l@damenly.su>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Changelog:
> v2:
> - Add a new quick exit with extra comments
> 
> - Add a new comment in the existing safenet in scrub_chunk()
> ---
>  fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 11089568b287..1c3ed4720ddd 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3578,6 +3578,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
> 		goto out;
> 
> 	map = em->map_lookup;
> +
> +	/*
> +	 * Due to the delayed modification of device tree, this chunk
> +	 * may not own this dev extent.

This is confusing. What delayed modification, what do you mean by it
exactly? Same below, with more details.

> +	 *
> +	 * So we need to iterate through all stripes, and only scrub
> +	 * the stripe which matches both device and physical offset.
> +	 */
> 	for (i = 0; i < map->num_stripes; ++i) {
> 		if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
> 		    map->stripes[i].physical == dev_offset) {
> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> 		if (!cache)
> 			goto skip;
> 
> +		/*
> +		 * Since dev extents modification is delayed, it's possible
> +		 * we got a bg which doesn't really own this dev extent.

Same as before. Too confusing, what is delayed dev extent modification?

Once added, device extents can only be deleted, they are never modified.
I think you are referring to the deletions and the fact we use the commit
roots to find extents, but that should be a more clear comment, without
such level of ambiguity.

Thanks.

> +		 *
> +		 * Here just do a quick skip, scrub_chunk() has a more
> +		 * comprehensive check in it.
> +		 */
> +		if (cache->start != chunk_offset) {
> +			btrfs_put_block_group(cache);
> +			goto skip;
> +		}
> +
> 		if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
> 			spin_lock(&cache->lock);
> 			if (!cache->to_copy) {
> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> 		dev_replace->item_needs_writeback = 1;
> 		up_write(&dev_replace->rwsem);
> 
> -		ASSERT(cache->start == chunk_offset);
> 		ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
> 				  dev_extent_len);
> 
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread
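[Editor's note: the "contains" lookup semantics Filipe describes — a lookup that returns any block group whose range covers the bytenr, versus one requiring an exact start match — can be illustrated with a small standalone model. This is a simplified sketch with hypothetical data, not the actual red-black tree code in block-group.c.]

```python
import bisect

def lookup_block_group(groups, bytenr, contains):
    """Simplified model of block_group_cache_tree_search().

    groups is a sorted list of (start, end) logical ranges. With
    contains=True (btrfs_lookup_block_group), return the group whose
    range contains bytenr; with contains=False, require the group to
    start exactly at bytenr (btrfs_lookup_first_block_group-like
    exact-match behavior discussed above)."""
    starts = [s for s, _ in groups]
    # Find the last group starting at or before bytenr.
    i = bisect.bisect_right(starts, bytenr) - 1
    if i < 0:
        return None
    start, end = groups[i]
    if contains:
        return (start, end) if start <= bytenr < end else None
    return (start, end) if start == bytenr else None

# Scenario from the review: a small block group starting at 1 GiB was
# deleted, and a new group [0, 3 GiB) was later allocated covering that
# offset. A "contains" lookup for the stale chunk offset now returns a
# group whose start no longer equals the chunk offset, so the old
# ASSERT(cache->start == chunk_offset) would fire.
groups = [(0, 3 << 30)]
assert lookup_block_group(groups, 1 << 30, contains=True) == (0, 3 << 30)
assert lookup_block_group(groups, 1 << 30, contains=False) is None
```

Under this model, the patch's `cache->start != chunk_offset` quick skip is exactly the difference between the two lookup modes, applied after the fact.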
* Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
  2022-02-17 11:14 ` Filipe Manana
@ 2022-02-17 11:24   ` Qu Wenruo
  2022-02-17 12:32     ` Filipe Manana
  0 siblings, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2022-02-17 11:24 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs, Su Yue

On 2022/2/17 19:14, Filipe Manana wrote:
> On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
>> [BUG]
>> Su reported that with his VM running on an M1 mac, it has a high chance
>> to trigger the following ASSERT() with scrub and balance test cases:
>>
>> ASSERT(cache->start == chunk_offset);
>> ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
>> 		  dev_extent_len);
>>
>> [CAUSE]
>> The ASSERT() is making sure that the block group cache (@cache) and the
>> chunk offset from the device extent match.
>>
>> But the truth is, block group caches and dev extent items are deleted at
>> different timing.
>>
>> For block group caches and items, after btrfs_remove_block_group() they
>> will be marked deleted, but the device extent and the chunk item is
>> still in dev tree and chunk tree respectively.
>
> This is not correct, what happens is:
>
> 1) btrfs_remove_chunk() will remove the device extent items and the chunk
>    item;
>
> 2) It then calls btrfs_remove_block_group(). This will not remove the
>    extent map if the block group was frozen (due to trim, and scrub
>    itself). But it will remove the block group (struct btrfs_block_group)
>    from the red black tree of block groups, and mark the block group
>    as deleted (set struct btrfs_block_group::removed to 1).
>
>    If the extent map was not removed, meaning the block group is frozen,
>    then no one will be able to create a new block group with the same
>    logical address before the block group is unfrozen (by someone who is
>    holding a reference on it). So a lookup on the red black tree will
>    return NULL for the logical address until the block group is unfrozen
>    and its logical address is reused for a new block group.
>
> So the ordering of what you are saying is reversed - the device extent
> and chunk items are removed before marking the block group as deleted.
>
>>
>> The device extents and chunk items are only deleted in
>> btrfs_remove_chunk(), which happens either a btrfs_delete_unused_bgs()
>> or btrfs_relocate_chunk().
>
> This is also confusing because it gives a wrong idea of the ordering.
> The device extents and chunk items are removed by btrfs_remove_chunk(),
> which happens before marking a block group as deleted.
>
>>
>> This timing difference leaves a window where we can have a mismatch
>> between block group cache and device extent tree, and triggering the
>> ASSERT().
>
> I would like to see more details on this. The sequence of steps between
> two tasks that result in this assertion being triggered.
>
>>
>> [FIX]
>>
>> - Remove the ASSERT()
>>
>> - Add a quick exit if our found bg doesn't match @chunk_offset
>
> This shouldn't happen. I would like to know the sequence of steps
> between 2 (or more) tasks that leads to that.
>
> We are getting the block group with btrfs_lookup_block_group() at
> scrub_enumerate_chunks(), that calls block_group_cache_tree_search()
> with the "contains" argument set to 1, meaning it can return a
> block group that contains the given bytenr but does not start at
> that bytenr necessarily.
>
> This gives me the idea a small block group was deleted and then a
> new one was allocated which starts at a lower logical offset and
> includes "chunk_offset" (ends beyond that).
>
> We should probably be using btrfs_lookup_first_block_group() at
> scrub_enumerate_chunks(), so it looks for a block group that starts
> exactly at the given logical address.
>
> But anyway, as I read this and see the patch's diff, this feels a lot
> like a "bail out if something unexpected happens but we don't know
> exactly why/how that is possible".

You're completely correct.

Unfortunately I have no way to reproduce the bug on x86_64 VMs, and all
my ARM VMs are too slow to reproduce.

Furthermore, due to some unrelated bugs, v5.17-rc based kernels all
failed to boot inside the VMs on Apple M1.

So unfortunately I don't have extra details for now.

Will find a way to reproduce first, then update the patch with better
comments.

Thanks,
Qu

>
>>
>> - Add a comment on the existing checks in scrub_chunk()
>> This is the ultimate safenet, as it will iterate through all the stripes
>> of the found chunk.
>> And only scrub the stripe if it matches both device and physical
>> offset.
>>
>> So even by some how we got a dev extent which is no longer owned
>> by the target chunk, we will just skip this chunk completely, without
>> any consequence.
>>
>> Reported-by: Su Yue <l@damenly.su>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Changelog:
>> v2:
>> - Add a new quick exit with extra comments
>>
>> - Add a new comment in the existing safenet in scrub_chunk()
>> ---
>> fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
>> 1 file changed, 20 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> index 11089568b287..1c3ed4720ddd 100644
>> --- a/fs/btrfs/scrub.c
>> +++ b/fs/btrfs/scrub.c
>> @@ -3578,6 +3578,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
>> 		goto out;
>>
>> 	map = em->map_lookup;
>> +
>> +	/*
>> +	 * Due to the delayed modification of device tree, this chunk
>> +	 * may not own this dev extent.
>
> This is confusing. What delayed modification, what do you mean by it
> exactly? Same below, with more details.
>
>> +	 *
>> +	 * So we need to iterate through all stripes, and only scrub
>> +	 * the stripe which matches both device and physical offset.
>> +	 */
>> 	for (i = 0; i < map->num_stripes; ++i) {
>> 		if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
>> 		    map->stripes[i].physical == dev_offset) {
>> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>> 		if (!cache)
>> 			goto skip;
>>
>> +		/*
>> +		 * Since dev extents modification is delayed, it's possible
>> +		 * we got a bg which doesn't really own this dev extent.
>
> Same as before. Too confusing, what is delayed dev extent modification?
>
> Once added, device extents can only be deleted, they are never modified.
> I think you are referring to the deletions and the fact we use the commit
> roots to find extents, but that should be a more clear comment, without
> such level of ambiguity.
>
> Thanks.
>
>> +		 *
>> +		 * Here just do a quick skip, scrub_chunk() has a more
>> +		 * comprehensive check in it.
>> +		 */
>> +		if (cache->start != chunk_offset) {
>> +			btrfs_put_block_group(cache);
>> +			goto skip;
>> +		}
>> +
>> 		if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
>> 			spin_lock(&cache->lock);
>> 			if (!cache->to_copy) {
>> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>> 		dev_replace->item_needs_writeback = 1;
>> 		up_write(&dev_replace->rwsem);
>>
>> -		ASSERT(cache->start == chunk_offset);
>> 		ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
>> 				  dev_extent_len);
>>
>> -- 
>> 2.35.1
>>

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
  2022-02-17 11:24 ` Qu Wenruo
@ 2022-02-17 12:32   ` Filipe Manana
  2022-02-17 13:00     ` Qu Wenruo
  0 siblings, 1 reply; 9+ messages in thread
From: Filipe Manana @ 2022-02-17 12:32 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs, Su Yue

On Thu, Feb 17, 2022 at 11:24 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2022/2/17 19:14, Filipe Manana wrote:
> > On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
> >> [BUG]
> >> Su reported that with his VM running on an M1 mac, it has a high chance
> >> to trigger the following ASSERT() with scrub and balance test cases:
> >>
> >> ASSERT(cache->start == chunk_offset);
> >> ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
> >> 		  dev_extent_len);
> >>
> >> [CAUSE]
> >> The ASSERT() is making sure that the block group cache (@cache) and the
> >> chunk offset from the device extent match.
> >>
> >> But the truth is, block group caches and dev extent items are deleted at
> >> different timing.
> >>
> >> For block group caches and items, after btrfs_remove_block_group() they
> >> will be marked deleted, but the device extent and the chunk item is
> >> still in dev tree and chunk tree respectively.
> >
> > This is not correct, what happens is:
> >
> > 1) btrfs_remove_chunk() will remove the device extent items and the chunk
> >    item;
> >
> > 2) It then calls btrfs_remove_block_group(). This will not remove the
> >    extent map if the block group was frozen (due to trim, and scrub
> >    itself). But it will remove the block group (struct btrfs_block_group)
> >    from the red black tree of block groups, and mark the block group
> >    as deleted (set struct btrfs_block_group::removed to 1).
> >
> >    If the extent map was not removed, meaning the block group is frozen,
> >    then no one will be able to create a new block group with the same
> >    logical address before the block group is unfrozen (by someone who is
> >    holding a reference on it). So a lookup on the red black tree will
> >    return NULL for the logical address until the block group is unfrozen
> >    and its logical address is reused for a new block group.
> >
> > So the ordering of what you are saying is reversed - the device extent
> > and chunk items are removed before marking the block group as deleted.
> >
> >>
> >> The device extents and chunk items are only deleted in
> >> btrfs_remove_chunk(), which happens either a btrfs_delete_unused_bgs()
> >> or btrfs_relocate_chunk().
> >
> > This is also confusing because it gives a wrong idea of the ordering.
> > The device extents and chunk items are removed by btrfs_remove_chunk(),
> > which happens before marking a block group as deleted.
> >
> >>
> >> This timing difference leaves a window where we can have a mismatch
> >> between block group cache and device extent tree, and triggering the
> >> ASSERT().
> >
> > I would like to see more details on this. The sequence of steps between
> > two tasks that result in this assertion being triggered.
> >
> >>
> >> [FIX]
> >>
> >> - Remove the ASSERT()
> >>
> >> - Add a quick exit if our found bg doesn't match @chunk_offset
> >
> > This shouldn't happen. I would like to know the sequence of steps
> > between 2 (or more) tasks that leads to that.
> >
> > We are getting the block group with btrfs_lookup_block_group() at
> > scrub_enumerate_chunks(), that calls block_group_cache_tree_search()
> > with the "contains" argument set to 1, meaning it can return a
> > block group that contains the given bytenr but does not start at
> > that bytenr necessarily.
> >
> > This gives me the idea a small block group was deleted and then a
> > new one was allocated which starts at a lower logical offset and
> > includes "chunk_offset" (ends beyond that).
> >
> > We should probably be using btrfs_lookup_first_block_group() at
> > scrub_enumerate_chunks(), so it looks for a block group that starts
> > exactly at the given logical address.
> >
> > But anyway, as I read this and see the patch's diff, this feels a lot
> > like a "bail out if something unexpected happens but we don't know
> > exactly why/how that is possible".
>
> You're completely correct.
>
> Unfortunately I have no way to reproduce the bug on x86_64 VMs, and all
> my ARM VMs are too slow to reproduce.
>
> Furthermore, due to some unrelated bugs, v5.17-rc based kernels all
> failed to boot inside the VMs on Apple M1.
>
> So unfortunately I don't have extra details for now.
>
> Will find a way to reproduce first, then update the patch with better
> comments.

Was it triggered with some specific test case from fstests?

>
> Thanks,
> Qu
>
> >
> >>
> >> - Add a comment on the existing checks in scrub_chunk()
> >> This is the ultimate safenet, as it will iterate through all the stripes
> >> of the found chunk.
> >> And only scrub the stripe if it matches both device and physical
> >> offset.
> >>
> >> So even by some how we got a dev extent which is no longer owned
> >> by the target chunk, we will just skip this chunk completely, without
> >> any consequence.
> >>
> >> Reported-by: Su Yue <l@damenly.su>
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >> Changelog:
> >> v2:
> >> - Add a new quick exit with extra comments
> >>
> >> - Add a new comment in the existing safenet in scrub_chunk()
> >> ---
> >> fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
> >> 1 file changed, 20 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> >> index 11089568b287..1c3ed4720ddd 100644
> >> --- a/fs/btrfs/scrub.c
> >> +++ b/fs/btrfs/scrub.c
> >> @@ -3578,6 +3578,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
> >> 		goto out;
> >>
> >> 	map = em->map_lookup;
> >> +
> >> +	/*
> >> +	 * Due to the delayed modification of device tree, this chunk
> >> +	 * may not own this dev extent.
> >
> > This is confusing. What delayed modification, what do you mean by it
> > exactly? Same below, with more details.
> >
> >> +	 *
> >> +	 * So we need to iterate through all stripes, and only scrub
> >> +	 * the stripe which matches both device and physical offset.
> >> +	 */
> >> 	for (i = 0; i < map->num_stripes; ++i) {
> >> 		if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
> >> 		    map->stripes[i].physical == dev_offset) {
> >> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> >> 		if (!cache)
> >> 			goto skip;
> >>
> >> +		/*
> >> +		 * Since dev extents modification is delayed, it's possible
> >> +		 * we got a bg which doesn't really own this dev extent.
> >
> > Same as before. Too confusing, what is delayed dev extent modification?
> >
> > Once added, device extents can only be deleted, they are never modified.
> > I think you are referring to the deletions and the fact we use the commit
> > roots to find extents, but that should be a more clear comment, without
> > such level of ambiguity.
> >
> > Thanks.
> >
> >> +		 *
> >> +		 * Here just do a quick skip, scrub_chunk() has a more
> >> +		 * comprehensive check in it.
> >> +		 */
> >> +		if (cache->start != chunk_offset) {
> >> +			btrfs_put_block_group(cache);
> >> +			goto skip;
> >> +		}
> >> +
> >> 		if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
> >> 			spin_lock(&cache->lock);
> >> 			if (!cache->to_copy) {
> >> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> >> 		dev_replace->item_needs_writeback = 1;
> >> 		up_write(&dev_replace->rwsem);
> >>
> >> -		ASSERT(cache->start == chunk_offset);
> >> 		ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
> >> 				  dev_extent_len);
> >>
> >> -- 
> >> 2.35.1
> >>

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
  2022-02-17 12:32 ` Filipe Manana
@ 2022-02-17 13:00   ` Qu Wenruo
  2022-02-17 13:02     ` Filipe Manana
  0 siblings, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2022-02-17 13:00 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Qu Wenruo, linux-btrfs, Su Yue

On 2022/2/17 20:32, Filipe Manana wrote:
> On Thu, Feb 17, 2022 at 11:24 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2022/2/17 19:14, Filipe Manana wrote:
>>> On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
>>>> [BUG]
>>>> Su reported that with his VM running on an M1 mac, it has a high chance
>>>> to trigger the following ASSERT() with scrub and balance test cases:
>>>>
>>>> ASSERT(cache->start == chunk_offset);
>>>> ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
>>>> 		  dev_extent_len);
>>>>
>>>> [CAUSE]
>>>> The ASSERT() is making sure that the block group cache (@cache) and the
>>>> chunk offset from the device extent match.
>>>>
>>>> But the truth is, block group caches and dev extent items are deleted at
>>>> different timing.
>>>>
>>>> For block group caches and items, after btrfs_remove_block_group() they
>>>> will be marked deleted, but the device extent and the chunk item is
>>>> still in dev tree and chunk tree respectively.
>>>
>>> This is not correct, what happens is:
>>>
>>> 1) btrfs_remove_chunk() will remove the device extent items and the chunk
>>>    item;
>>>
>>> 2) It then calls btrfs_remove_block_group(). This will not remove the
>>>    extent map if the block group was frozen (due to trim, and scrub
>>>    itself). But it will remove the block group (struct btrfs_block_group)
>>>    from the red black tree of block groups, and mark the block group
>>>    as deleted (set struct btrfs_block_group::removed to 1).
>>>
>>>    If the extent map was not removed, meaning the block group is frozen,
>>>    then no one will be able to create a new block group with the same
>>>    logical address before the block group is unfrozen (by someone who is
>>>    holding a reference on it). So a lookup on the red black tree will
>>>    return NULL for the logical address until the block group is unfrozen
>>>    and its logical address is reused for a new block group.
>>>
>>> So the ordering of what you are saying is reversed - the device extent
>>> and chunk items are removed before marking the block group as deleted.
>>>
>>>>
>>>> The device extents and chunk items are only deleted in
>>>> btrfs_remove_chunk(), which happens either a btrfs_delete_unused_bgs()
>>>> or btrfs_relocate_chunk().
>>>
>>> This is also confusing because it gives a wrong idea of the ordering.
>>> The device extents and chunk items are removed by btrfs_remove_chunk(),
>>> which happens before marking a block group as deleted.
>>>
>>>>
>>>> This timing difference leaves a window where we can have a mismatch
>>>> between block group cache and device extent tree, and triggering the
>>>> ASSERT().
>>>
>>> I would like to see more details on this. The sequence of steps between
>>> two tasks that result in this assertion being triggered.
>>>
>>>>
>>>> [FIX]
>>>>
>>>> - Remove the ASSERT()
>>>>
>>>> - Add a quick exit if our found bg doesn't match @chunk_offset
>>>
>>> This shouldn't happen. I would like to know the sequence of steps
>>> between 2 (or more) tasks that leads to that.
>>>
>>> We are getting the block group with btrfs_lookup_block_group() at
>>> scrub_enumerate_chunks(), that calls block_group_cache_tree_search()
>>> with the "contains" argument set to 1, meaning it can return a
>>> block group that contains the given bytenr but does not start at
>>> that bytenr necessarily.
>>>
>>> This gives me the idea a small block group was deleted and then a
>>> new one was allocated which starts at a lower logical offset and
>>> includes "chunk_offset" (ends beyond that).
>>>
>>> We should probably be using btrfs_lookup_first_block_group() at
>>> scrub_enumerate_chunks(), so it looks for a block group that starts
>>> exactly at the given logical address.
>>>
>>> But anyway, as I read this and see the patch's diff, this feels a lot
>>> like a "bail out if something unexpected happens but we don't know
>>> exactly why/how that is possible".
>>
>> You're completely correct.
>>
>> Unfortunately I have no way to reproduce the bug on x86_64 VMs, and all
>> my ARM VMs are too slow to reproduce.
>>
>> Furthermore, due to some unrelated bugs, v5.17-rc based kernels all
>> failed to boot inside the VMs on Apple M1.
>>
>> So unfortunately I don't have extra details for now.
>>
>> Will find a way to reproduce first, then update the patch with better
>> comments.
>
> Was it triggered with some specific test case from fstests?

Yes, from a regular run on fstests.

But that was on an older branch (some older misc-next, which has the
patch but still based on v5.16).

The v5.17-rc based branch will not boot inside the ARM vm at all, and
no early boot messages after the EFI routine.

Will attack the boot problem first.

Thanks,
Qu

>
>>
>> Thanks,
>> Qu
>>
>>>
>>>>
>>>> - Add a comment on the existing checks in scrub_chunk()
>>>> This is the ultimate safenet, as it will iterate through all the stripes
>>>> of the found chunk.
>>>> And only scrub the stripe if it matches both device and physical
>>>> offset.
>>>>
>>>> So even by some how we got a dev extent which is no longer owned
>>>> by the target chunk, we will just skip this chunk completely, without
>>>> any consequence.
>>>>
>>>> Reported-by: Su Yue <l@damenly.su>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>> Changelog:
>>>> v2:
>>>> - Add a new quick exit with extra comments
>>>>
>>>> - Add a new comment in the existing safenet in scrub_chunk()
>>>> ---
>>>> fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
>>>> 1 file changed, 20 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>>>> index 11089568b287..1c3ed4720ddd 100644
>>>> --- a/fs/btrfs/scrub.c
>>>> +++ b/fs/btrfs/scrub.c
>>>> @@ -3578,6 +3578,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
>>>> 		goto out;
>>>>
>>>> 	map = em->map_lookup;
>>>> +
>>>> +	/*
>>>> +	 * Due to the delayed modification of device tree, this chunk
>>>> +	 * may not own this dev extent.
>>>
>>> This is confusing. What delayed modification, what do you mean by it
>>> exactly? Same below, with more details.
>>>
>>>> +	 *
>>>> +	 * So we need to iterate through all stripes, and only scrub
>>>> +	 * the stripe which matches both device and physical offset.
>>>> +	 */
>>>> 	for (i = 0; i < map->num_stripes; ++i) {
>>>> 		if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
>>>> 		    map->stripes[i].physical == dev_offset) {
>>>> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>>>> 		if (!cache)
>>>> 			goto skip;
>>>>
>>>> +		/*
>>>> +		 * Since dev extents modification is delayed, it's possible
>>>> +		 * we got a bg which doesn't really own this dev extent.
>>>
>>> Same as before. Too confusing, what is delayed dev extent modification?
>>>
>>> Once added, device extents can only be deleted, they are never modified.
>>> I think you are referring to the deletions and the fact we use the commit
>>> roots to find extents, but that should be a more clear comment, without
>>> such level of ambiguity.
>>>
>>> Thanks.
>>>
>>>> +		 *
>>>> +		 * Here just do a quick skip, scrub_chunk() has a more
>>>> +		 * comprehensive check in it.
>>>> +		 */
>>>> +		if (cache->start != chunk_offset) {
>>>> +			btrfs_put_block_group(cache);
>>>> +			goto skip;
>>>> +		}
>>>> +
>>>> 		if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
>>>> 			spin_lock(&cache->lock);
>>>> 			if (!cache->to_copy) {
>>>> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>>>> 		dev_replace->item_needs_writeback = 1;
>>>> 		up_write(&dev_replace->rwsem);
>>>>
>>>> -		ASSERT(cache->start == chunk_offset);
>>>> 		ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
>>>> 				  dev_extent_len);
>>>>
>>>> -- 
>>>> 2.35.1
>>>>

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert 2022-02-17 13:00 ` Qu Wenruo @ 2022-02-17 13:02 ` Filipe Manana 2022-02-17 13:25 ` Qu Wenruo 2022-02-18 2:29 ` Su Yue 0 siblings, 2 replies; 9+ messages in thread From: Filipe Manana @ 2022-02-17 13:02 UTC (permalink / raw) To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs, Su Yue On Thu, Feb 17, 2022 at 1:00 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > > On 2022/2/17 20:32, Filipe Manana wrote: > > On Thu, Feb 17, 2022 at 11:24 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > >> > >> > >> > >> On 2022/2/17 19:14, Filipe Manana wrote: > >>> On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote: > >>>> [BUG] > >>>> Su reported that with his VM running on an M1 mac, it has a high chance > >>>> to trigger the following ASSERT() with scrub and balance test cases: > >>>> > >>>> ASSERT(cache->start == chunk_offset); > >>>> ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset, > >>>> dev_extent_len); > >>>> > >>>> [CAUSE] > >>>> The ASSERT() is making sure that the block group cache (@cache) and the > >>>> chunk offset from the device extent match. > >>>> > >>>> But the truth is, block group caches and dev extent items are deleted at > >>>> different timing. > >>>> > >>>> For block group caches and items, after btrfs_remove_block_group() they > >>>> will be marked deleted, but the device extent and the chunk item is > >>>> still in dev tree and chunk tree respectively. > >>> > >>> This is not correct, what happens is: > >>> > >>> 1) btrfs_remove_chunk() will remove the device extent items and the chunk > >>> item; > >>> > >>> 2) It then calls btrfs_remove_block_group(). This will not remove the > >>> extent map if the block group was frozen (due to trim, and scrub > >>> itself). But it will remove the block group (struct btrfs_block_group) > >>> from the red black tree of block groups, and mark the block group > >>> as deleted (set struct btrfs_block_group::removed to 1). 
> >>> > >>> If the extent map was not removed, meaning the block group is frozen, > >>> then no one will be able to create a new block group with the same logical > >>> address before the block group is unfrozen (by someone who is holding > >>> a reference on it). So a lookup on the red black tree will return NULL > >>> for the logical address until the block group is unfrozen and its > >>> logical address is reused for a new block group. > >>> > >>> So the ordering of what you are saying is reversed - the device extent > >>> and chunk items are removed before marking the block group as deleted. > >>> > >>>> > >>>> The device extents and chunk items are only deleted in > >>>> btrfs_remove_chunk(), which happens either a btrfs_delete_unused_bgs() > >>>> or btrfs_relocate_chunk(). > >>> > >>> This is also confusing because it gives a wrong idea of the ordering. > >>> The device extents and chunk items are removed by btrfs_remove_chunk(), > >>> which happens before marking a block group as deleted. > >>> > >>>> > >>>> This timing difference leaves a window where we can have a mismatch > >>>> between block group cache and device extent tree, and triggering the > >>>> ASSERT(). > >>> > >>> I would like to see more details on this. The sequence of steps between > >>> two tasks that result in this assertion being triggered. > >>> > >>>> > >>>> [FIX] > >>>> > >>>> - Remove the ASSERT() > >>>> > >>>> - Add a quick exit if our found bg doesn't match @chunk_offset > >>> > >>> This shouldn't happen. I would like to know the sequence of steps > >>> between 2 (or more) tasks that leads to that. > >>> > >>> We are getting the block group with btrfs_lookup_block_group() at > >>> scrub_enumerate_chunks(), that calls block_group_cache_tree_search() > >>> with the "contains" argument set to 1, meaning it can return a > >>> block group that contains the given bytenr but does not start at > >>> that bytenr necessarily. 
> >>> > >>> This gives me the idea a small block group was deleted and then a > >>> new one was allocated which starts at a lower logical offset and > >>> includes "chunk_offset" (ends beyond that). > >>> > >>> We should probably be using btrfs_lookup_first_block_group() at > >>> scrub_enumerate_chunks(), so it looks for a block group that starts > >>> exactly at the given logical address. > >>> > >>> But anyway, as I read this and see the patch's diff, this feels a lot > >>> like a "bail out if something unexpected happens but we don't know > >>> exactly why/how that is possible". > >> > >> You're completely correct. > >> > >> Unfortunately I have no way to reproduce the bug on x86_64 VMs, and all > >> my ARM VMs are too slow to reproduce. > >> > >> Furthermore, due to some unrelated bugs, v5.17-rc based kernels all > >> failed to boot inside the VMs on Apple M1. > >> > >> So unfortunately I don't have extra details for now. > >> > >> Will find a way to reproduce first, then update the patch with better > >> comments. > > > > Was it triggered with some specific test case from fstests? > > Yes, from a regular run on fstests. Ok, but which specific test(s)? > > But that was on an older branch (some older misc-next, which has the > patch but still based on v5.16). > > The v5.17-rc based branch will not boot inside the ARM vm at all, and no > early boot messages after the EFI routine. > > Will attack the boot problem first. > > Thanks, > Qu > > > >> > >> Thanks, > >> Qu > >> > >>> > >>>> > >>>> - Add a comment on the existing checks in scrub_chunk() > >>>> This is the ultimate safenet, as it will iterate through all the stripes > >>>> of the found chunk. > >>>> And only scrub the stripe if it matches both device and physical > >>>> offset. > >>>> > >>>> So even by some how we got a dev extent which is no longer owned > >>>> by the target chunk, we will just skip this chunk completely, without > >>>> any consequence. 
> >>>> > >>>> Reported-by: Su Yue <l@damenly.su> > >>>> Signed-off-by: Qu Wenruo <wqu@suse.com> > >>>> --- > >>>> Changelog: > >>>> v2: > >>>> - Add a new quick exit with extra comments > >>>> > >>>> - Add a new comment in the existing safenet in scrub_chunk() > >>>> --- > >>>> fs/btrfs/scrub.c | 21 ++++++++++++++++++++- > >>>> 1 file changed, 20 insertions(+), 1 deletion(-) > >>>> > >>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c > >>>> index 11089568b287..1c3ed4720ddd 100644 > >>>> --- a/fs/btrfs/scrub.c > >>>> +++ b/fs/btrfs/scrub.c > >>>> @@ -3578,6 +3578,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx, > >>>> goto out; > >>>> > >>>> map = em->map_lookup; > >>>> + > >>>> + /* > >>>> + * Due to the delayed modification of device tree, this chunk > >>>> + * may not own this dev extent. > >>> > >>> This is confusing. What delayed modification, what do you mean by it > >>> exactly? Same below, with more details. > >>> > >>>> + * > >>>> + * So we need to iterate through all stripes, and only scrub > >>>> + * the stripe which matches both device and physical offset. > >>>> + */ > >>>> for (i = 0; i < map->num_stripes; ++i) { > >>>> if (map->stripes[i].dev->bdev == scrub_dev->bdev && > >>>> map->stripes[i].physical == dev_offset) { > >>>> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, > >>>> if (!cache) > >>>> goto skip; > >>>> > >>>> + /* > >>>> + * Since dev extents modification is delayed, it's possible > >>>> + * we got a bg which doesn't really own this dev extent. > >>> > >>> Same as before. Too confusing, what is delayed dev extent modification? > >>> > >>> Once added, device extents can only be deleted, they are never modified. > >>> I think you are referring to the deletions and the fact we use the commit > >>> roots to find extents, but that should be a more clear comment, without > >>> such level of ambiguity. > >>> > >>> Thanks. 
> >>> > >>>> + * > >>>> + * Here just do a quick skip, scrub_chunk() has a more > >>>> + * comprehensive check in it. > >>>> + */ > >>>> + if (cache->start != chunk_offset) { > >>>> + btrfs_put_block_group(cache); > >>>> + goto skip; > >>>> + } > >>>> + > >>>> if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) { > >>>> spin_lock(&cache->lock); > >>>> if (!cache->to_copy) { > >>>> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, > >>>> dev_replace->item_needs_writeback = 1; > >>>> up_write(&dev_replace->rwsem); > >>>> > >>>> - ASSERT(cache->start == chunk_offset); > >>>> ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset, > >>>> dev_extent_len); > >>>> > >>>> -- > >>>> 2.35.1 > >>>> ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert 2022-02-17 13:02 ` Filipe Manana @ 2022-02-17 13:25 ` Qu Wenruo 2022-02-18 2:29 ` Su Yue 1 sibling, 0 replies; 9+ messages in thread From: Qu Wenruo @ 2022-02-17 13:25 UTC (permalink / raw) To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs, Su Yue On 2022/2/17 21:02, Filipe Manana wrote: > On Thu, Feb 17, 2022 at 1:00 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >> >> >> >> On 2022/2/17 20:32, Filipe Manana wrote: >>> On Thu, Feb 17, 2022 at 11:24 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >>>> >>>> >>>> >>>> On 2022/2/17 19:14, Filipe Manana wrote: >>>>> On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote: >>>>>> [BUG] >>>>>> Su reported that with his VM running on an M1 mac, it has a high chance >>>>>> to trigger the following ASSERT() with scrub and balance test cases: >>>>>> >>>>>> ASSERT(cache->start == chunk_offset); >>>>>> ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset, >>>>>> dev_extent_len); >>>>>> >>>>>> [CAUSE] >>>>>> The ASSERT() is making sure that the block group cache (@cache) and the >>>>>> chunk offset from the device extent match. >>>>>> >>>>>> But the truth is, block group caches and dev extent items are deleted at >>>>>> different timing. >>>>>> >>>>>> For block group caches and items, after btrfs_remove_block_group() they >>>>>> will be marked deleted, but the device extent and the chunk item is >>>>>> still in dev tree and chunk tree respectively. >>>>> >>>>> This is not correct, what happens is: >>>>> >>>>> 1) btrfs_remove_chunk() will remove the device extent items and the chunk >>>>> item; >>>>> >>>>> 2) It then calls btrfs_remove_block_group(). This will not remove the >>>>> extent map if the block group was frozen (due to trim, and scrub >>>>> itself). But it will remove the block group (struct btrfs_block_group) >>>>> from the red black tree of block groups, and mark the block group >>>>> as deleted (set struct btrfs_block_group::removed to 1). 
>>>>> >>>>> If the extent map was not removed, meaning the block group is frozen, >>>>> then no one will be able to create a new block group with the same logical >>>>> address before the block group is unfrozen (by someone who is holding >>>>> a reference on it). So a lookup on the red black tree will return NULL >>>>> for the logical address until the block group is unfrozen and its >>>>> logical address is reused for a new block group. >>>>> >>>>> So the ordering of what you are saying is reversed - the device extent >>>>> and chunk items are removed before marking the block group as deleted. >>>>> >>>>>> >>>>>> The device extents and chunk items are only deleted in >>>>>> btrfs_remove_chunk(), which happens either a btrfs_delete_unused_bgs() >>>>>> or btrfs_relocate_chunk(). >>>>> >>>>> This is also confusing because it gives a wrong idea of the ordering. >>>>> The device extents and chunk items are removed by btrfs_remove_chunk(), >>>>> which happens before marking a block group as deleted. >>>>> >>>>>> >>>>>> This timing difference leaves a window where we can have a mismatch >>>>>> between block group cache and device extent tree, and triggering the >>>>>> ASSERT(). >>>>> >>>>> I would like to see more details on this. The sequence of steps between >>>>> two tasks that result in this assertion being triggered. >>>>> >>>>>> >>>>>> [FIX] >>>>>> >>>>>> - Remove the ASSERT() >>>>>> >>>>>> - Add a quick exit if our found bg doesn't match @chunk_offset >>>>> >>>>> This shouldn't happen. I would like to know the sequence of steps >>>>> between 2 (or more) tasks that leads to that. >>>>> >>>>> We are getting the block group with btrfs_lookup_block_group() at >>>>> scrub_enumerate_chunks(), that calls block_group_cache_tree_search() >>>>> with the "contains" argument set to 1, meaning it can return a >>>>> block group that contains the given bytenr but does not start at >>>>> that bytenr necessarily. 
>>>>> >>>>> This gives me the idea a small block group was deleted and then a >>>>> new one was allocated which starts at a lower logical offset and >>>>> includes "chunk_offset" (ends beyond that). >>>>> >>>>> We should probably be using btrfs_lookup_first_block_group() at >>>>> scrub_enumerate_chunks(), so it looks for a block group that starts >>>>> exactly at the given logical address. >>>>> >>>>> But anyway, as I read this and see the patch's diff, this feels a lot >>>>> like a "bail out if something unexpected happens but we don't know >>>>> exactly why/how that is possible". >>>> >>>> You're completely correct. >>>> >>>> Unfortunately I have no way to reproduce the bug on x86_64 VMs, and all >>>> my ARM VMs are too slow to reproduce. >>>> >>>> Furthermore, due to some unrelated bugs, v5.17-rc based kernels all >>>> failed to boot inside the VMs on Apple M1. >>>> >>>> So unfortunately I don't have extra details for now. >>>> >>>> Will find a way to reproduce first, then update the patch with better >>>> comments. >>> >>> Was it triggered with some specific test case from fstests? >> >> Yes, from a regular run on fstests. > > Ok, but which specific test(s)? No mention, but I guess it's inside scrub + replace group. > >> >> But that was on an older branch (some older misc-next, which has the >> patch but still based on v5.16). >> >> The v5.17-rc based branch will not boot inside the ARM vm at all, and no >> early boot messages after the EFI routine. >> >> Will attack the boot problem first. >> >> Thanks, >> Qu >>> >>>> >>>> Thanks, >>>> Qu >>>> >>>>> >>>>>> >>>>>> - Add a comment on the existing checks in scrub_chunk() >>>>>> This is the ultimate safenet, as it will iterate through all the stripes >>>>>> of the found chunk. >>>>>> And only scrub the stripe if it matches both device and physical >>>>>> offset. 
>>>>>> >>>>>> So even by some how we got a dev extent which is no longer owned >>>>>> by the target chunk, we will just skip this chunk completely, without >>>>>> any consequence. >>>>>> >>>>>> Reported-by: Su Yue <l@damenly.su> >>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com> >>>>>> --- >>>>>> Changelog: >>>>>> v2: >>>>>> - Add a new quick exit with extra comments >>>>>> >>>>>> - Add a new comment in the existing safenet in scrub_chunk() >>>>>> --- >>>>>> fs/btrfs/scrub.c | 21 ++++++++++++++++++++- >>>>>> 1 file changed, 20 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c >>>>>> index 11089568b287..1c3ed4720ddd 100644 >>>>>> --- a/fs/btrfs/scrub.c >>>>>> +++ b/fs/btrfs/scrub.c >>>>>> @@ -3578,6 +3578,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx, >>>>>> goto out; >>>>>> >>>>>> map = em->map_lookup; >>>>>> + >>>>>> + /* >>>>>> + * Due to the delayed modification of device tree, this chunk >>>>>> + * may not own this dev extent. >>>>> >>>>> This is confusing. What delayed modification, what do you mean by it >>>>> exactly? Same below, with more details. >>>>> >>>>>> + * >>>>>> + * So we need to iterate through all stripes, and only scrub >>>>>> + * the stripe which matches both device and physical offset. >>>>>> + */ >>>>>> for (i = 0; i < map->num_stripes; ++i) { >>>>>> if (map->stripes[i].dev->bdev == scrub_dev->bdev && >>>>>> map->stripes[i].physical == dev_offset) { >>>>>> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, >>>>>> if (!cache) >>>>>> goto skip; >>>>>> >>>>>> + /* >>>>>> + * Since dev extents modification is delayed, it's possible >>>>>> + * we got a bg which doesn't really own this dev extent. >>>>> >>>>> Same as before. Too confusing, what is delayed dev extent modification? >>>>> >>>>> Once added, device extents can only be deleted, they are never modified. 
>>>>> I think you are referring to the deletions and the fact we use the commit >>>>> roots to find extents, but that should be a more clear comment, without >>>>> such level of ambiguity. >>>>> >>>>> Thanks. >>>>> >>>>>> + * >>>>>> + * Here just do a quick skip, scrub_chunk() has a more >>>>>> + * comprehensive check in it. >>>>>> + */ >>>>>> + if (cache->start != chunk_offset) { >>>>>> + btrfs_put_block_group(cache); >>>>>> + goto skip; >>>>>> + } >>>>>> + >>>>>> if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) { >>>>>> spin_lock(&cache->lock); >>>>>> if (!cache->to_copy) { >>>>>> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, >>>>>> dev_replace->item_needs_writeback = 1; >>>>>> up_write(&dev_replace->rwsem); >>>>>> >>>>>> - ASSERT(cache->start == chunk_offset); >>>>>> ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset, >>>>>> dev_extent_len); >>>>>> >>>>>> -- >>>>>> 2.35.1 >>>>>> > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert 2022-02-17 13:02 ` Filipe Manana 2022-02-17 13:25 ` Qu Wenruo @ 2022-02-18 2:29 ` Su Yue 1 sibling, 0 replies; 9+ messages in thread From: Su Yue @ 2022-02-18 2:29 UTC (permalink / raw) To: Filipe Manana; +Cc: Qu Wenruo, Qu Wenruo, linux-btrfs On Thu 17 Feb 2022 at 13:02, Filipe Manana <fdmanana@kernel.org> wrote: > On Thu, Feb 17, 2022 at 1:00 PM Qu Wenruo > <quwenruo.btrfs@gmx.com> wrote: >> >> >> >> On 2022/2/17 20:32, Filipe Manana wrote: >> > On Thu, Feb 17, 2022 at 11:24 AM Qu Wenruo >> > <quwenruo.btrfs@gmx.com> wrote: >> >> >> >> >> >> >> >> On 2022/2/17 19:14, Filipe Manana wrote: >> >>> On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote: >> >>>> [BUG] >> >>>> Su reported that with his VM running on an M1 mac, it has >> >>>> a high chance >> >>>> to trigger the following ASSERT() with scrub and balance >> >>>> test cases: >> >>>> >> >>>> ASSERT(cache->start == chunk_offset); >> >>>> ret = scrub_chunk(sctx, cache, scrub_dev, >> >>>> found_key.offset, >> >>>> dev_extent_len); >> >>>> >> >>>> [CAUSE] >> >>>> The ASSERT() is making sure that the block group cache >> >>>> (@cache) and the >> >>>> chunk offset from the device extent match. >> >>>> >> >>>> But the truth is, block group caches and dev extent items >> >>>> are deleted at >> >>>> different timing. >> >>>> >> >>>> For block group caches and items, after >> >>>> btrfs_remove_block_group() they >> >>>> will be marked deleted, but the device extent and the >> >>>> chunk item is >> >>>> still in dev tree and chunk tree respectively. >> >>> >> >>> This is not correct, what happens is: >> >>> >> >>> 1) btrfs_remove_chunk() will remove the device extent items >> >>> and the chunk >> >>> item; >> >>> >> >>> 2) It then calls btrfs_remove_block_group(). This will not >> >>> remove the >> >>> extent map if the block group was frozen (due to trim, >> >>> and scrub >> >>> itself). 
But it will remove the block group (struct >> >>> btrfs_block_group) >> >>> from the red black tree of block groups, and mark the >> >>> block group >> >>> as deleted (set struct btrfs_block_group::removed to >> >>> 1). >> >>> >> >>> If the extent map was not removed, meaning the block >> >>> group is frozen, >> >>> then no one will be able to create a new block group >> >>> with the same logical >> >>> address before the block group is unfrozen (by someone >> >>> who is holding >> >>> a reference on it). So a lookup on the red black tree >> >>> will return NULL >> >>> for the logical address until the block group is >> >>> unfrozen and its >> >>> logical address is reused for a new block group. >> >>> >> >>> So the ordering of what you are saying is reversed - the >> >>> device extent >> >>> and chunk items are removed before marking the block group >> >>> as deleted. >> >>> >> >>>> >> >>>> The device extents and chunk items are only deleted in >> >>>> btrfs_remove_chunk(), which happens either a >> >>>> btrfs_delete_unused_bgs() >> >>>> or btrfs_relocate_chunk(). >> >>> >> >>> This is also confusing because it gives a wrong idea of the >> >>> ordering. >> >>> The device extents and chunk items are removed by >> >>> btrfs_remove_chunk(), >> >>> which happens before marking a block group as deleted. >> >>> >> >>>> >> >>>> This timing difference leaves a window where we can have a >> >>>> mismatch >> >>>> between block group cache and device extent tree, and >> >>>> triggering the >> >>>> ASSERT(). >> >>> >> >>> I would like to see more details on this. The sequence of >> >>> steps between >> >>> two tasks that result in this assertion being triggered. >> >>> >> >>>> >> >>>> [FIX] >> >>>> >> >>>> - Remove the ASSERT() >> >>>> >> >>>> - Add a quick exit if our found bg doesn't match >> >>>> @chunk_offset >> >>> >> >>> This shouldn't happen. I would like to know the sequence of >> >>> steps >> >>> between 2 (or more) tasks that leads to that. 
>> >>> >> >>> We are getting the block group with >> >>> btrfs_lookup_block_group() at >> >>> scrub_enumerate_chunks(), that calls >> >>> block_group_cache_tree_search() >> >>> with the "contains" argument set to 1, meaning it can >> >>> return a >> >>> block group that contains the given bytenr but does not >> >>> start at >> >>> that bytenr necessarily. >> >>> >> >>> This gives me the idea a small block group was deleted and >> >>> then a >> >>> new one was allocated which starts at a lower logical >> >>> offset and >> >>> includes "chunk_offset" (ends beyond that). >> >>> >> >>> We should probably be using >> >>> btrfs_lookup_first_block_group() at >> >>> scrub_enumerate_chunks(), so it looks for a block group >> >>> that starts >> >>> exactly at the given logical address. >> >>> >> >>> But anyway, as I read this and see the patch's diff, this >> >>> feels a lot >> >>> like a "bail out if something unexpected happens but we >> >>> don't know >> >>> exactly why/how that is possible". >> >> >> >> You're completely correct. >> >> >> >> Unfortunately I have no way to reproduce the bug on x86_64 >> >> VMs, and all >> >> my ARM VMs are too slow to reproduce. >> >> >> >> Furthermore, due to some unrelated bugs, v5.17-rc based >> >> kernels all >> >> failed to boot inside the VMs on Apple M1. >> >> >> >> So unfortunately I don't have extra details for now. >> >> >> >> Will find a way to reproduce first, then update the patch >> >> with better >> >> comments. >> > >> > Was it triggered with some specific test case from fstests? >> >> Yes, from a regular run on fstests. > > Ok, but which specific test(s)? > It is btrfs/074. It was found during early 16k subpage support development of Qu. Unfortunately dmesg is lost and it can't be reproduced from old branch fetched from `git reflog`. I'd report back if the issue occurs again. -- Su >> >> But that was on an older branch (some older misc-next, which >> has the >> patch but still based on v5.16). 
>> >> The v5.17-rc based branch will not boot inside the ARM vm at >> all, and no >> early boot messages after the EFI routine. >> >> Will attack the boot problem first. >> >> Thanks, >> Qu >> > >> >> >> >> Thanks, >> >> Qu >> >> >> >>> >> >>>> >> >>>> - Add a comment on the existing checks in scrub_chunk() >> >>>> This is the ultimate safenet, as it will iterate >> >>>> through all the stripes >> >>>> of the found chunk. >> >>>> And only scrub the stripe if it matches both device >> >>>> and physical >> >>>> offset. >> >>>> >> >>>> So even by some how we got a dev extent which is no >> >>>> longer owned >> >>>> by the target chunk, we will just skip this chunk >> >>>> completely, without >> >>>> any consequence. >> >>>> >> >>>> Reported-by: Su Yue <l@damenly.su> >> >>>> Signed-off-by: Qu Wenruo <wqu@suse.com> >> >>>> --- >> >>>> Changelog: >> >>>> v2: >> >>>> - Add a new quick exit with extra comments >> >>>> >> >>>> - Add a new comment in the existing safenet in >> >>>> scrub_chunk() >> >>>> --- >> >>>> fs/btrfs/scrub.c | 21 ++++++++++++++++++++- >> >>>> 1 file changed, 20 insertions(+), 1 deletion(-) >> >>>> >> >>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c >> >>>> index 11089568b287..1c3ed4720ddd 100644 >> >>>> --- a/fs/btrfs/scrub.c >> >>>> +++ b/fs/btrfs/scrub.c >> >>>> @@ -3578,6 +3578,14 @@ static noinline_for_stack int >> >>>> scrub_chunk(struct scrub_ctx *sctx, >> >>>> goto out; >> >>>> >> >>>> map = em->map_lookup; >> >>>> + >> >>>> + /* >> >>>> + * Due to the delayed modification of device tree, >> >>>> this chunk >> >>>> + * may not own this dev extent. >> >>> >> >>> This is confusing. What delayed modification, what do you >> >>> mean by it >> >>> exactly? Same below, with more details. >> >>> >> >>>> + * >> >>>> + * So we need to iterate through all stripes, and >> >>>> only scrub >> >>>> + * the stripe which matches both device and physical >> >>>> offset. 
>> >>>> + */ >> >>>> for (i = 0; i < map->num_stripes; ++i) { >> >>>> if (map->stripes[i].dev->bdev == >> >>>> scrub_dev->bdev && >> >>>> map->stripes[i].physical == dev_offset) >> >>>> { >> >>>> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct >> >>>> scrub_ctx *sctx, >> >>>> if (!cache) >> >>>> goto skip; >> >>>> >> >>>> + /* >> >>>> + * Since dev extents modification is delayed, >> >>>> it's possible >> >>>> + * we got a bg which doesn't really own this >> >>>> dev extent. >> >>> >> >>> Same as before. Too confusing, what is delayed dev extent >> >>> modification? >> >>> >> >>> Once added, device extents can only be deleted, they are >> >>> never modified. >> >>> I think you are referring to the deletions and the fact we >> >>> use the commit >> >>> roots to find extents, but that should be a more clear >> >>> comment, without >> >>> such level of ambiguity. >> >>> >> >>> Thanks. >> >>> >> >>>> + * >> >>>> + * Here just do a quick skip, scrub_chunk() >> >>>> has a more >> >>>> + * comprehensive check in it. >> >>>> + */ >> >>>> + if (cache->start != chunk_offset) { >> >>>> + btrfs_put_block_group(cache); >> >>>> + goto skip; >> >>>> + } >> >>>> + >> >>>> if (sctx->is_dev_replace && >> >>>> btrfs_is_zoned(fs_info)) { >> >>>> spin_lock(&cache->lock); >> >>>> if (!cache->to_copy) { >> >>>> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct >> >>>> scrub_ctx *sctx, >> >>>> dev_replace->item_needs_writeback = 1; >> >>>> up_write(&dev_replace->rwsem); >> >>>> >> >>>> - ASSERT(cache->start == chunk_offset); >> >>>> ret = scrub_chunk(sctx, cache, scrub_dev, >> >>>> found_key.offset, >> >>>> dev_extent_len); >> >>>> >> >>>> -- >> >>>> 2.35.1 >> >>>> ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert 2022-02-17 9:49 [PATCH v2] btrfs: remove an ASSERT() which can cause false alert Qu Wenruo 2022-02-17 11:14 ` Filipe Manana @ 2022-04-19 20:34 ` David Sterba 1 sibling, 0 replies; 9+ messages in thread From: David Sterba @ 2022-04-19 20:34 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs, Su Yue On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote: > [BUG] > Su reported that with his VM running on an M1 mac, it has a high chance > to trigger the following ASSERT() with scrub and balance test cases: > > ASSERT(cache->start == chunk_offset); > ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset, > dev_extent_len); > > [CAUSE] > The ASSERT() is making sure that the block group cache (@cache) and the > chunk offset from the device extent match. > > But the truth is, block group caches and dev extent items are deleted at > different timing. > > For block group caches and items, after btrfs_remove_block_group() they > will be marked deleted, but the device extent and the chunk item is > still in dev tree and chunk tree respectively. > > The device extents and chunk items are only deleted in > btrfs_remove_chunk(), which happens either a btrfs_delete_unused_bgs() > or btrfs_relocate_chunk(). > > This timing difference leaves a window where we can have a mismatch > between block group cache and device extent tree, and triggering the > ASSERT(). > > [FIX] > > - Remove the ASSERT() > > - Add a quick exit if our found bg doesn't match @chunk_offset > > - Add a comment on the existing checks in scrub_chunk() > This is the ultimate safenet, as it will iterate through all the stripes > of the found chunk. > And only scrub the stripe if it matches both device and physical > offset. > > So even by some how we got a dev extent which is no longer owned > by the target chunk, we will just skip this chunk completely, without > any consequence. 
> > Reported-by: Su Yue <l@damenly.su> > Signed-off-by: Qu Wenruo <wqu@suse.com> Filipe sent a fix that's slightly different; as you haven't sent an update, I'll take his patch. ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2022-04-19 20:38 UTC | newest] Thread overview: 9+ messages -- 2022-02-17 9:49 [PATCH v2] btrfs: remove an ASSERT() which can cause false alert Qu Wenruo 2022-02-17 11:14 ` Filipe Manana 2022-02-17 11:24 ` Qu Wenruo 2022-02-17 12:32 ` Filipe Manana 2022-02-17 13:00 ` Qu Wenruo 2022-02-17 13:02 ` Filipe Manana 2022-02-17 13:25 ` Qu Wenruo 2022-02-18 2:29 ` Su Yue 2022-04-19 20:34 ` David Sterba