* [PATCH v4 1/7] btrfs: replace stripe extents
2024-07-05 15:13 [PATCH v4 0/7] btrfs: rst: updates for RAID stripe tree Johannes Thumshirn
@ 2024-07-05 15:13 ` Johannes Thumshirn
2024-07-05 23:19 ` Qu Wenruo
2024-07-05 15:13 ` [PATCH v4 2/7] btrfs: rst: don't print tree dump in case lookup fails Johannes Thumshirn
` (5 subsequent siblings)
6 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2024-07-05 15:13 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: linux-btrfs, linux-kernel, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
If we can't insert a stripe extent in the RAID stripe tree, because
the key that points to the specific position in the stripe tree is
already existing, we have to remove the item and then replace it by a
new item.
This can happen for example on device replace operations.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/raid-stripe-tree.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index e6f7a234b8f6..3b32e96c33fc 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -73,6 +73,39 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
return ret;
}
+static int replace_raid_extent_item(struct btrfs_trans_handle *trans,
+ struct btrfs_key *key,
+ struct btrfs_stripe_extent *stripe_extent,
+ const size_t item_size)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct btrfs_root *stripe_root = fs_info->stripe_root;
+ struct btrfs_path *path;
+ int ret;
+
+ path = btrfs_alloc_path();
+ if (!path)
+ return -ENOMEM;
+
+ ret = btrfs_search_slot(trans, stripe_root, key, path, -1, 1);
+ if (ret)
+ goto err;
+
+ ret = btrfs_del_item(trans, stripe_root, path);
+ if (ret) {
+ ret = (ret == 1) ? -ENOENT : ret;
+ goto err;
+ }
+
+ btrfs_free_path(path);
+
+ return btrfs_insert_item(trans, stripe_root, key, stripe_extent,
+ item_size);
+ err:
+ btrfs_free_path(path);
+ return ret;
+}
+
static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
struct btrfs_io_context *bioc)
{
@@ -112,6 +145,9 @@ static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
item_size);
+ if (ret == -EEXIST)
+ ret = replace_raid_extent_item(trans, &stripe_key,
+ stripe_extent, item_size);
if (ret)
btrfs_abort_transaction(trans, ret);
--
2.43.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* Re: [PATCH v4 1/7] btrfs: replace stripe extents
2024-07-05 15:13 ` [PATCH v4 1/7] btrfs: replace stripe extents Johannes Thumshirn
@ 2024-07-05 23:19 ` Qu Wenruo
2024-07-08 11:43 ` Johannes Thumshirn
0 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2024-07-05 23:19 UTC (permalink / raw)
To: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba
Cc: linux-btrfs, linux-kernel, Johannes Thumshirn
在 2024/7/6 00:43, Johannes Thumshirn 写道:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> If we can't insert a stripe extent in the RAID stripe tree, because
> the key that points to the specific position in the stripe tree is
> already existing, we have to remove the item and then replace it by a
> new item.
>
> This can happen for example on device replace operations.
In that case, can we just modify the targeted dev stripe?
Or do we have other call sites that can lead to such conflicts?
As I'm not that confident if such replace behavior would mask some real
problems.
Thanks,
Qu
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/raid-stripe-tree.c | 36 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 36 insertions(+)
>
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index e6f7a234b8f6..3b32e96c33fc 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -73,6 +73,39 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
> return ret;
> }
>
> +static int replace_raid_extent_item(struct btrfs_trans_handle *trans,
> + struct btrfs_key *key,
> + struct btrfs_stripe_extent *stripe_extent,
> + const size_t item_size)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_root *stripe_root = fs_info->stripe_root;
> + struct btrfs_path *path;
> + int ret;
> +
> + path = btrfs_alloc_path();
> + if (!path)
> + return -ENOMEM;
> +
> + ret = btrfs_search_slot(trans, stripe_root, key, path, -1, 1);
> + if (ret)
> + goto err;
> +
> + ret = btrfs_del_item(trans, stripe_root, path);
> + if (ret) {
> + ret = (ret == 1) ? -ENOENT : ret;
> + goto err;
> + }
> +
> + btrfs_free_path(path);
> +
> + return btrfs_insert_item(trans, stripe_root, key, stripe_extent,
> + item_size);
> + err:
> + btrfs_free_path(path);
> + return ret;
> +}
> +
> static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
> struct btrfs_io_context *bioc)
> {
> @@ -112,6 +145,9 @@ static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
>
> ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
> item_size);
> + if (ret == -EEXIST)
> + ret = replace_raid_extent_item(trans, &stripe_key,
> + stripe_extent, item_size);
> if (ret)
> btrfs_abort_transaction(trans, ret);
>
>
^ permalink raw reply [flat|nested] 23+ messages in thread* Re: [PATCH v4 1/7] btrfs: replace stripe extents
2024-07-05 23:19 ` Qu Wenruo
@ 2024-07-08 11:43 ` Johannes Thumshirn
2024-07-08 22:14 ` Qu Wenruo
2024-07-09 5:36 ` Qu Wenruo
0 siblings, 2 replies; 23+ messages in thread
From: Johannes Thumshirn @ 2024-07-08 11:43 UTC (permalink / raw)
To: Qu Wenruo, Johannes Thumshirn, Chris Mason, Josef Bacik,
David Sterba
Cc: linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 06.07.24 01:19, Qu Wenruo wrote:
>
>
> 在 2024/7/6 00:43, Johannes Thumshirn 写道:
>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> If we can't insert a stripe extent in the RAID stripe tree, because
>> the key that points to the specific position in the stripe tree is
>> already existing, we have to remove the item and then replace it by a
>> new item.
>>
>> This can happen for example on device replace operations.
>
> In that case, can we just modify the targeted dev stripe?
>
> Or do we have other call sites that can lead to such conflicts?
>
> As I'm not that confident if such replace behavior would mask some real
> problems.
I've just tested the following patch and it looks like it's working:
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index e6f7a234b8f6..7bfd8654c110 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -73,6 +73,53 @@ int btrfs_delete_raid_extent(struct
btrfs_trans_handle *trans, u64 start, u64 le
return ret;
}
+static int update_raid_extent_item(struct btrfs_trans_handle *trans,
+ struct btrfs_key *key,
+ struct btrfs_io_context *bioc)
+{
+ struct btrfs_path *path;
+ struct extent_buffer *leaf;
+ struct btrfs_stripe_extent *stripe_extent;
+ int num_stripes;
+ int ret;
+ int slot;
+
+ path = btrfs_alloc_path();
+ if (!path)
+ return -ENOMEM;
+
+ ret = btrfs_search_slot(trans, trans->fs_info->stripe_root, key, path,
+ 0, 1);
+ if (ret)
+ return ret == 1 ? ret : -EINVAL;
+
+ leaf = path->nodes[0];
+ slot = path->slots[0];
+
+ btrfs_item_key_to_cpu(leaf, key, slot);
+ num_stripes = btrfs_num_raid_stripes(btrfs_item_size(leaf, slot));
+ stripe_extent = btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
+
+ for (int i = 0; i < num_stripes; i++) {
+ u64 devid = bioc->stripes[i].dev->devid;
+ u64 physical = bioc->stripes[i].physical;
+ u64 length = bioc->stripes[i].length;
+ struct btrfs_raid_stride *raid_stride =
+ &stripe_extent->strides[i];
+
+ if (length == 0)
+ length = bioc->size;
+
+ btrfs_set_raid_stride_devid(leaf, raid_stride, devid);
+ btrfs_set_raid_stride_physical(leaf, raid_stride, physical);
+ }
+
+ btrfs_mark_buffer_dirty(trans, leaf);
+ btrfs_free_path(path);
+
+ return ret;
+}
+
static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
struct btrfs_io_context *bioc)
{
@@ -112,6 +159,8 @@ static int btrfs_insert_one_raid_extent(struct
btrfs_trans_handle *trans,
ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
item_size);
+ if (ret == -EEXIST)
+ ret = update_raid_extent_item(trans, &stripe_key, bioc);
if (ret)
btrfs_abort_transaction(trans, ret);
^ permalink raw reply related [flat|nested] 23+ messages in thread* Re: [PATCH v4 1/7] btrfs: replace stripe extents
2024-07-08 11:43 ` Johannes Thumshirn
@ 2024-07-08 22:14 ` Qu Wenruo
2024-07-09 5:49 ` Johannes Thumshirn
2024-07-09 5:36 ` Qu Wenruo
1 sibling, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2024-07-08 22:14 UTC (permalink / raw)
To: Johannes Thumshirn, Johannes Thumshirn, Chris Mason, Josef Bacik,
David Sterba
Cc: linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
在 2024/7/8 21:13, Johannes Thumshirn 写道:
> On 06.07.24 01:19, Qu Wenruo wrote:
>>
>>
>> 在 2024/7/6 00:43, Johannes Thumshirn 写道:
>>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>
>>> If we can't insert a stripe extent in the RAID stripe tree, because
>>> the key that points to the specific position in the stripe tree is
>>> already existing, we have to remove the item and then replace it by a
>>> new item.
>>>
>>> This can happen for example on device replace operations.
>>
>> In that case, can we just modify the targeted dev stripe?
>>
>> Or do we have other call sites that can lead to such conflicts?
>>
>> As I'm not that confident if such replace behavior would mask some real
>> problems.
>
> I've just tested the following patch and it looks like it's working:
>
>
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index e6f7a234b8f6..7bfd8654c110 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -73,6 +73,53 @@ int btrfs_delete_raid_extent(struct
> btrfs_trans_handle *trans, u64 start, u64 le
> return ret;
> }
>
> +static int update_raid_extent_item(struct btrfs_trans_handle *trans,
> + struct btrfs_key *key,
> + struct btrfs_io_context *bioc)
> +{
> + struct btrfs_path *path;
> + struct extent_buffer *leaf;
> + struct btrfs_stripe_extent *stripe_extent;
> + int num_stripes;
> + int ret;
> + int slot;
> +
> + path = btrfs_alloc_path();
> + if (!path)
> + return -ENOMEM;
> +
> + ret = btrfs_search_slot(trans, trans->fs_info->stripe_root, key, path,
> + 0, 1);
> + if (ret)
> + return ret == 1 ? ret : -EINVAL;
Looks good to me overall.
Considering in this case the bioc should match the existing rst entry,
can we add an extra ASSERT() to check the length of the entry against
the bioc?
Thanks,
Qu
> +
> + leaf = path->nodes[0];
> + slot = path->slots[0];
> +
> + btrfs_item_key_to_cpu(leaf, key, slot);
> + num_stripes = btrfs_num_raid_stripes(btrfs_item_size(leaf, slot));
> + stripe_extent = btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
> +
> + for (int i = 0; i < num_stripes; i++) {
> + u64 devid = bioc->stripes[i].dev->devid;
> + u64 physical = bioc->stripes[i].physical;
> + u64 length = bioc->stripes[i].length;
> + struct btrfs_raid_stride *raid_stride =
> + &stripe_extent->strides[i];
> +
> + if (length == 0)
> + length = bioc->size;
> +
> + btrfs_set_raid_stride_devid(leaf, raid_stride, devid);
> + btrfs_set_raid_stride_physical(leaf, raid_stride, physical);
> + }
> +
> + btrfs_mark_buffer_dirty(trans, leaf);
> + btrfs_free_path(path);
> +
> + return ret;
> +}
> +
> static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
> struct btrfs_io_context *bioc)
> {
> @@ -112,6 +159,8 @@ static int btrfs_insert_one_raid_extent(struct
> btrfs_trans_handle *trans,
>
> ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
> item_size);
> + if (ret == -EEXIST)
> + ret = update_raid_extent_item(trans, &stripe_key, bioc);
> if (ret)
> btrfs_abort_transaction(trans, ret);
>
^ permalink raw reply [flat|nested] 23+ messages in thread* Re: [PATCH v4 1/7] btrfs: replace stripe extents
2024-07-08 11:43 ` Johannes Thumshirn
2024-07-08 22:14 ` Qu Wenruo
@ 2024-07-09 5:36 ` Qu Wenruo
1 sibling, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2024-07-09 5:36 UTC (permalink / raw)
To: Johannes Thumshirn, Johannes Thumshirn, Chris Mason, Josef Bacik,
David Sterba
Cc: linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
在 2024/7/8 21:13, Johannes Thumshirn 写道:
> On 06.07.24 01:19, Qu Wenruo wrote:
>>
>>
>> 在 2024/7/6 00:43, Johannes Thumshirn 写道:
>>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>
>>> If we can't insert a stripe extent in the RAID stripe tree, because
>>> the key that points to the specific position in the stripe tree is
>>> already existing, we have to remove the item and then replace it by a
>>> new item.
>>>
>>> This can happen for example on device replace operations.
>>
>> In that case, can we just modify the targeted dev stripe?
>>
>> Or do we have other call sites that can lead to such conflicts?
>>
>> As I'm not that confident if such replace behavior would mask some real
>> problems.
>
> I've just tested the following patch and it looks like it's working:
After some more thinking, I'm wondering why dev-replace would even
trigger an RST entry update?
Normally for non-rst replace, we just reuse the scrub routine to read
out all the extents, then only write the content to the replace target,
thus there should be no update to anything (no chunk nor extent level
update).
I understand that for RST we can not directly go that routine, because
the extents' bytenr is no longer directly mapped into a chunk, thus the
data on-disk can be out-of-order and can not be directly used for
dev-replace.
But on the other hand, the extent based iteration is just to avoid
wasting IO, in theory we can just copy the dev extent from one device to
the target device, then everything should work as expected.
(The bg is marked RO, thus no new write should happen there)
Thus I'm wondering, can we just do a device extent level copying for RST
replace.
By that, we can avoid any update to RST entries at all, mirroring the
behavior of non-RST code.
Although the cost is, we have to implement a dedicated RST routine for
device-replace.
As in that case, dev-replace for RST would be something like:
- Scrub the source device dev-extent
- Copy the dev extent for that chunk directly to the target device
That can only happen if the source dev extent is all correct.
Thanks,
Qu
>
>
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index e6f7a234b8f6..7bfd8654c110 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -73,6 +73,53 @@ int btrfs_delete_raid_extent(struct
> btrfs_trans_handle *trans, u64 start, u64 le
> return ret;
> }
>
> +static int update_raid_extent_item(struct btrfs_trans_handle *trans,
> + struct btrfs_key *key,
> + struct btrfs_io_context *bioc)
> +{
> + struct btrfs_path *path;
> + struct extent_buffer *leaf;
> + struct btrfs_stripe_extent *stripe_extent;
> + int num_stripes;
> + int ret;
> + int slot;
> +
> + path = btrfs_alloc_path();
> + if (!path)
> + return -ENOMEM;
> +
> + ret = btrfs_search_slot(trans, trans->fs_info->stripe_root, key, path,
> + 0, 1);
> + if (ret)
> + return ret == 1 ? ret : -EINVAL;
> +
> + leaf = path->nodes[0];
> + slot = path->slots[0];
> +
> + btrfs_item_key_to_cpu(leaf, key, slot);
> + num_stripes = btrfs_num_raid_stripes(btrfs_item_size(leaf, slot));
> + stripe_extent = btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
> +
> + for (int i = 0; i < num_stripes; i++) {
> + u64 devid = bioc->stripes[i].dev->devid;
> + u64 physical = bioc->stripes[i].physical;
> + u64 length = bioc->stripes[i].length;
> + struct btrfs_raid_stride *raid_stride =
> + &stripe_extent->strides[i];
> +
> + if (length == 0)
> + length = bioc->size;
> +
> + btrfs_set_raid_stride_devid(leaf, raid_stride, devid);
> + btrfs_set_raid_stride_physical(leaf, raid_stride, physical);
> + }
> +
> + btrfs_mark_buffer_dirty(trans, leaf);
> + btrfs_free_path(path);
> +
> + return ret;
> +}
> +
> static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
> struct btrfs_io_context *bioc)
> {
> @@ -112,6 +159,8 @@ static int btrfs_insert_one_raid_extent(struct
> btrfs_trans_handle *trans,
>
> ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
> item_size);
> + if (ret == -EEXIST)
> + ret = update_raid_extent_item(trans, &stripe_key, bioc);
> if (ret)
> btrfs_abort_transaction(trans, ret);
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v4 2/7] btrfs: rst: don't print tree dump in case lookup fails
2024-07-05 15:13 [PATCH v4 0/7] btrfs: rst: updates for RAID stripe tree Johannes Thumshirn
2024-07-05 15:13 ` [PATCH v4 1/7] btrfs: replace stripe extents Johannes Thumshirn
@ 2024-07-05 15:13 ` Johannes Thumshirn
2024-07-05 23:20 ` Qu Wenruo
2024-07-05 15:13 ` [PATCH v4 3/7] btrfs: split RAID stripes on deletion Johannes Thumshirn
` (4 subsequent siblings)
6 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2024-07-05 15:13 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: linux-btrfs, linux-kernel, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Don't print tree dump in case a raid-stripe-tree lookup fails.
Dumping the stripe tree in case of a lookup failure was originally
intended to be a debug feature, but it turned out to be a problem, in case
of i.e. readahead.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/raid-stripe-tree.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 3b32e96c33fc..84746c8886be 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -281,10 +281,8 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
out:
if (ret > 0)
ret = -ENOENT;
- if (ret && ret != -EIO && !stripe->is_scrub) {
- if (IS_ENABLED(CONFIG_BTRFS_DEBUG))
- btrfs_print_tree(leaf, 1);
- btrfs_err(fs_info,
+ if (ret && ret != -EIO) {
+ btrfs_debug(fs_info,
"cannot find raid-stripe for logical [%llu, %llu] devid %llu, profile %s",
logical, logical + *length, stripe->dev->devid,
btrfs_bg_type_to_raid_name(map_type));
--
2.43.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* Re: [PATCH v4 2/7] btrfs: rst: don't print tree dump in case lookup fails
2024-07-05 15:13 ` [PATCH v4 2/7] btrfs: rst: don't print tree dump in case lookup fails Johannes Thumshirn
@ 2024-07-05 23:20 ` Qu Wenruo
0 siblings, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2024-07-05 23:20 UTC (permalink / raw)
To: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba
Cc: linux-btrfs, linux-kernel, Johannes Thumshirn
在 2024/7/6 00:43, Johannes Thumshirn 写道:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Don't print tree dump in case a raid-stripe-tree lookup fails.
>
> Dumping the stripe tree in case of a lookup failure was originally
> intended to be a debug feature, but it turned out to be a problem, in case
> of i.e. readahead.
But why readahead is going to cause problem?
IIRC the readahead is still based on file, in that case it still needs
to go through the data extents, then rst mapping.
Thus mind you explain more on the readahead problem?
Thanks,
Qu
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/raid-stripe-tree.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index 3b32e96c33fc..84746c8886be 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -281,10 +281,8 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
> out:
> if (ret > 0)
> ret = -ENOENT;
> - if (ret && ret != -EIO && !stripe->is_scrub) {
> - if (IS_ENABLED(CONFIG_BTRFS_DEBUG))
> - btrfs_print_tree(leaf, 1);
> - btrfs_err(fs_info,
> + if (ret && ret != -EIO) {
> + btrfs_debug(fs_info,
> "cannot find raid-stripe for logical [%llu, %llu] devid %llu, profile %s",
> logical, logical + *length, stripe->dev->devid,
> btrfs_bg_type_to_raid_name(map_type));
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v4 3/7] btrfs: split RAID stripes on deletion
2024-07-05 15:13 [PATCH v4 0/7] btrfs: rst: updates for RAID stripe tree Johannes Thumshirn
2024-07-05 15:13 ` [PATCH v4 1/7] btrfs: replace stripe extents Johannes Thumshirn
2024-07-05 15:13 ` [PATCH v4 2/7] btrfs: rst: don't print tree dump in case lookup fails Johannes Thumshirn
@ 2024-07-05 15:13 ` Johannes Thumshirn
2024-07-05 23:26 ` Qu Wenruo
2024-07-05 15:13 ` [PATCH v4 4/7] btrfs: stripe-tree: add selftests Johannes Thumshirn
` (3 subsequent siblings)
6 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2024-07-05 15:13 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: linux-btrfs, linux-kernel, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
The current RAID stripe code assumes, that we will always remove a
whole stripe entry.
But if we're only removing a part of a RAID stripe we're hitting the
ASSERT()ion checking for this condition.
Instead of assuming the complete deletion of a RAID stripe, split the
stripe if we need to.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/ctree.c | 1 +
fs/btrfs/raid-stripe-tree.c | 98 ++++++++++++++++++++++++++++++++++-----------
2 files changed, 75 insertions(+), 24 deletions(-)
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 451203055bbf..82278e54969e 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -3863,6 +3863,7 @@ static noinline int setup_leaf_for_split(struct btrfs_trans_handle *trans,
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
BUG_ON(key.type != BTRFS_EXTENT_DATA_KEY &&
+ key.type != BTRFS_RAID_STRIPE_KEY &&
key.type != BTRFS_EXTENT_CSUM_KEY);
if (btrfs_leaf_free_space(leaf) >= ins_len)
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 84746c8886be..d2a6e409b3e8 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -33,42 +33,92 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
if (!path)
return -ENOMEM;
- while (1) {
- key.objectid = start;
- key.type = BTRFS_RAID_STRIPE_KEY;
- key.offset = length;
+again:
+ key.objectid = start;
+ key.type = BTRFS_RAID_STRIPE_KEY;
+ key.offset = length;
- ret = btrfs_search_slot(trans, stripe_root, &key, path, -1, 1);
- if (ret < 0)
- break;
- if (ret > 0) {
- ret = 0;
- if (path->slots[0] == 0)
- break;
- path->slots[0]--;
- }
+ ret = btrfs_search_slot(trans, stripe_root, &key, path, -1, 1);
+ if (ret < 0)
+ goto out;
+ if (ret > 0) {
+ ret = 0;
+ if (path->slots[0] == 0)
+ goto out;
+ path->slots[0]--;
+ }
+
+ leaf = path->nodes[0];
+ slot = path->slots[0];
+ btrfs_item_key_to_cpu(leaf, &key, slot);
+ found_start = key.objectid;
+ found_end = found_start + key.offset;
+
+ /* That stripe ends before we start, we're done. */
+ if (found_end <= start)
+ goto out;
+
+ trace_btrfs_raid_extent_delete(fs_info, start, end,
+ found_start, found_end);
+
+ if (found_start < start) {
+ u64 diff = start - found_start;
+ struct btrfs_key new_key;
+ int num_stripes;
+ struct btrfs_stripe_extent *stripe_extent;
+
+ new_key.objectid = start;
+ new_key.type = BTRFS_RAID_STRIPE_KEY;
+ new_key.offset = length - diff;
+
+ ret = btrfs_duplicate_item(trans, stripe_root, path,
+ &new_key);
+ if (ret)
+ goto out;
leaf = path->nodes[0];
slot = path->slots[0];
- btrfs_item_key_to_cpu(leaf, &key, slot);
- found_start = key.objectid;
- found_end = found_start + key.offset;
- /* That stripe ends before we start, we're done. */
- if (found_end <= start)
- break;
+ num_stripes =
+ btrfs_num_raid_stripes(btrfs_item_size(leaf, slot));
+ stripe_extent =
+ btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
+
+ for (int i = 0; i < num_stripes; i++) {
+ struct btrfs_raid_stride *raid_stride =
+ &stripe_extent->strides[i];
+ u64 physical =
+ btrfs_raid_stride_physical(leaf, raid_stride);
+
+ btrfs_set_raid_stride_physical(leaf, raid_stride,
+ physical + diff);
+ }
+
+ btrfs_mark_buffer_dirty(trans, leaf);
+ btrfs_release_path(path);
+ goto again;
+ }
- trace_btrfs_raid_extent_delete(fs_info, start, end,
- found_start, found_end);
+ if (found_end > end) {
+ u64 diff = found_end - end;
+ struct btrfs_key new_key;
- ASSERT(found_start >= start && found_end <= end);
- ret = btrfs_del_item(trans, stripe_root, path);
+ new_key.objectid = found_start;
+ new_key.type = BTRFS_RAID_STRIPE_KEY;
+ new_key.offset = length - diff;
+
+ ret = btrfs_duplicate_item(trans, stripe_root, path,
+ &new_key);
if (ret)
- break;
+ goto out;
+ btrfs_mark_buffer_dirty(trans, leaf);
btrfs_release_path(path);
}
+ ret = btrfs_del_item(trans, stripe_root, path);
+
+ out:
btrfs_free_path(path);
return ret;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* Re: [PATCH v4 3/7] btrfs: split RAID stripes on deletion
2024-07-05 15:13 ` [PATCH v4 3/7] btrfs: split RAID stripes on deletion Johannes Thumshirn
@ 2024-07-05 23:26 ` Qu Wenruo
2024-07-08 4:56 ` Johannes Thumshirn
0 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2024-07-05 23:26 UTC (permalink / raw)
To: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba
Cc: linux-btrfs, linux-kernel, Johannes Thumshirn
在 2024/7/6 00:43, Johannes Thumshirn 写道:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> The current RAID stripe code assumes, that we will always remove a
> whole stripe entry.
>
> But if we're only removing a part of a RAID stripe we're hitting the
> ASSERT()ion checking for this condition.
>
> Instead of assuming the complete deletion of a RAID stripe, split the
> stripe if we need to.
Sorry to be so critical, but if I understand correctly,
btrfs_insert_one_raid_extent() does not do any merge of stripe extent.
Thus one stripe extent always means part of a data extent.
In that case a removal of a data extent should always remove all its
stripe extents.
Furthermore due to the COW nature on zoned/rst devices, the space of a
deleted data extent should not be re-allocated until a transaction
commitment.
Thus I'm wonder if this split is masking some bigger problems.
Thanks,
Qu
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/ctree.c | 1 +
> fs/btrfs/raid-stripe-tree.c | 98 ++++++++++++++++++++++++++++++++++-----------
> 2 files changed, 75 insertions(+), 24 deletions(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 451203055bbf..82278e54969e 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -3863,6 +3863,7 @@ static noinline int setup_leaf_for_split(struct btrfs_trans_handle *trans,
> btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
>
> BUG_ON(key.type != BTRFS_EXTENT_DATA_KEY &&
> + key.type != BTRFS_RAID_STRIPE_KEY &&
> key.type != BTRFS_EXTENT_CSUM_KEY);
>
> if (btrfs_leaf_free_space(leaf) >= ins_len)
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index 84746c8886be..d2a6e409b3e8 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -33,42 +33,92 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
> if (!path)
> return -ENOMEM;
>
> - while (1) {
> - key.objectid = start;
> - key.type = BTRFS_RAID_STRIPE_KEY;
> - key.offset = length;
> +again:
> + key.objectid = start;
> + key.type = BTRFS_RAID_STRIPE_KEY;
> + key.offset = length;
>
> - ret = btrfs_search_slot(trans, stripe_root, &key, path, -1, 1);
> - if (ret < 0)
> - break;
> - if (ret > 0) {
> - ret = 0;
> - if (path->slots[0] == 0)
> - break;
> - path->slots[0]--;
> - }
> + ret = btrfs_search_slot(trans, stripe_root, &key, path, -1, 1);
> + if (ret < 0)
> + goto out;
> + if (ret > 0) {
> + ret = 0;
> + if (path->slots[0] == 0)
> + goto out;
> + path->slots[0]--;
> + }
> +
> + leaf = path->nodes[0];
> + slot = path->slots[0];
> + btrfs_item_key_to_cpu(leaf, &key, slot);
> + found_start = key.objectid;
> + found_end = found_start + key.offset;
> +
> + /* That stripe ends before we start, we're done. */
> + if (found_end <= start)
> + goto out;
> +
> + trace_btrfs_raid_extent_delete(fs_info, start, end,
> + found_start, found_end);
> +
> + if (found_start < start) {
> + u64 diff = start - found_start;
> + struct btrfs_key new_key;
> + int num_stripes;
> + struct btrfs_stripe_extent *stripe_extent;
> +
> + new_key.objectid = start;
> + new_key.type = BTRFS_RAID_STRIPE_KEY;
> + new_key.offset = length - diff;
> +
> + ret = btrfs_duplicate_item(trans, stripe_root, path,
> + &new_key);
> + if (ret)
> + goto out;
>
> leaf = path->nodes[0];
> slot = path->slots[0];
> - btrfs_item_key_to_cpu(leaf, &key, slot);
> - found_start = key.objectid;
> - found_end = found_start + key.offset;
>
> - /* That stripe ends before we start, we're done. */
> - if (found_end <= start)
> - break;
> + num_stripes =
> + btrfs_num_raid_stripes(btrfs_item_size(leaf, slot));
> + stripe_extent =
> + btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
> +
> + for (int i = 0; i < num_stripes; i++) {
> + struct btrfs_raid_stride *raid_stride =
> + &stripe_extent->strides[i];
> + u64 physical =
> + btrfs_raid_stride_physical(leaf, raid_stride);
> +
> + btrfs_set_raid_stride_physical(leaf, raid_stride,
> + physical + diff);
> + }
> +
> + btrfs_mark_buffer_dirty(trans, leaf);
> + btrfs_release_path(path);
> + goto again;
> + }
>
> - trace_btrfs_raid_extent_delete(fs_info, start, end,
> - found_start, found_end);
> + if (found_end > end) {
> + u64 diff = found_end - end;
> + struct btrfs_key new_key;
>
> - ASSERT(found_start >= start && found_end <= end);
> - ret = btrfs_del_item(trans, stripe_root, path);
> + new_key.objectid = found_start;
> + new_key.type = BTRFS_RAID_STRIPE_KEY;
> + new_key.offset = length - diff;
> +
> + ret = btrfs_duplicate_item(trans, stripe_root, path,
> + &new_key);
> if (ret)
> - break;
> + goto out;
>
> + btrfs_mark_buffer_dirty(trans, leaf);
> btrfs_release_path(path);
> }
>
> + ret = btrfs_del_item(trans, stripe_root, path);
> +
> + out:
> btrfs_free_path(path);
> return ret;
> }
>
^ permalink raw reply [flat|nested] 23+ messages in thread* Re: [PATCH v4 3/7] btrfs: split RAID stripes on deletion
2024-07-05 23:26 ` Qu Wenruo
@ 2024-07-08 4:56 ` Johannes Thumshirn
2024-07-08 5:20 ` Qu Wenruo
0 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2024-07-08 4:56 UTC (permalink / raw)
To: Qu Wenruo, Johannes Thumshirn, Chris Mason, Josef Bacik,
David Sterba
Cc: linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 06.07.24 01:26, Qu Wenruo wrote:
>
>
> 在 2024/7/6 00:43, Johannes Thumshirn 写道:
>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> The current RAID stripe code assumes, that we will always remove a
>> whole stripe entry.
>>
>> But if we're only removing a part of a RAID stripe we're hitting the
>> ASSERT()ion checking for this condition.
>>
>> Instead of assuming the complete deletion of a RAID stripe, split the
>> stripe if we need to.
>
> Sorry to be so critical, but if I understand correctly,
> btrfs_insert_one_raid_extent() does not do any merge of stripe extent.
No problem at all. I want to solve bugs, not increase my patch count ;).
>
> Thus one stripe extent always means part of a data extent.
>
> In that case a removal of a data extent should always remove all its
> stripe extents.
>
> Furthermore due to the COW nature on zoned/rst devices, the space of a
> deleted data extent should not be re-allocated until a transaction
> commitment.
>
> Thus I'm wonder if this split is masking some bigger problems.
Hmm now that you're saying it. The reason I wrote this path is, that I
did hit the following ASSERT() in my testing:
>> - ASSERT(found_start >= start && found_end <= end);
This indicates a partial delete of a stripe extent. But I agree as
stripe extents are tied to extent items, this shouldn't really happen.
So maybe most of this series (apart from the deadlock fix) masks problems?
I'm back to the drawing board :(.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 3/7] btrfs: split RAID stripes on deletion
2024-07-08 4:56 ` Johannes Thumshirn
@ 2024-07-08 5:20 ` Qu Wenruo
2024-07-08 5:25 ` Johannes Thumshirn
0 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2024-07-08 5:20 UTC (permalink / raw)
To: Johannes Thumshirn, Johannes Thumshirn, Chris Mason, Josef Bacik,
David Sterba
Cc: linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
在 2024/7/8 14:26, Johannes Thumshirn 写道:
> On 06.07.24 01:26, Qu Wenruo wrote:
>>
>>
>> 在 2024/7/6 00:43, Johannes Thumshirn 写道:
>>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>
>>> The current RAID stripe code assumes, that we will always remove a
>>> whole stripe entry.
>>>
>>> But if we're only removing a part of a RAID stripe we're hitting the
>>> ASSERT()ion checking for this condition.
>>>
>>> Instead of assuming the complete deletion of a RAID stripe, split the
>>> stripe if we need to.
>>
>> Sorry to be so critical, but if I understand correctly,
>> btrfs_insert_one_raid_extent() does not do any merge of stripe extent.
>
> No problem at all. I want to solve bugs, not increase my patch count ;).
>
>>
>> Thus one stripe extent always means part of a data extent.
>>
>> In that case a removal of a data extent should always remove all its
>> stripe extents.
>>
>> Furthermore due to the COW nature on zoned/rst devices, the space of a
>> deleted data extent should not be re-allocated until a transaction
>> commitment.
>>
>> Thus I'm wonder if this split is masking some bigger problems.
>
> Hmm now that you're saying it. The reason I wrote this path is, that I
> did hit the following ASSERT() in my testing:
>
>>> - ASSERT(found_start >= start && found_end <= end);
>
> This indicates a partial delete of a stripe extent. But I agree as
> stripe extents are tied to extent items, this shouldn't really happen.
>
> So maybe most of this series (apart from the deadlock fix) masks problems?
>
> I'm back to the drawing board :(.
Can the ASSERT() be reproduced without a zoned device? (I'm really not a
fan of the existing tcmu emulated solution, meanwhile libvirt still
doesn't support ZNS devices)
If it can be reproduced just with RST feature, I may provide some help
digging into the ASSERT().
Thanks,
Qu
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 3/7] btrfs: split RAID stripes on deletion
2024-07-08 5:20 ` Qu Wenruo
@ 2024-07-08 5:25 ` Johannes Thumshirn
2024-07-08 10:52 ` Johannes Thumshirn
0 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2024-07-08 5:25 UTC (permalink / raw)
To: Qu Wenruo, Johannes Thumshirn, Chris Mason, Josef Bacik,
David Sterba
Cc: linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 08.07.24 07:20, Qu Wenruo wrote:
>
> Can the ASSERT() be reproduced without a zoned device? (I'm really not a
> fan of the existing tcmu emulated solution, meanwhile libvirt still
> doesn't support ZNS devices)
>
> If it can be reproduced just with RST feature, I may provide some help
> digging into the ASSERT().
Let me check. It's very sporadic as well unfortunately.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 3/7] btrfs: split RAID stripes on deletion
2024-07-08 5:25 ` Johannes Thumshirn
@ 2024-07-08 10:52 ` Johannes Thumshirn
2024-07-08 23:02 ` Qu Wenruo
0 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2024-07-08 10:52 UTC (permalink / raw)
To: Qu Wenruo, Johannes Thumshirn, Chris Mason, Josef Bacik,
David Sterba
Cc: linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 08.07.24 07:26, Johannes Thumshirn wrote:
> On 08.07.24 07:20, Qu Wenruo wrote:
>>
>> Can the ASSERT() be reproduced without a zoned device? (I'm really not a
>> fan of the existing tcmu emulated solution, meanwhile libvirt still
>> doesn't support ZNS devices)
>>
>> If it can be reproduced just with RST feature, I may provide some help
>> digging into the ASSERT().
>
> Let me check. It's very sporadic as well unfortunately.
>
>
OK, I've managed to trigger the failure with btrfs/070 on a
SCRATCH_DEV_POOL with 5 non-zoned devices.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 3/7] btrfs: split RAID stripes on deletion
2024-07-08 10:52 ` Johannes Thumshirn
@ 2024-07-08 23:02 ` Qu Wenruo
2024-07-09 5:51 ` Johannes Thumshirn
0 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2024-07-08 23:02 UTC (permalink / raw)
To: Johannes Thumshirn, Johannes Thumshirn, Chris Mason, Josef Bacik,
David Sterba
Cc: linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
在 2024/7/8 20:22, Johannes Thumshirn 写道:
> On 08.07.24 07:26, Johannes Thumshirn wrote:
>> On 08.07.24 07:20, Qu Wenruo wrote:
>>>
>>> Can the ASSERT() be reproduced without a zoned device? (I'm really not a
>>> fan of the existing tcmu emulated solution, meanwhile libvirt still
>>> doesn't support ZNS devices)
>>>
>>> If it can be reproduced just with RST feature, I may provide some help
>>> digging into the ASSERT().
>>
>> Let me check. It's very sporadic as well unfortunately.
>>
>>
>
> OK, I've managed to trigger the failure with btrfs/070 on a
> SCRATCH_DEV_POOL with 5 non-zoned devices.
I'm hitting errors like this:
[ 227.898320] ------------[ cut here ]------------
[ 227.898817] BTRFS: Transaction aborted (error -17)
[ 227.899250] WARNING: CPU: 7 PID: 65 at
fs/btrfs/raid-stripe-tree.c:116 btrfs_insert_raid_extent+0x337/0x3d0 [btrfs]
[ 227.900122] Modules linked in: btrfs blake2b_generic xor
zstd_compress vfat fat intel_rapl_msr intel_rapl_common crct10dif_pclmul
crc32_pclmul ghash_clmulni_intel iTCO_wdt iTCO_vendor_support
aesni_intel crypto_simd cryptd psmouse i2c_i801 pcspkr i2c_smbus lpc_ich
intel_agp intel_gtt joydev agpgart mousedev raid6_pq libcrc32c loop drm
fuse qemu_fw_cfg ext4 crc32c_generic crc16 mbcache jbd2 dm_mod
virtio_rng virtio_net virtio_blk virtio_balloon net_failover
virtio_console failover virtio_scsi rng_core dimlib usbhid virtio_pci
virtio_pci_legacy_dev crc32c_intel virtio_pci_modern_dev serio_raw
[ 227.904452] CPU: 7 PID: 65 Comm: kworker/u40:0 Not tainted
6.10.0-rc6-custom+ #167
[ 227.905220] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
unknown 2/2/2022
[ 227.905827] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[ 227.906558] RIP: 0010:btrfs_insert_raid_extent+0x337/0x3d0 [btrfs]
[ 227.907246] Code: 89 6b 08 e8 4b 18 f7 ff 49 8b 84 24 50 02 00 00 4c
39 f0 75 be 31 db e9 7d fe ff ff 89 de 48 c7 c7 f0 8d 9d a0 e8 29 a1 79
e0 <0f> 0b e9 69 ff ff ff e8 bd 95 3e e1 49 8b 46 60 48 05 48 1a 00 00
[ 227.908277] BTRFS: error (device dm-3 state A) in
btrfs_insert_one_raid_extent:116: errno=-17 Object already exists
[ 227.909356] RSP: 0018:ffffc9000026fca0 EFLAGS: 00010282
[ 227.909361] RAX: 0000000000000000 RBX: 00000000ffffffef RCX:
0000000000000027
[ 227.911934] RDX: ffff88817bda1948 RSI: 0000000000000001 RDI:
ffff88817bda1940
[ 227.912722] RBP: ffff8881029dcbe0 R08: 0000000000000000 R09:
0000000000000003
[ 227.913095] BTRFS info (device dm-3 state EA): forced readonly
[ 227.913569] R10: ffffc9000026fb38 R11: ffffffff826d0508 R12:
0000000000000010
[ 227.915182] R13: ffff8881029dcbe0 R14: ffff88812a5ff790 R15:
ffff8881488f2780
[ 227.916130] FS: 0000000000000000(0000) GS:ffff88817bd80000(0000)
knlGS:0000000000000000
[ 227.916912] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 227.917500] CR2: 0000561364dec000 CR3: 00000001583ca000 CR4:
0000000000750ef0
[ 227.918210] PKRU: 55555554
[ 227.918484] Call Trace:
[ 227.918727] <TASK>
[ 227.918940] ? __warn+0x8c/0x180
[ 227.919299] ? btrfs_insert_raid_extent+0x337/0x3d0 [btrfs]
[ 227.919891] ? report_bug+0x164/0x190
[ 227.920272] ? prb_read_valid+0x1b/0x30
[ 227.920666] ? handle_bug+0x3c/0x80
[ 227.921013] ? exc_invalid_op+0x17/0x70
[ 227.921397] ? asm_exc_invalid_op+0x1a/0x20
[ 227.921835] ? btrfs_insert_raid_extent+0x337/0x3d0 [btrfs]
[ 227.922440] btrfs_finish_one_ordered+0x3c3/0xaa0 [btrfs]
[ 227.923055] ? srso_alias_return_thunk+0x5/0xfbef5
[ 227.923549] ? srso_alias_return_thunk+0x5/0xfbef5
[ 227.924062] btrfs_work_helper+0x107/0x4c0 [btrfs]
[ 227.924612] ? lock_is_held_type+0x9a/0x110
[ 227.925040] process_one_work+0x212/0x720
[ 227.925454] ? srso_alias_return_thunk+0x5/0xfbef5
[ 227.926010] worker_thread+0x1dc/0x3b0
[ 227.926411] ? __pfx_worker_thread+0x10/0x10
[ 227.926918] kthread+0xe0/0x110
[ 227.927377] ? __pfx_kthread+0x10/0x10
[ 227.927776] ret_from_fork+0x31/0x50
[ 227.928151] ? __pfx_kthread+0x10/0x10
[ 227.928564] ret_from_fork_asm+0x1a/0x30
[ 227.929035] </TASK>
[ 227.929305] irq event stamp: 11077
[ 227.929710] hardirqs last enabled at (11085): [<ffffffff8115daf5>]
console_unlock+0x135/0x160
[ 227.930725] hardirqs last disabled at (11094): [<ffffffff8115dada>]
console_unlock+0x11a/0x160
[ 227.931730] softirqs last enabled at (10728): [<ffffffff810b4684>]
__irq_exit_rcu+0x84/0xa0
[ 227.932568] softirqs last disabled at (10723): [<ffffffff810b4684>]
__irq_exit_rcu+0x84/0xa0
[ 227.933494] ---[ end trace 0000000000000000 ]---
[ 227.933992] BTRFS: error (device dm-3 state EA) in
btrfs_insert_one_raid_extent:116: errno=-17 Object already exists
[ 227.935193] BTRFS warning (device dm-3 state EA): Skipping commit of
aborted transaction.
[ 227.936383] BTRFS: error (device dm-3 state EA) in
cleanup_transaction:2018: errno=-17 Object already exists
But not that ASSERT() yet.
I guess I need the first patch to pass this first?
Thanks,
Qu
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v4 4/7] btrfs: stripe-tree: add selftests
2024-07-05 15:13 [PATCH v4 0/7] btrfs: rst: updates for RAID stripe tree Johannes Thumshirn
` (2 preceding siblings ...)
2024-07-05 15:13 ` [PATCH v4 3/7] btrfs: split RAID stripes on deletion Johannes Thumshirn
@ 2024-07-05 15:13 ` Johannes Thumshirn
2024-07-05 15:13 ` [PATCH v4 5/7] btrfs: don't hold dev_replace rwsem over whole of btrfs_map_block Johannes Thumshirn
` (2 subsequent siblings)
6 siblings, 0 replies; 23+ messages in thread
From: Johannes Thumshirn @ 2024-07-05 15:13 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: linux-btrfs, linux-kernel, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Add self-tests for the RAID stripe tree code.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/Makefile | 3 +-
fs/btrfs/raid-stripe-tree.c | 3 +-
fs/btrfs/raid-stripe-tree.h | 5 +
fs/btrfs/tests/btrfs-tests.c | 3 +
fs/btrfs/tests/btrfs-tests.h | 1 +
fs/btrfs/tests/raid-stripe-tree-tests.c | 322 ++++++++++++++++++++++++++++++++
6 files changed, 335 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 87617f2968bc..3cfc440c636c 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -43,4 +43,5 @@ btrfs-$(CONFIG_FS_VERITY) += verity.o
btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
tests/extent-buffer-tests.o tests/btrfs-tests.o \
tests/extent-io-tests.o tests/inode-tests.o tests/qgroup-tests.o \
- tests/free-space-tree-tests.o tests/extent-map-tests.o
+ tests/free-space-tree-tests.o tests/extent-map-tests.o \
+ tests/raid-stripe-tree-tests.o
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index d2a6e409b3e8..ba0733c6be76 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -156,7 +156,8 @@ static int replace_raid_extent_item(struct btrfs_trans_handle *trans,
return ret;
}
-static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
+EXPORT_FOR_TESTS
+int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
struct btrfs_io_context *bioc)
{
struct btrfs_fs_info *fs_info = trans->fs_info;
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index 1ac1c21aac2f..541836421778 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -28,6 +28,11 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
struct btrfs_ordered_extent *ordered_extent);
+#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
+int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
+ struct btrfs_io_context *bioc);
+#endif
+
static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
u64 map_type)
{
diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index ce50847e1e01..18e1ab4a0914 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -291,6 +291,9 @@ int btrfs_run_sanity_tests(void)
ret = btrfs_test_free_space_tree(sectorsize, nodesize);
if (ret)
goto out;
+ ret = btrfs_test_raid_stripe_tree(sectorsize, nodesize);
+ if (ret)
+ goto out;
}
}
ret = btrfs_test_extent_map();
diff --git a/fs/btrfs/tests/btrfs-tests.h b/fs/btrfs/tests/btrfs-tests.h
index dc2f2ab15fa5..61bcadaf2036 100644
--- a/fs/btrfs/tests/btrfs-tests.h
+++ b/fs/btrfs/tests/btrfs-tests.h
@@ -37,6 +37,7 @@ int btrfs_test_extent_io(u32 sectorsize, u32 nodesize);
int btrfs_test_inodes(u32 sectorsize, u32 nodesize);
int btrfs_test_qgroups(u32 sectorsize, u32 nodesize);
int btrfs_test_free_space_tree(u32 sectorsize, u32 nodesize);
+int btrfs_test_raid_stripe_tree(u32 sectorsize, u32 nodesize);
int btrfs_test_extent_map(void);
struct inode *btrfs_new_test_inode(void);
struct btrfs_fs_info *btrfs_alloc_dummy_fs_info(u32 nodesize, u32 sectorsize);
diff --git a/fs/btrfs/tests/raid-stripe-tree-tests.c b/fs/btrfs/tests/raid-stripe-tree-tests.c
new file mode 100644
index 000000000000..bec7c210c14c
--- /dev/null
+++ b/fs/btrfs/tests/raid-stripe-tree-tests.c
@@ -0,0 +1,322 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2024 Western Digital Corporation. All rights reserved.
+ */
+#include <linux/array_size.h>
+#include <linux/sizes.h>
+#include <linux/btrfs.h>
+#include <linux/btrfs_tree.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include "btrfs-tests.h"
+#include "../disk-io.h"
+#include "../transaction.h"
+#include "../volumes.h"
+#include "../raid-stripe-tree.h"
+#include "../block-group.h"
+
+static struct btrfs_io_context *alloc_dummy_bioc(struct btrfs_fs_info *fs_info,
+ u64 logical, u16 total_stripes)
+{
+ struct btrfs_io_context *bioc;
+
+ bioc = kzalloc(sizeof(struct btrfs_io_context) +
+ sizeof(struct btrfs_io_stripe) * total_stripes,
+ GFP_KERNEL);
+
+ if (!bioc)
+ return NULL;
+
+ refcount_set(&bioc->refs, 1);
+
+ bioc->fs_info = fs_info;
+ bioc->replace_stripe_src = -1;
+ bioc->full_stripe_logical = (u64)-1;
+ bioc->logical = logical;
+
+ return bioc;
+}
+
+typedef int (*test_func_t)(struct btrfs_fs_info *);
+
+static int test_stripe_tree_delete_tail(struct btrfs_fs_info *fs_info)
+{
+ struct btrfs_trans_handle trans;
+ struct btrfs_io_context *bioc;
+ struct btrfs_io_stripe stripe = { };
+ const u64 map_type = BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_RAID1;
+ const int total_stripes = btrfs_bg_type_to_factor(map_type);
+ u64 logical = SZ_8K;
+ u64 length = SZ_64K;
+ u64 read_length;
+ int i;
+ int last = 0;
+ int ret;
+
+ btrfs_init_dummy_trans(&trans, fs_info);
+
+ bioc = alloc_dummy_bioc(fs_info, logical, total_stripes);
+ if (!bioc)
+ return -ENOMEM;
+
+ bioc->size = length;
+ bioc->map_type = map_type;
+ for (i = 0; i < total_stripes; ++i) {
+ struct btrfs_device *dev;
+
+ dev = kzalloc(sizeof(struct btrfs_device), GFP_KERNEL);
+ if (!dev) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ dev->devid = i;
+ bioc->stripes[i].dev = dev;
+ bioc->stripes[i].length = length;
+ bioc->stripes[i].physical = i * SZ_8K;
+ last = i;
+ }
+
+ ret = btrfs_insert_one_raid_extent(&trans, bioc);
+ if (ret)
+ goto out;
+
+ ret = btrfs_delete_raid_extent(&trans, logical, SZ_16K);
+ if (ret)
+ goto out;
+
+ stripe.dev = bioc->stripes[last].dev;
+ read_length = length - SZ_16K;
+ ret = btrfs_get_raid_extent_offset(fs_info, logical,
+ &read_length, map_type, 0, &stripe);
+ if (ret)
+ goto out;
+
+ if (read_length != length - SZ_16K) {
+ test_err("invalid length %llu vs %llu", read_length,
+ length - SZ_16K);
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (stripe.physical != bioc->stripes[last].physical) {
+ test_err("invalid physical %llu vs %llu", stripe.physical,
+ bioc->stripes[last].physical);
+ ret = -EINVAL;
+ }
+
+out:
+ for (i = 0; i < total_stripes; i++)
+ kfree(bioc->stripes[i].dev);
+
+ kfree(bioc);
+ return ret;
+}
+
+static int test_stripe_tree_delete_front(struct btrfs_fs_info *fs_info)
+{
+ struct btrfs_trans_handle trans;
+ struct btrfs_io_context *bioc;
+ const u64 map_type = BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_RAID1;
+ const int total_stripes = btrfs_bg_type_to_factor(map_type);
+ u64 logical = SZ_8K;
+ u64 length = SZ_64K;
+ u64 read_length;
+ struct btrfs_io_stripe stripe = { };
+ int i;
+ int last = 0;
+ int ret;
+
+ btrfs_init_dummy_trans(&trans, fs_info);
+
+ bioc = alloc_dummy_bioc(fs_info, logical, total_stripes);
+ if (!bioc)
+ return -ENOMEM;
+
+ bioc->size = length;
+ bioc->map_type = map_type;
+ for (i = 0; i < total_stripes; i++) {
+ struct btrfs_device *dev;
+
+ dev = kzalloc(sizeof(struct btrfs_device), GFP_KERNEL);
+ if (!dev) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ dev->devid = i;
+ bioc->stripes[i].dev = dev;
+ bioc->stripes[i].length = length;
+ bioc->stripes[i].physical = i * SZ_8K;
+ last = i;
+ }
+
+ ret = btrfs_insert_one_raid_extent(&trans, bioc);
+ if (ret)
+ goto out;
+
+ ret = btrfs_delete_raid_extent(&trans, logical, SZ_8K);
+ if (ret)
+ goto out;
+
+ stripe.dev = bioc->stripes[last].dev;
+ read_length = length - SZ_8K;
+ ret = btrfs_get_raid_extent_offset(fs_info, logical + SZ_8K,
+ &read_length, map_type, 0, &stripe);
+ if (ret)
+ goto out;
+
+ if (read_length != length - SZ_8K) {
+ test_err("invalid length %llu vs %llu", read_length,
+ length - SZ_8K);
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (stripe.physical != bioc->stripes[last].physical + SZ_8K) {
+ test_err("invalid physical %llu vs %llu", stripe.physical,
+ bioc->stripes[last].physical);
+ ret = -EINVAL;
+ }
+
+out:
+ for (i = 0; i < total_stripes; i++)
+ kfree(bioc->stripes[i].dev);
+
+ kfree(bioc);
+ return ret;
+
+}
+
+static int test_stripe_tree_delete_whole(struct btrfs_fs_info *fs_info)
+{
+ struct btrfs_trans_handle trans;
+ struct btrfs_io_context *bioc;
+ const u64 map_type = BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_RAID1;
+ const int total_stripes = btrfs_bg_type_to_factor(map_type);
+ u64 logical = SZ_8K;
+ u64 length = SZ_64K;
+ int i;
+ int ret;
+
+ btrfs_init_dummy_trans(&trans, fs_info);
+
+ bioc = alloc_dummy_bioc(fs_info, logical, total_stripes);
+ if (!bioc)
+ return -ENOMEM;
+
+ bioc->size = length;
+ bioc->map_type = map_type;
+ for (i = 0; i < total_stripes; ++i) {
+ struct btrfs_device *dev;
+
+ dev = kzalloc(sizeof(struct btrfs_device), GFP_KERNEL);
+ if (!dev) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ dev->devid = i;
+ bioc->stripes[i].dev = dev;
+ bioc->stripes[i].length = length;
+ bioc->stripes[i].physical = i * SZ_8K;
+ }
+
+ ret = btrfs_insert_one_raid_extent(&trans, bioc);
+ if (ret)
+ goto out;
+
+ ret = btrfs_delete_raid_extent(&trans, logical, length);
+ if (ret)
+ goto out;
+
+ ret = btrfs_header_nritems(fs_info->stripe_root->node);
+ if (ret != 0) {
+ test_err("test failed");
+ ret = -EINVAL;
+ }
+
+out:
+ for (i = 0; i < total_stripes; i++)
+ kfree(bioc->stripes[i].dev);
+
+ kfree(bioc);
+ return ret;
+}
+
+static int test_stripe_tree_delete(struct btrfs_fs_info *fs_info)
+{
+ test_func_t delete_tests[] = {
+ test_stripe_tree_delete_whole,
+ test_stripe_tree_delete_front,
+ test_stripe_tree_delete_tail,
+ };
+ int ret;
+
+ for (int i = 0; i < ARRAY_SIZE(delete_tests); i++) {
+ test_func_t test = delete_tests[i];
+
+ ret = test(fs_info);
+ if (ret)
+ goto out;
+ }
+
+out:
+ return ret;
+}
+
+int btrfs_test_raid_stripe_tree(u32 sectorsize, u32 nodesize)
+{
+ test_func_t tests[] = {
+ test_stripe_tree_delete,
+ };
+ struct btrfs_fs_info *fs_info;
+ struct btrfs_root *root = NULL;
+ int ret = 0;
+
+ test_msg("running raid stripe tree tests");
+
+ fs_info = btrfs_alloc_dummy_fs_info(nodesize, sectorsize);
+ if (!fs_info) {
+ test_std_err(TEST_ALLOC_FS_INFO);
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ root = btrfs_alloc_dummy_root(fs_info);
+ if (IS_ERR(root)) {
+ test_std_err(TEST_ALLOC_ROOT);
+ ret = PTR_ERR(root);
+ goto out;
+ }
+
+ root->root_key.objectid = BTRFS_RAID_STRIPE_TREE_OBJECTID;
+ root->root_key.type = BTRFS_ROOT_ITEM_KEY;
+ root->root_key.offset = 0;
+ btrfs_global_root_insert(root);
+ root->fs_info->stripe_root = root;
+ root->fs_info->tree_root = root;
+ btrfs_set_super_incompat_flags(fs_info->super_copy,
+ BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE);
+
+
+ root->node = alloc_test_extent_buffer(fs_info, nodesize);
+ if (IS_ERR(root->node)) {
+ test_std_err(TEST_ALLOC_EXTENT_BUFFER);
+ ret = PTR_ERR(root->node);
+ goto out;
+ }
+ btrfs_set_header_level(root->node, 0);
+ btrfs_set_header_nritems(root->node, 0);
+ root->alloc_bytenr += 2 * nodesize;
+
+ for (int i = 0; i < ARRAY_SIZE(tests); i++) {
+ test_func_t test = tests[i];
+
+ ret = test(fs_info);
+ if (ret)
+ goto out;
+ }
+out:
+ btrfs_free_dummy_root(root);
+ btrfs_free_dummy_fs_info(fs_info);
+ return ret;
+}
--
2.43.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* [PATCH v4 5/7] btrfs: don't hold dev_replace rwsem over whole of btrfs_map_block
2024-07-05 15:13 [PATCH v4 0/7] btrfs: rst: updates for RAID stripe tree Johannes Thumshirn
` (3 preceding siblings ...)
2024-07-05 15:13 ` [PATCH v4 4/7] btrfs: stripe-tree: add selftests Johannes Thumshirn
@ 2024-07-05 15:13 ` Johannes Thumshirn
2024-07-05 23:28 ` Qu Wenruo
2024-07-05 15:13 ` [PATCH v4 6/7] btrfs: rename brtfs_io_stripe::is_scrub to commit_root Johannes Thumshirn
2024-07-05 15:13 ` [PATCH v4 7/7] btrfs: stripe-tree: also look at commit root on relocation Johannes Thumshirn
6 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2024-07-05 15:13 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: linux-btrfs, linux-kernel, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Don't hold the dev_replace rwsem for the entirety of btrfs_map_block().
It is only needed to protect
a) calls to find_live_mirror() and
b) calling into handle_ops_on_dev_replace().
But there is no need to hold the rwsem for any kind of set_io_stripe()
calls.
So relax taking the dev_replace rwsem to only protect both cases and check
if the device replace status has changed in the meantime, for which we have
to re-do the find_live_mirror() calls.
This fixes a deadlock on raid-stripe-tree where device replace performs a
scrub operation, which in turn calls into btrfs_map_block() to find the
physical location of the block.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/volumes.c | 28 +++++++++++++++++-----------
1 file changed, 17 insertions(+), 11 deletions(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index fcedc43ef291..4209419244a1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6650,14 +6650,9 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
max_len = btrfs_max_io_len(map, map_offset, &io_geom);
*length = min_t(u64, map->chunk_len - map_offset, max_len);
+again:
down_read(&dev_replace->rwsem);
dev_replace_is_ongoing = btrfs_dev_replace_is_ongoing(dev_replace);
- /*
- * Hold the semaphore for read during the whole operation, write is
- * requested at commit time but must wait.
- */
- if (!dev_replace_is_ongoing)
- up_read(&dev_replace->rwsem);
switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
case BTRFS_BLOCK_GROUP_RAID0:
@@ -6695,6 +6690,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
"stripe index math went horribly wrong, got stripe_index=%u, num_stripes=%u",
io_geom.stripe_index, map->num_stripes);
ret = -EINVAL;
+ up_read(&dev_replace->rwsem);
goto out;
}
@@ -6710,6 +6706,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
*/
num_alloc_stripes += 2;
+ up_read(&dev_replace->rwsem);
+
/*
* If this I/O maps to a single device, try to return the device and
* physical block information on the stack instead of allocating an
@@ -6782,6 +6780,18 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
goto out;
}
+ /*
+ * Check if something changed the dev_replace state since
+ * we've checked it for the last time and if redo the whole
+ * mapping operation.
+ */
+ down_read(&dev_replace->rwsem);
+ if (dev_replace_is_ongoing !=
+ btrfs_dev_replace_is_ongoing(dev_replace)) {
+ up_read(&dev_replace->rwsem);
+ goto again;
+ }
+
if (op != BTRFS_MAP_READ)
io_geom.max_errors = btrfs_chunk_max_errors(map);
@@ -6789,6 +6799,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
op != BTRFS_MAP_READ) {
handle_ops_on_dev_replace(bioc, dev_replace, logical, &io_geom);
}
+ up_read(&dev_replace->rwsem);
*bioc_ret = bioc;
bioc->num_stripes = io_geom.num_stripes;
@@ -6796,11 +6807,6 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
bioc->mirror_num = io_geom.mirror_num;
out:
- if (dev_replace_is_ongoing) {
- lockdep_assert_held(&dev_replace->rwsem);
- /* Unlock and let waiting writers proceed */
- up_read(&dev_replace->rwsem);
- }
btrfs_free_chunk_map(map);
return ret;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* Re: [PATCH v4 5/7] btrfs: don't hold dev_replace rwsem over whole of btrfs_map_block
2024-07-05 15:13 ` [PATCH v4 5/7] btrfs: don't hold dev_replace rwsem over whole of btrfs_map_block Johannes Thumshirn
@ 2024-07-05 23:28 ` Qu Wenruo
0 siblings, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2024-07-05 23:28 UTC (permalink / raw)
To: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba
Cc: linux-btrfs, linux-kernel, Johannes Thumshirn
在 2024/7/6 00:43, Johannes Thumshirn 写道:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Don't hold the dev_replace rwsem for the entirety of btrfs_map_block().
>
> It is only needed to protect
> a) calls to find_live_mirror() and
> b) calling into handle_ops_on_dev_replace().
>
> But there is no need to hold the rwsem for any kind of set_io_stripe()
> calls.
>
> So relax taking the dev_replace rwsem to only protect both cases and check
> if the device replace status has changed in the meantime, for which we have
> to re-do the find_live_mirror() calls.
>
> This fixes a deadlock on raid-stripe-tree where device replace performs a
> scrub operation, which in turn calls into btrfs_map_block() to find the
> physical location of the block.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Thanks,
Qu
> ---
> fs/btrfs/volumes.c | 28 +++++++++++++++++-----------
> 1 file changed, 17 insertions(+), 11 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index fcedc43ef291..4209419244a1 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6650,14 +6650,9 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> max_len = btrfs_max_io_len(map, map_offset, &io_geom);
> *length = min_t(u64, map->chunk_len - map_offset, max_len);
>
> +again:
> down_read(&dev_replace->rwsem);
> dev_replace_is_ongoing = btrfs_dev_replace_is_ongoing(dev_replace);
> - /*
> - * Hold the semaphore for read during the whole operation, write is
> - * requested at commit time but must wait.
> - */
> - if (!dev_replace_is_ongoing)
> - up_read(&dev_replace->rwsem);
>
> switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
> case BTRFS_BLOCK_GROUP_RAID0:
> @@ -6695,6 +6690,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> "stripe index math went horribly wrong, got stripe_index=%u, num_stripes=%u",
> io_geom.stripe_index, map->num_stripes);
> ret = -EINVAL;
> + up_read(&dev_replace->rwsem);
> goto out;
> }
>
> @@ -6710,6 +6706,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> */
> num_alloc_stripes += 2;
>
> + up_read(&dev_replace->rwsem);
> +
> /*
> * If this I/O maps to a single device, try to return the device and
> * physical block information on the stack instead of allocating an
> @@ -6782,6 +6780,18 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> goto out;
> }
>
> + /*
> + * Check if something changed the dev_replace state since
> + * we've checked it for the last time and if redo the whole
> + * mapping operation.
> + */
> + down_read(&dev_replace->rwsem);
> + if (dev_replace_is_ongoing !=
> + btrfs_dev_replace_is_ongoing(dev_replace)) {
> + up_read(&dev_replace->rwsem);
> + goto again;
> + }
> +
> if (op != BTRFS_MAP_READ)
> io_geom.max_errors = btrfs_chunk_max_errors(map);
>
> @@ -6789,6 +6799,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> op != BTRFS_MAP_READ) {
> handle_ops_on_dev_replace(bioc, dev_replace, logical, &io_geom);
> }
> + up_read(&dev_replace->rwsem);
>
> *bioc_ret = bioc;
> bioc->num_stripes = io_geom.num_stripes;
> @@ -6796,11 +6807,6 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> bioc->mirror_num = io_geom.mirror_num;
>
> out:
> - if (dev_replace_is_ongoing) {
> - lockdep_assert_held(&dev_replace->rwsem);
> - /* Unlock and let waiting writers proceed */
> - up_read(&dev_replace->rwsem);
> - }
> btrfs_free_chunk_map(map);
> return ret;
> }
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v4 6/7] btrfs: rename brtfs_io_stripe::is_scrub to commit_root
2024-07-05 15:13 [PATCH v4 0/7] btrfs: rst: updates for RAID stripe tree Johannes Thumshirn
` (4 preceding siblings ...)
2024-07-05 15:13 ` [PATCH v4 5/7] btrfs: don't hold dev_replace rwsem over whole of btrfs_map_block Johannes Thumshirn
@ 2024-07-05 15:13 ` Johannes Thumshirn
2024-07-05 23:32 ` Qu Wenruo
2024-07-05 15:13 ` [PATCH v4 7/7] btrfs: stripe-tree: also look at commit root on relocation Johannes Thumshirn
6 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2024-07-05 15:13 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: linux-btrfs, linux-kernel, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Rename brtfs_io_stripe's is_scrub to commit_root, as this is what it
actually does, instruct btrfs_get_raid_extent_offset() to look at the
commit root.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/bio.c | 2 +-
fs/btrfs/raid-stripe-tree.c | 2 +-
fs/btrfs/scrub.c | 2 +-
fs/btrfs/volumes.h | 2 +-
4 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index f04d93109960..5f36c75a2457 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -679,7 +679,7 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
blk_status_t ret;
int error;
- smap.is_scrub = !bbio->inode;
+ smap.commit_root = !bbio->inode;
btrfs_bio_counter_inc_blocked(fs_info);
error = btrfs_map_block(fs_info, btrfs_op(bio), logical, &map_length,
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index ba0733c6be76..39085ff971c9 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -259,7 +259,7 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
if (!path)
return -ENOMEM;
- if (stripe->is_scrub) {
+ if (stripe->commit_root) {
path->skip_locking = 1;
path->search_commit_root = 1;
}
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 14a8d7100018..9c483b799cf1 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1688,7 +1688,7 @@ static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
(i << fs_info->sectorsize_bits);
int err;
- io_stripe.is_scrub = true;
+ io_stripe.commit_root = true;
stripe_len = (nr_sectors - i) << fs_info->sectorsize_bits;
/*
* For RST cases, we need to manually split the bbio to
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 37a09ebb34dd..25bc68af0df8 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -444,7 +444,7 @@ struct btrfs_io_stripe {
/* Block mapping. */
u64 physical;
u64 length;
- bool is_scrub;
+ bool commit_root;
/* For the endio handler. */
struct btrfs_io_context *bioc;
};
--
2.43.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* Re: [PATCH v4 6/7] btrfs: rename brtfs_io_stripe::is_scrub to commit_root
2024-07-05 15:13 ` [PATCH v4 6/7] btrfs: rename brtfs_io_stripe::is_scrub to commit_root Johannes Thumshirn
@ 2024-07-05 23:32 ` Qu Wenruo
0 siblings, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2024-07-05 23:32 UTC (permalink / raw)
To: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba
Cc: linux-btrfs, linux-kernel, Johannes Thumshirn
在 2024/7/6 00:43, Johannes Thumshirn 写道:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Rename brtfs_io_stripe's is_scrub to commit_root, as this is what it
> actually does, instruct btrfs_get_raid_extent_offset() to look at the
> commit root.
The commit_root name looks a little confusing to me.
Yes, it is to indicate whether we should search commit root, but since
only scrub (and dev-replace) is doing such behavior, it doesn't looks
that odd.
Furthermore, commit_root is way more common in btrfs_root::commit_root,
the same name can lead to different meaning
(btrfs_io_stripe::commit_root means whether to search commit root, while
btrfs_root::commit_root means commit root node).
I'd prefer something like "search_commit_root" if we're really going to
do a rename.
Thanks,
Qu
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/bio.c | 2 +-
> fs/btrfs/raid-stripe-tree.c | 2 +-
> fs/btrfs/scrub.c | 2 +-
> fs/btrfs/volumes.h | 2 +-
> 4 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index f04d93109960..5f36c75a2457 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -679,7 +679,7 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
> blk_status_t ret;
> int error;
>
> - smap.is_scrub = !bbio->inode;
> + smap.commit_root = !bbio->inode;
>
> btrfs_bio_counter_inc_blocked(fs_info);
> error = btrfs_map_block(fs_info, btrfs_op(bio), logical, &map_length,
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index ba0733c6be76..39085ff971c9 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -259,7 +259,7 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
> if (!path)
> return -ENOMEM;
>
> - if (stripe->is_scrub) {
> + if (stripe->commit_root) {
> path->skip_locking = 1;
> path->search_commit_root = 1;
> }
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 14a8d7100018..9c483b799cf1 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -1688,7 +1688,7 @@ static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
> (i << fs_info->sectorsize_bits);
> int err;
>
> - io_stripe.is_scrub = true;
> + io_stripe.commit_root = true;
> stripe_len = (nr_sectors - i) << fs_info->sectorsize_bits;
> /*
> * For RST cases, we need to manually split the bbio to
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 37a09ebb34dd..25bc68af0df8 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -444,7 +444,7 @@ struct btrfs_io_stripe {
> /* Block mapping. */
> u64 physical;
> u64 length;
> - bool is_scrub;
> + bool commit_root;
> /* For the endio handler. */
> struct btrfs_io_context *bioc;
> };
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v4 7/7] btrfs: stripe-tree: also look at commit root on relocation
2024-07-05 15:13 [PATCH v4 0/7] btrfs: rst: updates for RAID stripe tree Johannes Thumshirn
` (5 preceding siblings ...)
2024-07-05 15:13 ` [PATCH v4 6/7] btrfs: rename brtfs_io_stripe::is_scrub to commit_root Johannes Thumshirn
@ 2024-07-05 15:13 ` Johannes Thumshirn
6 siblings, 0 replies; 23+ messages in thread
From: Johannes Thumshirn @ 2024-07-05 15:13 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: linux-btrfs, linux-kernel, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
When doing reads on a raid stripe tree enabled system while relocation
is ongoinig, also look at the commit root. The data we're interested
in could've just been relocated to another place in the current
transaction.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/bio.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 5f36c75a2457..948509ac343e 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -679,7 +679,8 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
blk_status_t ret;
int error;
- smap.commit_root = !bbio->inode;
+ smap.commit_root = !bbio->inode ||
+ btrfs_is_data_reloc_root(inode->root);
btrfs_bio_counter_inc_blocked(fs_info);
error = btrfs_map_block(fs_info, btrfs_op(bio), logical, &map_length,
--
2.43.0
^ permalink raw reply related [flat|nested] 23+ messages in thread