public inbox for linux-kernel@vger.kernel.org
* [PATCH v2 0/3] btrfs: more RAID stripe tree updates
@ 2024-07-11  6:21 Johannes Thumshirn
  2024-07-11  6:21 ` [PATCH v2 1/3] btrfs: don't hold dev_replace rwsem over whole of btrfs_map_block Johannes Thumshirn
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Johannes Thumshirn @ 2024-07-11  6:21 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: linux-btrfs, linux-kernel, Qu Wenruo, Filipe Manana,
	Johannes Thumshirn

Three further RST updates targeted for 6.11 (hopefully).

The first one is a reworked version of the scrub vs dev-replace deadlock
fix. It does have reviews from Josef and Qu, but I'd love to hear Filipe's
take on it.

The second one updates a stripe extent in case a write to an already
present logical address happens.

The third one corrects assumptions in the delete code. My assumption was
that we are deleting a single stripe extent on each call to
btrfs_delete_stripe_extent(). But do_free_extent_accounting() passes in a
start address and a range of bytes to be deleted, so we need to keep track
of how many bytes we have already deleted and update the loop accordingly.

---
Changes in v2:
- Add Qu's Reviewed-by on patch 2
- Add patch 3
- Link to v1: https://lore.kernel.org/r/20240709-b4-rst-updates-v1-0-200800dfe0fd@kernel.org

---
Johannes Thumshirn (3):
      btrfs: don't hold dev_replace rwsem over whole of btrfs_map_block
      btrfs: replace stripe extents
      btrfs: update stripe_extent delete loop assumptions

 fs/btrfs/raid-stripe-tree.c | 56 +++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.c          | 28 ++++++++++++++---------
 2 files changed, 73 insertions(+), 11 deletions(-)
---
base-commit: 584df860cac6e35e364ada101ccd13495b954644
change-id: 20240709-b4-rst-updates-bb9c0e49cd5b

Best regards,
-- 
Johannes Thumshirn <jth@kernel.org>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 1/3] btrfs: don't hold dev_replace rwsem over whole of btrfs_map_block
  2024-07-11  6:21 [PATCH v2 0/3] btrfs: more RAID stripe tree updates Johannes Thumshirn
@ 2024-07-11  6:21 ` Johannes Thumshirn
  2024-07-11  6:21 ` [PATCH v2 2/3] btrfs: replace stripe extents Johannes Thumshirn
  2024-07-11  6:21 ` [PATCH v2 3/3] btrfs: update stripe_extent delete loop assumptions Johannes Thumshirn
  2 siblings, 0 replies; 9+ messages in thread
From: Johannes Thumshirn @ 2024-07-11  6:21 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: linux-btrfs, linux-kernel, Qu Wenruo, Filipe Manana,
	Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Don't hold the dev_replace rwsem for the entirety of btrfs_map_block().

It is only needed to protect
a) calls to find_live_mirror() and
b) calling into handle_ops_on_dev_replace().

But there is no need to hold the rwsem for any kind of set_io_stripe()
calls.

So relax the dev_replace rwsem locking to only protect these two cases,
and check if the device replace status has changed in the meantime, in
which case we have to re-do the find_live_mirror() calls.

This fixes a deadlock on raid-stripe-tree where device replace performs a
scrub operation, which in turn calls into btrfs_map_block() to find the
physical location of the block.

Cc: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index fcedc43ef291..4209419244a1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6650,14 +6650,9 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 	max_len = btrfs_max_io_len(map, map_offset, &io_geom);
 	*length = min_t(u64, map->chunk_len - map_offset, max_len);
 
+again:
 	down_read(&dev_replace->rwsem);
 	dev_replace_is_ongoing = btrfs_dev_replace_is_ongoing(dev_replace);
-	/*
-	 * Hold the semaphore for read during the whole operation, write is
-	 * requested at commit time but must wait.
-	 */
-	if (!dev_replace_is_ongoing)
-		up_read(&dev_replace->rwsem);
 
 	switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
 	case BTRFS_BLOCK_GROUP_RAID0:
@@ -6695,6 +6690,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 			   "stripe index math went horribly wrong, got stripe_index=%u, num_stripes=%u",
 			   io_geom.stripe_index, map->num_stripes);
 		ret = -EINVAL;
+		up_read(&dev_replace->rwsem);
 		goto out;
 	}
 
@@ -6710,6 +6706,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 		 */
 		num_alloc_stripes += 2;
 
+	up_read(&dev_replace->rwsem);
+
 	/*
 	 * If this I/O maps to a single device, try to return the device and
 	 * physical block information on the stack instead of allocating an
@@ -6782,6 +6780,18 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 		goto out;
 	}
 
+	/*
 +	 * Check if something changed the dev_replace state since we
 +	 * last checked it, and if so redo the whole mapping
 +	 * operation.
+	 */
+	down_read(&dev_replace->rwsem);
+	if (dev_replace_is_ongoing !=
+	    btrfs_dev_replace_is_ongoing(dev_replace)) {
+		up_read(&dev_replace->rwsem);
+		goto again;
+	}
+
 	if (op != BTRFS_MAP_READ)
 		io_geom.max_errors = btrfs_chunk_max_errors(map);
 
@@ -6789,6 +6799,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 	    op != BTRFS_MAP_READ) {
 		handle_ops_on_dev_replace(bioc, dev_replace, logical, &io_geom);
 	}
+	up_read(&dev_replace->rwsem);
 
 	*bioc_ret = bioc;
 	bioc->num_stripes = io_geom.num_stripes;
@@ -6796,11 +6807,6 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 	bioc->mirror_num = io_geom.mirror_num;
 
 out:
-	if (dev_replace_is_ongoing) {
-		lockdep_assert_held(&dev_replace->rwsem);
-		/* Unlock and let waiting writers proceed */
-		up_read(&dev_replace->rwsem);
-	}
 	btrfs_free_chunk_map(map);
 	return ret;
 }

-- 
2.43.0



* [PATCH v2 2/3] btrfs: replace stripe extents
  2024-07-11  6:21 [PATCH v2 0/3] btrfs: more RAID stripe tree updates Johannes Thumshirn
  2024-07-11  6:21 ` [PATCH v2 1/3] btrfs: don't hold dev_replace rwsem over whole of btrfs_map_block Johannes Thumshirn
@ 2024-07-11  6:21 ` Johannes Thumshirn
  2024-07-11  7:51   ` Naohiro Aota
  2024-07-11  6:21 ` [PATCH v2 3/3] btrfs: update stripe_extent delete loop assumptions Johannes Thumshirn
  2 siblings, 1 reply; 9+ messages in thread
From: Johannes Thumshirn @ 2024-07-11  6:21 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: linux-btrfs, linux-kernel, Qu Wenruo, Filipe Manana,
	Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Update stripe extents in case a write to an already existing address
comes in.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/raid-stripe-tree.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index e6f7a234b8f6..fd56535b2289 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -73,6 +73,55 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
 	return ret;
 }
 
+static int update_raid_extent_item(struct btrfs_trans_handle *trans,
+				   struct btrfs_key *key,
+				   struct btrfs_io_context *bioc)
+{
+	struct btrfs_path *path;
+	struct extent_buffer *leaf;
+	struct btrfs_stripe_extent *stripe_extent;
+	int num_stripes;
+	int ret;
+	int slot;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	ret = btrfs_search_slot(trans, trans->fs_info->stripe_root, key, path,
+				0, 1);
+	if (ret)
+		return ret == 1 ? ret : -EINVAL;
+
+	leaf = path->nodes[0];
+	slot = path->slots[0];
+
+	btrfs_item_key_to_cpu(leaf, key, slot);
+	num_stripes = btrfs_num_raid_stripes(btrfs_item_size(leaf, slot));
+	stripe_extent = btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
+
+	ASSERT(key->offset == bioc->size);
+
+	for (int i = 0; i < num_stripes; i++) {
+		u64 devid = bioc->stripes[i].dev->devid;
+		u64 physical = bioc->stripes[i].physical;
+		u64 length = bioc->stripes[i].length;
+		struct btrfs_raid_stride *raid_stride =
+			&stripe_extent->strides[i];
+
+		if (length == 0)
+			length = bioc->size;
+
+		btrfs_set_raid_stride_devid(leaf, raid_stride, devid);
+		btrfs_set_raid_stride_physical(leaf, raid_stride, physical);
+	}
+
+	btrfs_mark_buffer_dirty(trans, leaf);
+	btrfs_free_path(path);
+
+	return ret;
+}
+
 static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
 					struct btrfs_io_context *bioc)
 {
@@ -112,6 +161,8 @@ static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
 
 	ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
 				item_size);
+	if (ret == -EEXIST)
+		ret = update_raid_extent_item(trans, &stripe_key, bioc);
 	if (ret)
 		btrfs_abort_transaction(trans, ret);
 

-- 
2.43.0



* [PATCH v2 3/3] btrfs: update stripe_extent delete loop assumptions
  2024-07-11  6:21 [PATCH v2 0/3] btrfs: more RAID stripe tree updates Johannes Thumshirn
  2024-07-11  6:21 ` [PATCH v2 1/3] btrfs: don't hold dev_replace rwsem over whole of btrfs_map_block Johannes Thumshirn
  2024-07-11  6:21 ` [PATCH v2 2/3] btrfs: replace stripe extents Johannes Thumshirn
@ 2024-07-11  6:21 ` Johannes Thumshirn
  2024-07-11  6:55   ` Qu Wenruo
  2 siblings, 1 reply; 9+ messages in thread
From: Johannes Thumshirn @ 2024-07-11  6:21 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: linux-btrfs, linux-kernel, Qu Wenruo, Filipe Manana,
	Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs_delete_raid_extent() was written under the assumption that its
call chain always passes a (start, length) tuple matching a single
extent. But btrfs_delete_raid_extent() is called by
do_free_extent_accounting(), which in turn is called by
__btrfs_free_extent().

But this call-chain passes in a start address and a length that can
possibly match multiple on-disk extents.

To make this possible, we have to adjust the start and length of each
btree node lookup, to not delete beyond the requested range.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/raid-stripe-tree.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index fd56535b2289..6f65be334637 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -66,6 +66,11 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
 		if (ret)
 			break;
 
+		start += key.offset;
+		length -= key.offset;
+		if (length == 0)
+			break;
+
 		btrfs_release_path(path);
 	}
 

-- 
2.43.0



* Re: [PATCH v2 3/3] btrfs: update stripe_extent delete loop assumptions
  2024-07-11  6:21 ` [PATCH v2 3/3] btrfs: update stripe_extent delete loop assumptions Johannes Thumshirn
@ 2024-07-11  6:55   ` Qu Wenruo
  2024-07-11  7:44     ` Qu Wenruo
  0 siblings, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2024-07-11  6:55 UTC (permalink / raw)
  To: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba
  Cc: linux-btrfs, linux-kernel, Filipe Manana, Johannes Thumshirn



On 2024/7/11 15:51, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> btrfs_delete_raid_extent() was written under the assumption that its
> call chain always passes a (start, length) tuple matching a single
> extent. But btrfs_delete_raid_extent() is called by
> do_free_extent_accounting(), which in turn is called by
> __btrfs_free_extent().

But from the call site __btrfs_free_extent(), it is still called for a 
single extent.

Or we will get an error and abort the current transaction.

> 
> But this call-chain passes in a start address and a length that can
> possibly match multiple on-disk extents.

Mind to give a more detailed example on this?

Thanks,
Qu

> 
> To make this possible, we have to adjust the start and length of each
> btree node lookup, to not delete beyond the requested range.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>   fs/btrfs/raid-stripe-tree.c | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index fd56535b2289..6f65be334637 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -66,6 +66,11 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
>   		if (ret)
>   			break;
>   
> +		start += key.offset;
> +		length -= key.offset;
> +		if (length == 0)
> +			break;
> +
>   		btrfs_release_path(path);
>   	}
>   
> 


* Re: [PATCH v2 3/3] btrfs: update stripe_extent delete loop assumptions
  2024-07-11  6:55   ` Qu Wenruo
@ 2024-07-11  7:44     ` Qu Wenruo
  2024-07-11  7:55       ` Qu Wenruo
  0 siblings, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2024-07-11  7:44 UTC (permalink / raw)
  To: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba
  Cc: linux-btrfs, linux-kernel, Filipe Manana, Johannes Thumshirn



On 2024/7/11 16:25, Qu Wenruo wrote:
> 
> 
> On 2024/7/11 15:51, Johannes Thumshirn wrote:
>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> btrfs_delete_raid_extent() was written under the assumption that its
>> call chain always passes a (start, length) tuple matching a single
>> extent. But btrfs_delete_raid_extent() is called by
>> do_free_extent_accounting(), which in turn is called by
>> __btrfs_free_extent().
> 
> But from the call site __btrfs_free_extent(), it is still called for a 
> single extent.
> 
> Or we will get an error and abort the current transaction.

Or does it mean one data extent can have multiple RST entries?

Is that a non-zoned-RST-specific behavior?
As I remember, we split ordered extents for zoned devices, so there
should always be one extent for each split OE.

Thanks,
Qu
> 
>>
>> But this call-chain passes in a start address and a length that can
>> possibly match multiple on-disk extents.
> 
> Mind to give a more detailed example on this?
> 
> Thanks,
> Qu
> 
>>
>> To make this possible, we have to adjust the start and length of each
>> btree node lookup, to not delete beyond the requested range.
>>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> ---
>>   fs/btrfs/raid-stripe-tree.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
>> index fd56535b2289..6f65be334637 100644
>> --- a/fs/btrfs/raid-stripe-tree.c
>> +++ b/fs/btrfs/raid-stripe-tree.c
>> @@ -66,6 +66,11 @@ int btrfs_delete_raid_extent(struct 
>> btrfs_trans_handle *trans, u64 start, u64 le
>>           if (ret)
>>               break;
>> +        start += key.offset;
>> +        length -= key.offset;
>> +        if (length == 0)
>> +            break;
>> +
>>           btrfs_release_path(path);
>>       }
>>
> 


* Re: [PATCH v2 2/3] btrfs: replace stripe extents
  2024-07-11  6:21 ` [PATCH v2 2/3] btrfs: replace stripe extents Johannes Thumshirn
@ 2024-07-11  7:51   ` Naohiro Aota
  2024-07-12  6:34     ` Johannes Thumshirn
  0 siblings, 1 reply; 9+ messages in thread
From: Naohiro Aota @ 2024-07-11  7:51 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Chris Mason, Josef Bacik, David Sterba,
	linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org,
	Qu Wenruo, Filipe Manana, Johannes Thumshirn

On Thu, Jul 11, 2024 at 08:21:31AM GMT, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> Update stripe extents in case a write to an already existing address
> comes in.
> 
> Reviewed-by: Qu Wenruo <wqu@suse.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/raid-stripe-tree.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index e6f7a234b8f6..fd56535b2289 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -73,6 +73,55 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
>  	return ret;
>  }
>  
> +static int update_raid_extent_item(struct btrfs_trans_handle *trans,
> +				   struct btrfs_key *key,
> +				   struct btrfs_io_context *bioc)
> +{
> +	struct btrfs_path *path;
> +	struct extent_buffer *leaf;
> +	struct btrfs_stripe_extent *stripe_extent;
> +	int num_stripes;
> +	int ret;
> +	int slot;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	ret = btrfs_search_slot(trans, trans->fs_info->stripe_root, key, path,
> +				0, 1);
> +	if (ret)
> +		return ret == 1 ? ret : -EINVAL;
> +
> +	leaf = path->nodes[0];
> +	slot = path->slots[0];
> +
> +	btrfs_item_key_to_cpu(leaf, key, slot);
> +	num_stripes = btrfs_num_raid_stripes(btrfs_item_size(leaf, slot));
> +	stripe_extent = btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
> +
> +	ASSERT(key->offset == bioc->size);
> +
> +	for (int i = 0; i < num_stripes; i++) {
> +		u64 devid = bioc->stripes[i].dev->devid;
> +		u64 physical = bioc->stripes[i].physical;
> +		u64 length = bioc->stripes[i].length;
> +		struct btrfs_raid_stride *raid_stride =
> +			&stripe_extent->strides[i];
> +
> +		if (length == 0)
> +			length = bioc->size;
> +
> +		btrfs_set_raid_stride_devid(leaf, raid_stride, devid);
> +		btrfs_set_raid_stride_physical(leaf, raid_stride, physical);
> +	}

Can we take the "stripe_extent" and item_size and use write_extent_buffer()
to overwrite the item here? Then, we don't need duplicated code.

> +
> +	btrfs_mark_buffer_dirty(trans, leaf);
> +	btrfs_free_path(path);
> +
> +	return ret;
> +}
> +
>  static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
>  					struct btrfs_io_context *bioc)
>  {
> @@ -112,6 +161,8 @@ static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
>  
>  	ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
>  				item_size);
> +	if (ret == -EEXIST)
> +		ret = update_raid_extent_item(trans, &stripe_key, bioc);
>  	if (ret)
>  		btrfs_abort_transaction(trans, ret);
>  
> 
> -- 
> 2.43.0
> 


* Re: [PATCH v2 3/3] btrfs: update stripe_extent delete loop assumptions
  2024-07-11  7:44     ` Qu Wenruo
@ 2024-07-11  7:55       ` Qu Wenruo
  0 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2024-07-11  7:55 UTC (permalink / raw)
  To: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba
  Cc: linux-btrfs, linux-kernel, Filipe Manana, Johannes Thumshirn



On 2024/7/11 17:14, Qu Wenruo wrote:
> 
> 
> On 2024/7/11 16:25, Qu Wenruo wrote:
>>
>>
>> On 2024/7/11 15:51, Johannes Thumshirn wrote:
>>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>
>>> btrfs_delete_raid_extent() was written under the assumption that its
>>> call chain always passes a (start, length) tuple matching a single
>>> extent. But btrfs_delete_raid_extent() is called by
>>> do_free_extent_accounting(), which in turn is called by
>>> __btrfs_free_extent().
>>
>> But from the call site __btrfs_free_extent(), it is still called for a 
>> single extent.
>>
>> Or we will get an error and abort the current transaction.
> 
> Or does it mean one data extent can have multiple RST entries?
> 
> Is that a non-zoned-RST-specific behavior?
> As I remember, we split ordered extents for zoned devices, so there
> should always be one extent for each split OE.


OK, it's indeed an RST specific behavior (at least for RST with 
non-zoned devices).

I can have the following layout:

         item 15 key (258 EXTENT_DATA 419430400) itemoff 15306 itemsize 53
                 generation 10 type 1 (regular)
                 extent data disk byte 1808793600 nr 117440512
                 extent data offset 0 nr 117440512 ram 117440512
                 extent compression 0 (none)

Which is a large data extent with 112MiB length.

Meanwhile for the RST entries there are 3 split ones:

         item 13 key (1808793600 RAID_STRIPE 33619968) itemoff 15835 
itemsize 32
                         stripe 0 devid 2 physical 1787822080
                         stripe 1 devid 1 physical 1808793600
         item 14 key (1842413568 RAID_STRIPE 58789888) itemoff 15803 
itemsize 32
                         stripe 0 devid 2 physical 1821442048
                         stripe 1 devid 1 physical 1842413568
         item 15 key (1901203456 RAID_STRIPE 25030656) itemoff 15771 
itemsize 32
                         stripe 0 devid 2 physical 1880231936
                         stripe 1 devid 1 physical 1901203456

So yes, it's possible to have multiple RST entries for a single data
extent; this is no longer the old zoned behavior.

In that case, the patch looks fine to me.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu


> 
> Thanks,
> Qu
>>
>>>
>>> But this call-chain passes in a start address and a length that can
>>> possibly match multiple on-disk extents.
>>
>> Mind to give a more detailed example on this?
>>
>> Thanks,
>> Qu
>>
>>>
>>> To make this possible, we have to adjust the start and length of each
>>> btree node lookup, to not delete beyond the requested range.
>>>
>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>> ---
>>>   fs/btrfs/raid-stripe-tree.c | 5 +++++
>>>   1 file changed, 5 insertions(+)
>>>
>>> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
>>> index fd56535b2289..6f65be334637 100644
>>> --- a/fs/btrfs/raid-stripe-tree.c
>>> +++ b/fs/btrfs/raid-stripe-tree.c
>>> @@ -66,6 +66,11 @@ int btrfs_delete_raid_extent(struct 
>>> btrfs_trans_handle *trans, u64 start, u64 le
>>>           if (ret)
>>>               break;
>>> +        start += key.offset;
>>> +        length -= key.offset;
>>> +        if (length == 0)
>>> +            break;
>>> +
>>>           btrfs_release_path(path);
>>>       }
>>>
>>
> 


* Re: [PATCH v2 2/3] btrfs: replace stripe extents
  2024-07-11  7:51   ` Naohiro Aota
@ 2024-07-12  6:34     ` Johannes Thumshirn
  0 siblings, 0 replies; 9+ messages in thread
From: Johannes Thumshirn @ 2024-07-12  6:34 UTC (permalink / raw)
  To: Naohiro Aota, Johannes Thumshirn
  Cc: Chris Mason, Josef Bacik, David Sterba,
	linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org,
	Qu Wenruo, Filipe Manana

On 11.07.24 09:51, Naohiro Aota wrote:
> Can we take the "stripe_extent" and item_size and use write_extent_buffer()
> to overwrite the item here? Then, we don't need duplicated code.

This could indeed work, let me give it a try.


end of thread, other threads:[~2024-07-12  6:34 UTC | newest]
