* [PATCH v3 0/5] btrfs: rst: updates for RAID stripe tree
@ 2024-07-01 10:25 Johannes Thumshirn
2024-07-01 10:25 ` [PATCH v3 1/5] btrfs: replace stripe extents Johannes Thumshirn
` (4 more replies)
0 siblings, 5 replies; 18+ messages in thread
From: Johannes Thumshirn @ 2024-07-01 10:25 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: linux-btrfs, linux-kernel, Johannes Thumshirn
Patch 1 replaces stripe extents in case we hit a EEXIST when inserting a
stripe extent on a write. This can happen i.e. on device-replace.
Patch 2 splits a stripe extent on partial delete of a stripe.
Patch 3 adds selftests for the stripe-tree delete code, these selftests
can and will be extended to cover insert and get as well.
Patch 4 fixes a deadlock between scrub and device replace when RST is
used.
Patch 5 get's rid of the pointless tree dump in case we're not finding an
RST entry.
Note: There still is a known bug triggering a crash on btrfs/06[01]. I'm
working on it, but I've not yet root caused it.
---
Changes in v3:
- Drop on-disk format change as it's in for-next
- Add patches 4 & 5
- Link to v2: https://lore.kernel.org/r/20240619-b4-rst-updates-v2-0-89c34d0d5298@kernel.org
Changes in v2:
- Added selftests for delete code
- Link to v1: https://lore.kernel.org/r/20240610-b4-rst-updates-v1-0-179c1eec08f2@kernel.org
---
Johannes Thumshirn (5):
btrfs: replace stripe extents
btrfs: split RAID stripes on deletion
btrfs: stripe-tree: add selftests
btrfs: don't hold dev_replace rwsem over whole of btrfs_map_block
btrfs: rst: don't print tree dump in case lookup fails
fs/btrfs/Makefile | 3 +-
fs/btrfs/ctree.c | 1 +
fs/btrfs/raid-stripe-tree.c | 143 ++++++++++++++----
fs/btrfs/raid-stripe-tree.h | 5 +
fs/btrfs/tests/btrfs-tests.c | 3 +
fs/btrfs/tests/btrfs-tests.h | 1 +
fs/btrfs/tests/raid-stripe-tree-tests.c | 258 ++++++++++++++++++++++++++++++++
fs/btrfs/volumes.c | 28 ++--
8 files changed, 401 insertions(+), 41 deletions(-)
---
base-commit: 9c681cca9c5f8e3bd5891f3944f7b9ce4d14f4ec
change-id: 20240610-b4-rst-updates-d0aa696b9d5a
Best regards,
--
Johannes Thumshirn <jth@kernel.org>
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v3 1/5] btrfs: replace stripe extents
2024-07-01 10:25 [PATCH v3 0/5] btrfs: rst: updates for RAID stripe tree Johannes Thumshirn
@ 2024-07-01 10:25 ` Johannes Thumshirn
2024-07-01 13:57 ` Josef Bacik
2024-07-01 20:37 ` Josef Bacik
2024-07-01 10:25 ` [PATCH v3 2/5] btrfs: split RAID stripes on deletion Johannes Thumshirn
` (3 subsequent siblings)
4 siblings, 2 replies; 18+ messages in thread
From: Johannes Thumshirn @ 2024-07-01 10:25 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: linux-btrfs, linux-kernel, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
If we can't insert a stripe extent in the RAID stripe tree, because
the key that points to the specific position in the stripe tree is
already existing, we have to remove the item and then replace it by a
new item.
This can happen for example on device replace operations.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/raid-stripe-tree.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index e6f7a234b8f6..3020820dd6e2 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -73,6 +73,37 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
return ret;
}
+static int replace_raid_extent_item(struct btrfs_trans_handle *trans,
+ struct btrfs_key *key,
+ struct btrfs_stripe_extent *stripe_extent,
+ const size_t item_size)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct btrfs_root *stripe_root = fs_info->stripe_root;
+ struct btrfs_path *path;
+ int ret;
+
+ path = btrfs_alloc_path();
+ if (!path)
+ return -ENOMEM;
+
+ ret = btrfs_search_slot(trans, stripe_root, key, path, -1, 1);
+ if (ret)
+ goto err;
+
+ ret = btrfs_del_item(trans, stripe_root, path);
+ if (ret)
+ goto err;
+
+ btrfs_free_path(path);
+
+ return btrfs_insert_item(trans, stripe_root, key, stripe_extent,
+ item_size);
+ err:
+ btrfs_free_path(path);
+ return ret;
+}
+
static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
struct btrfs_io_context *bioc)
{
@@ -112,6 +143,9 @@ static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
item_size);
+ if (ret == -EEXIST)
+ ret = replace_raid_extent_item(trans, &stripe_key,
+ stripe_extent, item_size);
if (ret)
btrfs_abort_transaction(trans, ret);
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v3 2/5] btrfs: split RAID stripes on deletion
2024-07-01 10:25 [PATCH v3 0/5] btrfs: rst: updates for RAID stripe tree Johannes Thumshirn
2024-07-01 10:25 ` [PATCH v3 1/5] btrfs: replace stripe extents Johannes Thumshirn
@ 2024-07-01 10:25 ` Johannes Thumshirn
2024-07-01 14:07 ` Josef Bacik
2024-07-01 10:25 ` [PATCH v3 3/5] btrfs: stripe-tree: add selftests Johannes Thumshirn
` (2 subsequent siblings)
4 siblings, 1 reply; 18+ messages in thread
From: Johannes Thumshirn @ 2024-07-01 10:25 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: linux-btrfs, linux-kernel, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
The current RAID stripe code assumes, that we will always remove a
whole stripe entry.
But if we're only removing a part of a RAID stripe we're hitting the
ASSERT()ion checking for this condition.
Instead of assuming the complete deletion of a RAID stripe, split the
stripe if we need to.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/ctree.c | 1 +
fs/btrfs/raid-stripe-tree.c | 100 +++++++++++++++++++++++++++++++++-----------
2 files changed, 77 insertions(+), 24 deletions(-)
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index e33f9f5a228d..16f9cf6360a4 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -3863,6 +3863,7 @@ static noinline int setup_leaf_for_split(struct btrfs_trans_handle *trans,
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
BUG_ON(key.type != BTRFS_EXTENT_DATA_KEY &&
+ key.type != BTRFS_RAID_STRIPE_KEY &&
key.type != BTRFS_EXTENT_CSUM_KEY);
if (btrfs_leaf_free_space(leaf) >= ins_len)
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 3020820dd6e2..64e36b46cbab 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -33,42 +33,94 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
if (!path)
return -ENOMEM;
- while (1) {
- key.objectid = start;
- key.type = BTRFS_RAID_STRIPE_KEY;
- key.offset = length;
+again:
+ key.objectid = start;
+ key.type = BTRFS_RAID_STRIPE_KEY;
+ key.offset = length;
- ret = btrfs_search_slot(trans, stripe_root, &key, path, -1, 1);
- if (ret < 0)
- break;
- if (ret > 0) {
- ret = 0;
- if (path->slots[0] == 0)
- break;
- path->slots[0]--;
- }
+ ret = btrfs_search_slot(trans, stripe_root, &key, path, -1, 1);
+ if (ret < 0)
+ goto out;
+ if (ret > 0) {
+ ret = 0;
+ if (path->slots[0] == 0)
+ goto out;
+ path->slots[0]--;
+ }
+
+ leaf = path->nodes[0];
+ slot = path->slots[0];
+ btrfs_item_key_to_cpu(leaf, &key, slot);
+ found_start = key.objectid;
+ found_end = found_start + key.offset;
+
+ /* That stripe ends before we start, we're done. */
+ if (found_end <= start)
+ goto out;
+
+ trace_btrfs_raid_extent_delete(fs_info, start, end,
+ found_start, found_end);
+
+ if (found_start < start) {
+ u64 diff = start - found_start;
+ struct btrfs_key new_key;
+ int num_stripes;
+ struct btrfs_stripe_extent *stripe_extent;
+
+ new_key.objectid = start;
+ new_key.type = BTRFS_RAID_STRIPE_KEY;
+ new_key.offset = length - diff;
+
+ ret = btrfs_duplicate_item(trans, stripe_root, path,
+ &new_key);
+ if (ret)
+ goto out;
leaf = path->nodes[0];
slot = path->slots[0];
- btrfs_item_key_to_cpu(leaf, &key, slot);
- found_start = key.objectid;
- found_end = found_start + key.offset;
- /* That stripe ends before we start, we're done. */
- if (found_end <= start)
- break;
+ num_stripes =
+ btrfs_num_raid_stripes(btrfs_item_size(leaf, slot));
+ stripe_extent =
+ btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
+
+ for (int i = 0; i < num_stripes; i++) {
+ struct btrfs_raid_stride *raid_stride =
+ &stripe_extent->strides[i];
+ u64 physical =
+ btrfs_raid_stride_physical(leaf, raid_stride);
+
+ btrfs_set_raid_stride_physical(leaf, raid_stride,
+ physical + diff);
+ }
+
+ btrfs_mark_buffer_dirty(trans, leaf);
+ btrfs_release_path(path);
+ goto again;
+ }
+
+ if (found_end > end) {
+ u64 diff = found_end - end;
+ struct btrfs_key new_key;
- trace_btrfs_raid_extent_delete(fs_info, start, end,
- found_start, found_end);
+ new_key.objectid = found_start;
+ new_key.type = BTRFS_RAID_STRIPE_KEY;
+ new_key.offset = length - diff;
- ASSERT(found_start >= start && found_end <= end);
- ret = btrfs_del_item(trans, stripe_root, path);
+ ret = btrfs_duplicate_item(trans, stripe_root, path,
+ &new_key);
if (ret)
- break;
+ goto out;
+ btrfs_mark_buffer_dirty(trans, leaf);
btrfs_release_path(path);
+ goto again;
+
}
+ ret = btrfs_del_item(trans, stripe_root, path);
+
+ out:
btrfs_free_path(path);
return ret;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v3 3/5] btrfs: stripe-tree: add selftests
2024-07-01 10:25 [PATCH v3 0/5] btrfs: rst: updates for RAID stripe tree Johannes Thumshirn
2024-07-01 10:25 ` [PATCH v3 1/5] btrfs: replace stripe extents Johannes Thumshirn
2024-07-01 10:25 ` [PATCH v3 2/5] btrfs: split RAID stripes on deletion Johannes Thumshirn
@ 2024-07-01 10:25 ` Johannes Thumshirn
2024-07-01 14:08 ` Josef Bacik
2024-07-01 10:25 ` [PATCH v3 4/5] btrfs: don't hold dev_replace rwsem over whole of btrfs_map_block Johannes Thumshirn
2024-07-01 10:25 ` [PATCH v3 5/5] btrfs: rst: don't print tree dump in case lookup fails Johannes Thumshirn
4 siblings, 1 reply; 18+ messages in thread
From: Johannes Thumshirn @ 2024-07-01 10:25 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: linux-btrfs, linux-kernel, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Add self-tests for the RAID stripe tree code.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/Makefile | 3 +-
fs/btrfs/raid-stripe-tree.c | 3 +-
fs/btrfs/raid-stripe-tree.h | 5 +
fs/btrfs/tests/btrfs-tests.c | 3 +
fs/btrfs/tests/btrfs-tests.h | 1 +
fs/btrfs/tests/raid-stripe-tree-tests.c | 258 ++++++++++++++++++++++++++++++++
6 files changed, 271 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 87617f2968bc..3cfc440c636c 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -43,4 +43,5 @@ btrfs-$(CONFIG_FS_VERITY) += verity.o
btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
tests/extent-buffer-tests.o tests/btrfs-tests.o \
tests/extent-io-tests.o tests/inode-tests.o tests/qgroup-tests.o \
- tests/free-space-tree-tests.o tests/extent-map-tests.o
+ tests/free-space-tree-tests.o tests/extent-map-tests.o \
+ tests/raid-stripe-tree-tests.o
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 64e36b46cbab..c5c2f19387ff 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -156,7 +156,8 @@ static int replace_raid_extent_item(struct btrfs_trans_handle *trans,
return ret;
}
-static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
+EXPORT_FOR_TESTS
+int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
struct btrfs_io_context *bioc)
{
struct btrfs_fs_info *fs_info = trans->fs_info;
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index 1ac1c21aac2f..541836421778 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -28,6 +28,11 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
struct btrfs_ordered_extent *ordered_extent);
+#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
+int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
+ struct btrfs_io_context *bioc);
+#endif
+
static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
u64 map_type)
{
diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index ce50847e1e01..18e1ab4a0914 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -291,6 +291,9 @@ int btrfs_run_sanity_tests(void)
ret = btrfs_test_free_space_tree(sectorsize, nodesize);
if (ret)
goto out;
+ ret = btrfs_test_raid_stripe_tree(sectorsize, nodesize);
+ if (ret)
+ goto out;
}
}
ret = btrfs_test_extent_map();
diff --git a/fs/btrfs/tests/btrfs-tests.h b/fs/btrfs/tests/btrfs-tests.h
index dc2f2ab15fa5..61bcadaf2036 100644
--- a/fs/btrfs/tests/btrfs-tests.h
+++ b/fs/btrfs/tests/btrfs-tests.h
@@ -37,6 +37,7 @@ int btrfs_test_extent_io(u32 sectorsize, u32 nodesize);
int btrfs_test_inodes(u32 sectorsize, u32 nodesize);
int btrfs_test_qgroups(u32 sectorsize, u32 nodesize);
int btrfs_test_free_space_tree(u32 sectorsize, u32 nodesize);
+int btrfs_test_raid_stripe_tree(u32 sectorsize, u32 nodesize);
int btrfs_test_extent_map(void);
struct inode *btrfs_new_test_inode(void);
struct btrfs_fs_info *btrfs_alloc_dummy_fs_info(u32 nodesize, u32 sectorsize);
diff --git a/fs/btrfs/tests/raid-stripe-tree-tests.c b/fs/btrfs/tests/raid-stripe-tree-tests.c
new file mode 100644
index 000000000000..9a5848c4a1c4
--- /dev/null
+++ b/fs/btrfs/tests/raid-stripe-tree-tests.c
@@ -0,0 +1,258 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2024 Western Digital Corporation. All rights reserved.
+ */
+#include <linux/array_size.h>
+#include <linux/sizes.h>
+#include <linux/btrfs.h>
+#include <linux/btrfs_tree.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include "btrfs-tests.h"
+#include "../disk-io.h"
+#include "../transaction.h"
+#include "../volumes.h"
+#include "../raid-stripe-tree.h"
+#include "../block-group.h"
+
+static struct btrfs_io_context *alloc_dummy_bioc(struct btrfs_fs_info *fs_info,
+ u64 logical, u16 total_stripes)
+{
+ struct btrfs_io_context *bioc;
+
+ bioc = kzalloc(sizeof(struct btrfs_io_context) +
+ sizeof(struct btrfs_io_stripe) * total_stripes,
+ GFP_KERNEL);
+
+ if (!bioc)
+ return NULL;
+
+ refcount_set(&bioc->refs, 1);
+
+ bioc->fs_info = fs_info;
+ bioc->replace_stripe_src = -1;
+ bioc->full_stripe_logical = (u64)-1;
+ bioc->logical = logical;
+
+ return bioc;
+}
+
+typedef int (*test_func_t)(struct btrfs_fs_info *);
+
+static int test_stripe_tree_delete_tail(struct btrfs_fs_info *fs_info)
+{
+ struct btrfs_trans_handle trans;
+ struct btrfs_io_context *bioc;
+ const u64 map_type = BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_RAID1;
+ const int total_stripes = btrfs_bg_type_to_factor(map_type);
+ u64 logical = SZ_8K;
+ u64 length = SZ_64K;
+ int i;
+ int ret;
+
+ btrfs_init_dummy_trans(&trans, fs_info);
+
+ bioc = alloc_dummy_bioc(fs_info, logical, total_stripes);
+ if (!bioc)
+ return -ENOMEM;
+
+ bioc->size = length;
+ bioc->map_type = map_type;
+ for (i = 0; i < total_stripes; ++i) {
+ struct btrfs_device *dev;
+
+ dev = kzalloc(sizeof(struct btrfs_device), GFP_KERNEL);
+ if (!dev) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ dev->devid = i;
+ bioc->stripes[i].dev = dev;
+ bioc->stripes[i].length = length;
+ bioc->stripes[i].physical = i * SZ_8K;
+ }
+
+ ret = btrfs_insert_one_raid_extent(&trans, bioc);
+ if (ret)
+ goto out;
+
+ ret = btrfs_delete_raid_extent(&trans, logical, SZ_16K);
+
+out:
+ for (i = 0; i < total_stripes; i++)
+ kfree(bioc->stripes[i].dev);
+
+ kfree(bioc);
+ return ret;
+}
+
+static int test_stripe_tree_delete_front(struct btrfs_fs_info *fs_info)
+{
+ struct btrfs_trans_handle trans;
+ struct btrfs_io_context *bioc;
+ const u64 map_type = BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_RAID1;
+ const int total_stripes = btrfs_bg_type_to_factor(map_type);
+ u64 logical = SZ_8K;
+ u64 length = SZ_64K;
+ int i;
+ int ret;
+
+ btrfs_init_dummy_trans(&trans, fs_info);
+
+ bioc = alloc_dummy_bioc(fs_info, logical, total_stripes);
+ if (!bioc)
+ return -ENOMEM;
+
+ bioc->size = length;
+ bioc->map_type = map_type;
+ for (i = 0; i < total_stripes; i++) {
+ struct btrfs_device *dev;
+
+ dev = kzalloc(sizeof(struct btrfs_device), GFP_KERNEL);
+ if (!dev) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ dev->devid = i;
+ bioc->stripes[i].dev = dev;
+ bioc->stripes[i].length = length;
+ bioc->stripes[i].physical = i * SZ_8K;
+ }
+
+ ret = btrfs_insert_one_raid_extent(&trans, bioc);
+ if (ret)
+ goto out;
+ ret = btrfs_delete_raid_extent(&trans, logical, SZ_8K);
+out:
+ for (i = 0; i < total_stripes; i++)
+ kfree(bioc->stripes[i].dev);
+
+ kfree(bioc);
+ return ret;
+
+}
+
+static int test_stripe_tree_delete_whole(struct btrfs_fs_info *fs_info)
+{
+ struct btrfs_trans_handle trans;
+ struct btrfs_io_context *bioc;
+ const u64 map_type = BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_RAID1;
+ const int total_stripes = btrfs_bg_type_to_factor(map_type);
+ u64 logical = SZ_8K;
+ u64 length = SZ_64K;
+ int i;
+ int ret;
+
+ btrfs_init_dummy_trans(&trans, fs_info);
+
+ bioc = alloc_dummy_bioc(fs_info, logical, total_stripes);
+ if (!bioc)
+ return -ENOMEM;
+
+ bioc->size = length;
+ bioc->map_type = map_type;
+ for (i = 0; i < total_stripes; ++i) {
+ struct btrfs_device *dev;
+
+ dev = kzalloc(sizeof(struct btrfs_device), GFP_KERNEL);
+ if (!dev) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ dev->devid = i;
+ bioc->stripes[i].dev = dev;
+ bioc->stripes[i].length = length;
+ bioc->stripes[i].physical = i * SZ_8K;
+ }
+
+ ret = btrfs_insert_one_raid_extent(&trans, bioc);
+ if (ret)
+ goto out;
+
+ ret = btrfs_delete_raid_extent(&trans, logical, length);
+
+out:
+ for (i = 0; i < total_stripes; i++)
+ kfree(bioc->stripes[i].dev);
+
+ kfree(bioc);
+ return ret;
+}
+
+static int test_stripe_tree_delete(struct btrfs_fs_info *fs_info)
+{
+ test_func_t delete_tests[] = {
+ test_stripe_tree_delete_whole,
+ test_stripe_tree_delete_front,
+ test_stripe_tree_delete_tail,
+ };
+ int ret;
+
+ for (int i = 0; i < ARRAY_SIZE(delete_tests); i++) {
+ test_func_t test = delete_tests[i];
+
+ ret = test(fs_info);
+ if (ret)
+ goto out;
+ }
+
+out:
+ return ret;
+}
+
+int btrfs_test_raid_stripe_tree(u32 sectorsize, u32 nodesize)
+{
+ test_func_t tests[] = {
+ test_stripe_tree_delete,
+ };
+ struct btrfs_fs_info *fs_info;
+ struct btrfs_root *root = NULL;
+ int ret = 0;
+
+ test_msg("running raid-stripe-tree tests");
+
+ fs_info = btrfs_alloc_dummy_fs_info(nodesize, sectorsize);
+ if (!fs_info) {
+ test_std_err(TEST_ALLOC_FS_INFO);
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ root = btrfs_alloc_dummy_root(fs_info);
+ if (IS_ERR(root)) {
+ test_std_err(TEST_ALLOC_ROOT);
+ ret = PTR_ERR(root);
+ goto out;
+ }
+
+ btrfs_set_super_incompat_flags(fs_info->super_copy,
+ BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE);
+ root->root_key.objectid = BTRFS_RAID_STRIPE_TREE_OBJECTID;
+ root->root_key.type = BTRFS_ROOT_ITEM_KEY;
+ root->root_key.offset = 0;
+ btrfs_global_root_insert(root);
+ fs_info->stripe_root = root;
+
+ root->node = alloc_test_extent_buffer(fs_info, nodesize);
+ if (IS_ERR(root->node)) {
+ test_std_err(TEST_ALLOC_EXTENT_BUFFER);
+ ret = PTR_ERR(root->node);
+ goto out;
+ }
+ btrfs_set_header_level(root->node, 0);
+ btrfs_set_header_nritems(root->node, 0);
+ root->alloc_bytenr += 2 * nodesize;
+
+ for (int i = 0; i < ARRAY_SIZE(tests); i++) {
+ test_func_t test = tests[i];
+
+ ret = test(fs_info);
+ if (ret)
+ goto out;
+ }
+out:
+ btrfs_free_dummy_root(root);
+ btrfs_free_dummy_fs_info(fs_info);
+ return ret;
+}
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v3 4/5] btrfs: don't hold dev_replace rwsem over whole of btrfs_map_block
2024-07-01 10:25 [PATCH v3 0/5] btrfs: rst: updates for RAID stripe tree Johannes Thumshirn
` (2 preceding siblings ...)
2024-07-01 10:25 ` [PATCH v3 3/5] btrfs: stripe-tree: add selftests Johannes Thumshirn
@ 2024-07-01 10:25 ` Johannes Thumshirn
2024-07-01 14:13 ` Josef Bacik
2024-07-01 10:25 ` [PATCH v3 5/5] btrfs: rst: don't print tree dump in case lookup fails Johannes Thumshirn
4 siblings, 1 reply; 18+ messages in thread
From: Johannes Thumshirn @ 2024-07-01 10:25 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: linux-btrfs, linux-kernel, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Don't hold the dev_replace rwsem for the entirety of btrfs_map_block().
It is only needed to protect
a) calls to find_live_mirror() and
b) calling into handle_ops_on_dev_replace().
But there is no need to hold the rwsem for any kind of set_io_stripe()
calls.
So relax taking the dev_replace rwsem to only protect both cases and check
if the device replace status has changed in the meantime, for which we have
to re-do the find_live_mirror() calls.
This fixes a deadlock on raid-stripe-tree where device replace performs a
scrub operation, which in turn calls into btrfs_map_block() to find the
physical location of the block.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/volumes.c | 28 +++++++++++++++++-----------
1 file changed, 17 insertions(+), 11 deletions(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index fcedc43ef291..4209419244a1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6650,14 +6650,9 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
max_len = btrfs_max_io_len(map, map_offset, &io_geom);
*length = min_t(u64, map->chunk_len - map_offset, max_len);
+again:
down_read(&dev_replace->rwsem);
dev_replace_is_ongoing = btrfs_dev_replace_is_ongoing(dev_replace);
- /*
- * Hold the semaphore for read during the whole operation, write is
- * requested at commit time but must wait.
- */
- if (!dev_replace_is_ongoing)
- up_read(&dev_replace->rwsem);
switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
case BTRFS_BLOCK_GROUP_RAID0:
@@ -6695,6 +6690,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
"stripe index math went horribly wrong, got stripe_index=%u, num_stripes=%u",
io_geom.stripe_index, map->num_stripes);
ret = -EINVAL;
+ up_read(&dev_replace->rwsem);
goto out;
}
@@ -6710,6 +6706,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
*/
num_alloc_stripes += 2;
+ up_read(&dev_replace->rwsem);
+
/*
* If this I/O maps to a single device, try to return the device and
* physical block information on the stack instead of allocating an
@@ -6782,6 +6780,18 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
goto out;
}
+ /*
+ * Check if something changed the dev_replace state since
+ * we've checked it for the last time and if redo the whole
+ * mapping operation.
+ */
+ down_read(&dev_replace->rwsem);
+ if (dev_replace_is_ongoing !=
+ btrfs_dev_replace_is_ongoing(dev_replace)) {
+ up_read(&dev_replace->rwsem);
+ goto again;
+ }
+
if (op != BTRFS_MAP_READ)
io_geom.max_errors = btrfs_chunk_max_errors(map);
@@ -6789,6 +6799,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
op != BTRFS_MAP_READ) {
handle_ops_on_dev_replace(bioc, dev_replace, logical, &io_geom);
}
+ up_read(&dev_replace->rwsem);
*bioc_ret = bioc;
bioc->num_stripes = io_geom.num_stripes;
@@ -6796,11 +6807,6 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
bioc->mirror_num = io_geom.mirror_num;
out:
- if (dev_replace_is_ongoing) {
- lockdep_assert_held(&dev_replace->rwsem);
- /* Unlock and let waiting writers proceed */
- up_read(&dev_replace->rwsem);
- }
btrfs_free_chunk_map(map);
return ret;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v3 5/5] btrfs: rst: don't print tree dump in case lookup fails
2024-07-01 10:25 [PATCH v3 0/5] btrfs: rst: updates for RAID stripe tree Johannes Thumshirn
` (3 preceding siblings ...)
2024-07-01 10:25 ` [PATCH v3 4/5] btrfs: don't hold dev_replace rwsem over whole of btrfs_map_block Johannes Thumshirn
@ 2024-07-01 10:25 ` Johannes Thumshirn
2024-07-01 14:12 ` Josef Bacik
4 siblings, 1 reply; 18+ messages in thread
From: Johannes Thumshirn @ 2024-07-01 10:25 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: linux-btrfs, linux-kernel, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Don't print tree dump in case a raid-stripe-tree lookup fails.
Dumping the stripe tree in case of a lookup failure was originally
intended to be a debug feature, but it turned out to be a problem, in case
of i.e. readahead.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/raid-stripe-tree.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index c5c2f19387ff..20be7d0c52e6 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -332,10 +332,8 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
out:
if (ret > 0)
ret = -ENOENT;
- if (ret && ret != -EIO && !stripe->is_scrub) {
- if (IS_ENABLED(CONFIG_BTRFS_DEBUG))
- btrfs_print_tree(leaf, 1);
- btrfs_err(fs_info,
+ if (ret && ret != -EIO) {
+ btrfs_debug(fs_info,
"cannot find raid-stripe for logical [%llu, %llu] devid %llu, profile %s",
logical, logical + *length, stripe->dev->devid,
btrfs_bg_type_to_raid_name(map_type));
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/5] btrfs: replace stripe extents
2024-07-01 10:25 ` [PATCH v3 1/5] btrfs: replace stripe extents Johannes Thumshirn
@ 2024-07-01 13:57 ` Josef Bacik
2024-07-01 15:08 ` Johannes Thumshirn
2024-07-01 20:37 ` Josef Bacik
1 sibling, 1 reply; 18+ messages in thread
From: Josef Bacik @ 2024-07-01 13:57 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Chris Mason, David Sterba, linux-btrfs, linux-kernel,
Johannes Thumshirn
On Mon, Jul 01, 2024 at 12:25:15PM +0200, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> If we can't insert a stripe extent in the RAID stripe tree, because
> the key that points to the specific position in the stripe tree is
> already existing, we have to remove the item and then replace it by a
> new item.
>
> This can happen for example on device replace operations.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/raid-stripe-tree.c | 34 ++++++++++++++++++++++++++++++++++
> 1 file changed, 34 insertions(+)
>
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index e6f7a234b8f6..3020820dd6e2 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -73,6 +73,37 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
> return ret;
> }
>
> +static int replace_raid_extent_item(struct btrfs_trans_handle *trans,
> + struct btrfs_key *key,
> + struct btrfs_stripe_extent *stripe_extent,
> + const size_t item_size)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_root *stripe_root = fs_info->stripe_root;
> + struct btrfs_path *path;
> + int ret;
> +
> + path = btrfs_alloc_path();
> + if (!path)
> + return -ENOMEM;
> +
> + ret = btrfs_search_slot(trans, stripe_root, key, path, -1, 1);
> + if (ret)
> + goto err;
This will leak 1 and we'll get an awkward btrfs_abort_transaction() call. This
should be
if (ret) {
ret = (ret == 1) ? -ENOENT : ret;
goto err;
}
or whatever. Thanks,
Josef
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/5] btrfs: split RAID stripes on deletion
2024-07-01 10:25 ` [PATCH v3 2/5] btrfs: split RAID stripes on deletion Johannes Thumshirn
@ 2024-07-01 14:07 ` Josef Bacik
2024-07-03 15:47 ` Johannes Thumshirn
0 siblings, 1 reply; 18+ messages in thread
From: Josef Bacik @ 2024-07-01 14:07 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Chris Mason, David Sterba, linux-btrfs, linux-kernel,
Johannes Thumshirn
On Mon, Jul 01, 2024 at 12:25:16PM +0200, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> The current RAID stripe code assumes, that we will always remove a
> whole stripe entry.
>
> But if we're only removing a part of a RAID stripe we're hitting the
> ASSERT()ion checking for this condition.
>
> Instead of assuming the complete deletion of a RAID stripe, split the
> stripe if we need to.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/ctree.c | 1 +
> fs/btrfs/raid-stripe-tree.c | 100 +++++++++++++++++++++++++++++++++-----------
> 2 files changed, 77 insertions(+), 24 deletions(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index e33f9f5a228d..16f9cf6360a4 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -3863,6 +3863,7 @@ static noinline int setup_leaf_for_split(struct btrfs_trans_handle *trans,
> btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
>
> BUG_ON(key.type != BTRFS_EXTENT_DATA_KEY &&
> + key.type != BTRFS_RAID_STRIPE_KEY &&
> key.type != BTRFS_EXTENT_CSUM_KEY);
>
> if (btrfs_leaf_free_space(leaf) >= ins_len)
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index 3020820dd6e2..64e36b46cbab 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -33,42 +33,94 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
> if (!path)
> return -ENOMEM;
>
> - while (1) {
> - key.objectid = start;
> - key.type = BTRFS_RAID_STRIPE_KEY;
> - key.offset = length;
> +again:
> + key.objectid = start;
> + key.type = BTRFS_RAID_STRIPE_KEY;
> + key.offset = length;
>
> - ret = btrfs_search_slot(trans, stripe_root, &key, path, -1, 1);
> - if (ret < 0)
> - break;
> - if (ret > 0) {
> - ret = 0;
> - if (path->slots[0] == 0)
> - break;
> - path->slots[0]--;
> - }
> + ret = btrfs_search_slot(trans, stripe_root, &key, path, -1, 1);
> + if (ret < 0)
> + goto out;
> + if (ret > 0) {
> + ret = 0;
> + if (path->slots[0] == 0)
> + goto out;
> + path->slots[0]--;
> + }
> +
> + leaf = path->nodes[0];
> + slot = path->slots[0];
> + btrfs_item_key_to_cpu(leaf, &key, slot);
> + found_start = key.objectid;
> + found_end = found_start + key.offset;
> +
> + /* That stripe ends before we start, we're done. */
> + if (found_end <= start)
> + goto out;
> +
> + trace_btrfs_raid_extent_delete(fs_info, start, end,
> + found_start, found_end);
> +
> + if (found_start < start) {
> + u64 diff = start - found_start;
> + struct btrfs_key new_key;
> + int num_stripes;
> + struct btrfs_stripe_extent *stripe_extent;
> +
> + new_key.objectid = start;
> + new_key.type = BTRFS_RAID_STRIPE_KEY;
> + new_key.offset = length - diff;
> +
> + ret = btrfs_duplicate_item(trans, stripe_root, path,
> + &new_key);
> + if (ret)
> + goto out;
>
> leaf = path->nodes[0];
> slot = path->slots[0];
> - btrfs_item_key_to_cpu(leaf, &key, slot);
> - found_start = key.objectid;
> - found_end = found_start + key.offset;
>
> - /* That stripe ends before we start, we're done. */
> - if (found_end <= start)
> - break;
> + num_stripes =
> + btrfs_num_raid_stripes(btrfs_item_size(leaf, slot));
> + stripe_extent =
> + btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
> +
> + for (int i = 0; i < num_stripes; i++) {
> + struct btrfs_raid_stride *raid_stride =
> + &stripe_extent->strides[i];
> + u64 physical =
> + btrfs_raid_stride_physical(leaf, raid_stride);
> +
> + btrfs_set_raid_stride_physical(leaf, raid_stride,
> + physical + diff);
> + }
> +
> + btrfs_mark_buffer_dirty(trans, leaf);
> + btrfs_release_path(path);
> + goto again;
> + }
> +
> + if (found_end > end) {
> + u64 diff = found_end - end;
> + struct btrfs_key new_key;
>
> - trace_btrfs_raid_extent_delete(fs_info, start, end,
> - found_start, found_end);
> + new_key.objectid = found_start;
> + new_key.type = BTRFS_RAID_STRIPE_KEY;
> + new_key.offset = length - diff;
>
> - ASSERT(found_start >= start && found_end <= end);
> - ret = btrfs_del_item(trans, stripe_root, path);
> + ret = btrfs_duplicate_item(trans, stripe_root, path,
> + &new_key);
This seems incorrect to me. If we have [0, 1MiB) and we're deleting [0,512KiB)
then the tree at this point is
[0, BTRFS_RAID_STRIPE_KEY, 512KiB]
[0, BTRFS_RAID_STRIPE_KEY, 1MiB]
which is valid as far as key ordering goes, but is a violation of the raid
stripe tree design correct? And then you do goto again, and then you'll delete
[0, BTRFS_RAID_STRIPE_KEY, 512KiB]
but leave the old one in place, correct? Thanks,
Josef
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 3/5] btrfs: stripe-tree: add selftests
2024-07-01 10:25 ` [PATCH v3 3/5] btrfs: stripe-tree: add selftests Johannes Thumshirn
@ 2024-07-01 14:08 ` Josef Bacik
2024-07-01 15:09 ` Johannes Thumshirn
0 siblings, 1 reply; 18+ messages in thread
From: Josef Bacik @ 2024-07-01 14:08 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Chris Mason, David Sterba, linux-btrfs, linux-kernel,
Johannes Thumshirn
On Mon, Jul 01, 2024 at 12:25:17PM +0200, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Add self-tests for the RAID stripe tree code.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
These tests run the code, but don't validate that the end result is what we
expect, that should be added. Thanks,
Josef
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 5/5] btrfs: rst: don't print tree dump in case lookup fails
2024-07-01 10:25 ` [PATCH v3 5/5] btrfs: rst: don't print tree dump in case lookup fails Johannes Thumshirn
@ 2024-07-01 14:12 ` Josef Bacik
2024-07-01 15:03 ` Johannes Thumshirn
0 siblings, 1 reply; 18+ messages in thread
From: Josef Bacik @ 2024-07-01 14:12 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Chris Mason, David Sterba, linux-btrfs, linux-kernel,
Johannes Thumshirn
On Mon, Jul 01, 2024 at 12:25:19PM +0200, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Don't print tree dump in case a raid-stripe-tree lookup fails.
>
> Dumping the stripe tree in case of a lookup failure was originally
> intended to be a debug feature, but it turned out to be a problem, in case
> of i.e. readahead.
>
I have no objection to the change but I'm curious how readahead triggered this?
Is there a problem here, or is it just when there is a problem readahead makes
it particularly noisy? Thanks,
Josef
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 4/5] btrfs: don't hold dev_replace rwsem over whole of btrfs_map_block
2024-07-01 10:25 ` [PATCH v3 4/5] btrfs: don't hold dev_replace rwsem over whole of btrfs_map_block Johannes Thumshirn
@ 2024-07-01 14:13 ` Josef Bacik
0 siblings, 0 replies; 18+ messages in thread
From: Josef Bacik @ 2024-07-01 14:13 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Chris Mason, David Sterba, linux-btrfs, linux-kernel,
Johannes Thumshirn
On Mon, Jul 01, 2024 at 12:25:18PM +0200, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Don't hold the dev_replace rwsem for the entirety of btrfs_map_block().
>
> It is only needed to protect
> a) calls to find_live_mirror() and
> b) calling into handle_ops_on_dev_replace().
>
> But there is no need to hold the rwsem for any kind of set_io_stripe()
> calls.
>
> So relax taking the dev_replace rwsem to only protect both cases and check
> if the device replace status has changed in the meantime, for which we have
> to re-do the find_live_mirror() calls.
>
> This fixes a deadlock on raid-stripe-tree where device replace performs a
> scrub operation, which in turn calls into btrfs_map_block() to find the
> physical location of the block.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Thanks,
Josef
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 5/5] btrfs: rst: don't print tree dump in case lookup fails
2024-07-01 14:12 ` Josef Bacik
@ 2024-07-01 15:03 ` Johannes Thumshirn
0 siblings, 0 replies; 18+ messages in thread
From: Johannes Thumshirn @ 2024-07-01 15:03 UTC (permalink / raw)
To: Josef Bacik, Johannes Thumshirn
Cc: Chris Mason, David Sterba, linux-btrfs@vger.kernel.org,
linux-kernel@vger.kernel.org
On 01.07.24 16:13, Josef Bacik wrote:
> On Mon, Jul 01, 2024 at 12:25:19PM +0200, Johannes Thumshirn wrote:
>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> Don't print tree dump in case a raid-stripe-tree lookup fails.
>>
>> Dumping the stripe tree in case of a lookup failure was originally
>> intended to be a debug feature, but it turned out to be a problem, in case
>> of i.e. readahead.
>>
>
> I have no objection to the change but I'm curious how readahead triggered this?
> Is there a problem here, or is it just when there is a problem readahead makes
> it particularly noisy? Thanks,
There still is a bug in conjunction with RST and relocation's readahead.
But as I've stated in the cover letter, it is know, trivial to trigger,
but I haven't fully root caused it yet.
All I can say is, that we're doing a use-after-free triggered by:
relocate_data_extent()
`-> relocate_file_extent_cluster()
`-> relocate_one_folio()
`-> page_cache_sync_readahead()
`-> read_pages()
This most likely comes due to readahead requesting a l2p mapping that
RST is unaware of, then splits the read bio (that I can see in my
debugging) and the endio handler frees the original bio because of an
error while the lower layer block driver still uses it.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/5] btrfs: replace stripe extents
2024-07-01 13:57 ` Josef Bacik
@ 2024-07-01 15:08 ` Johannes Thumshirn
2024-07-01 20:34 ` Josef Bacik
0 siblings, 1 reply; 18+ messages in thread
From: Johannes Thumshirn @ 2024-07-01 15:08 UTC (permalink / raw)
To: Josef Bacik, Johannes Thumshirn
Cc: Chris Mason, David Sterba, linux-btrfs@vger.kernel.org,
linux-kernel@vger.kernel.org
On 01.07.24 15:58, Josef Bacik wrote:
> On Mon, Jul 01, 2024 at 12:25:15PM +0200, Johannes Thumshirn wrote:
>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> If we can't insert a stripe extent in the RAID stripe tree, because
>> the key that points to the specific position in the stripe tree is
>> already existing, we have to remove the item and then replace it by a
>> new item.
>>
>> This can happen for example on device replace operations.
>>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> ---
>> fs/btrfs/raid-stripe-tree.c | 34 ++++++++++++++++++++++++++++++++++
>> 1 file changed, 34 insertions(+)
>>
>> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
>> index e6f7a234b8f6..3020820dd6e2 100644
>> --- a/fs/btrfs/raid-stripe-tree.c
>> +++ b/fs/btrfs/raid-stripe-tree.c
>> @@ -73,6 +73,37 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
>> return ret;
>> }
>>
>> +static int replace_raid_extent_item(struct btrfs_trans_handle *trans,
>> + struct btrfs_key *key,
>> + struct btrfs_stripe_extent *stripe_extent,
>> + const size_t item_size)
>> +{
>> + struct btrfs_fs_info *fs_info = trans->fs_info;
>> + struct btrfs_root *stripe_root = fs_info->stripe_root;
>> + struct btrfs_path *path;
>> + int ret;
>> +
>> + path = btrfs_alloc_path();
>> + if (!path)
>> + return -ENOMEM;
>> +
>> + ret = btrfs_search_slot(trans, stripe_root, key, path, -1, 1);
>> + if (ret)
>> + goto err;
>
> This will leak 1 and we'll get an awkward btrfs_abort_transaction() call. This
> should be
>
> if (ret) {
> ret = (ret == 1) ? -ENOENT : ret;
> goto err;
> }
>
> or whatever. Thanks,
I wonder why I've never seen this in my testing. Could it be, that due
to the fact that btrfs_insert_item() returns -EEXIST on the same
key.objectid, we're more or less guaranteed it'll exist.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 3/5] btrfs: stripe-tree: add selftests
2024-07-01 14:08 ` Josef Bacik
@ 2024-07-01 15:09 ` Johannes Thumshirn
0 siblings, 0 replies; 18+ messages in thread
From: Johannes Thumshirn @ 2024-07-01 15:09 UTC (permalink / raw)
To: Josef Bacik, Johannes Thumshirn
Cc: Chris Mason, David Sterba, linux-btrfs@vger.kernel.org,
linux-kernel@vger.kernel.org
On 01.07.24 16:08, Josef Bacik wrote:
> On Mon, Jul 01, 2024 at 12:25:17PM +0200, Johannes Thumshirn wrote:
>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> Add self-tests for the RAID stripe tree code.
>>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> These tests run the code, but don't validate that the end result is what we
> expect, that should be added. Thanks,
Right, and with the patch that doesn't pollute the console when a RST
search returns -ENOENT we can actually do this correctly.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/5] btrfs: replace stripe extents
2024-07-01 15:08 ` Johannes Thumshirn
@ 2024-07-01 20:34 ` Josef Bacik
0 siblings, 0 replies; 18+ messages in thread
From: Josef Bacik @ 2024-07-01 20:34 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Johannes Thumshirn, Chris Mason, David Sterba,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On Mon, Jul 01, 2024 at 03:08:22PM +0000, Johannes Thumshirn wrote:
> On 01.07.24 15:58, Josef Bacik wrote:
> > On Mon, Jul 01, 2024 at 12:25:15PM +0200, Johannes Thumshirn wrote:
> >> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> >>
> >> If we can't insert a stripe extent in the RAID stripe tree, because
> >> the key that points to the specific position in the stripe tree is
> >> already existing, we have to remove the item and then replace it by a
> >> new item.
> >>
> >> This can happen for example on device replace operations.
> >>
> >> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> >> ---
> >> fs/btrfs/raid-stripe-tree.c | 34 ++++++++++++++++++++++++++++++++++
> >> 1 file changed, 34 insertions(+)
> >>
> >> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> >> index e6f7a234b8f6..3020820dd6e2 100644
> >> --- a/fs/btrfs/raid-stripe-tree.c
> >> +++ b/fs/btrfs/raid-stripe-tree.c
> >> @@ -73,6 +73,37 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
> >> return ret;
> >> }
> >>
> >> +static int replace_raid_extent_item(struct btrfs_trans_handle *trans,
> >> + struct btrfs_key *key,
> >> + struct btrfs_stripe_extent *stripe_extent,
> >> + const size_t item_size)
> >> +{
> >> + struct btrfs_fs_info *fs_info = trans->fs_info;
> >> + struct btrfs_root *stripe_root = fs_info->stripe_root;
> >> + struct btrfs_path *path;
> >> + int ret;
> >> +
> >> + path = btrfs_alloc_path();
> >> + if (!path)
> >> + return -ENOMEM;
> >> +
> >> + ret = btrfs_search_slot(trans, stripe_root, key, path, -1, 1);
> >> + if (ret)
> >> + goto err;
> >
> > This will leak 1 and we'll get an awkward btrfs_abort_transaction() call. This
> > should be
> >
> > if (ret) {
> > ret = (ret == 1) ? -ENOENT : ret;
> > goto err;
> > }
> >
> > or whatever. Thanks,
>
> I wonder why I've never seen this in my testing. Could it be, that due
> to the fact that btrfs_insert_item() returns -EEXIST on the same
> key.objectid, we're more or less guaranteed it'll exist.
Yeah it's fine in the way it is currently, but if anything changes in the future
we're going to figure it out and be super sad we didn't just handle it right in
the first place. Thanks,
Josef
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/5] btrfs: replace stripe extents
2024-07-01 10:25 ` [PATCH v3 1/5] btrfs: replace stripe extents Johannes Thumshirn
2024-07-01 13:57 ` Josef Bacik
@ 2024-07-01 20:37 ` Josef Bacik
2024-07-02 5:41 ` Johannes Thumshirn
1 sibling, 1 reply; 18+ messages in thread
From: Josef Bacik @ 2024-07-01 20:37 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Chris Mason, David Sterba, linux-btrfs, linux-kernel,
Johannes Thumshirn
On Mon, Jul 01, 2024 at 12:25:15PM +0200, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> If we can't insert a stripe extent in the RAID stripe tree, because
> the key that points to the specific position in the stripe tree is
> already existing, we have to remove the item and then replace it by a
> new item.
>
> This can happen for example on device replace operations.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/raid-stripe-tree.c | 34 ++++++++++++++++++++++++++++++++++
> 1 file changed, 34 insertions(+)
>
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index e6f7a234b8f6..3020820dd6e2 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -73,6 +73,37 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
> return ret;
> }
>
> +static int replace_raid_extent_item(struct btrfs_trans_handle *trans,
> + struct btrfs_key *key,
> + struct btrfs_stripe_extent *stripe_extent,
> + const size_t item_size)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_root *stripe_root = fs_info->stripe_root;
> + struct btrfs_path *path;
> + int ret;
> +
> + path = btrfs_alloc_path();
> + if (!path)
> + return -ENOMEM;
> +
> + ret = btrfs_search_slot(trans, stripe_root, key, path, -1, 1);
> + if (ret)
> + goto err;
> +
> + ret = btrfs_del_item(trans, stripe_root, path);
> + if (ret)
> + goto err;
> +
> + btrfs_free_path(path);
> +
> + return btrfs_insert_item(trans, stripe_root, key, stripe_extent,
> + item_size);
> + err:
> + btrfs_free_path(path);
> + return ret;
> +}
> +
> static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
> struct btrfs_io_context *bioc)
> {
> @@ -112,6 +143,9 @@ static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
>
> ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
> item_size);
> + if (ret == -EEXIST)
> + ret = replace_raid_extent_item(trans, &stripe_key,
> + stripe_extent, item_size);
I had another thought, how often is this particular thing happening? Bec ause
we're doing 3 path allocations here in the worst case. If it happens more than
say 10% of the time then we need to allocate a path once in
btrfs_insert_one_raid_extent(), do the insert, and if it fails re-use that path
to do the delete and insert the new one. Thanks,
Josef
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/5] btrfs: replace stripe extents
2024-07-01 20:37 ` Josef Bacik
@ 2024-07-02 5:41 ` Johannes Thumshirn
0 siblings, 0 replies; 18+ messages in thread
From: Johannes Thumshirn @ 2024-07-02 5:41 UTC (permalink / raw)
To: Josef Bacik, Johannes Thumshirn
Cc: Chris Mason, David Sterba, linux-btrfs@vger.kernel.org,
linux-kernel@vger.kernel.org
On 01.07.24 22:37, Josef Bacik wrote:
>> + if (ret == -EEXIST)
>> + ret = replace_raid_extent_item(trans, &stripe_key,
>> + stripe_extent, item_size);
>
> I had another thought, how often is this particular thing happening? Bec ause
> we're doing 3 path allocations here in the worst case. If it happens more than
> say 10% of the time then we need to allocate a path once in
> btrfs_insert_one_raid_extent(), do the insert, and if it fails re-use that path
> to do the delete and insert the new one. Thanks,
That indeed is a good question. I'll add some tracepoints to see how
often this is getting called.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 2/5] btrfs: split RAID stripes on deletion
2024-07-01 14:07 ` Josef Bacik
@ 2024-07-03 15:47 ` Johannes Thumshirn
0 siblings, 0 replies; 18+ messages in thread
From: Johannes Thumshirn @ 2024-07-03 15:47 UTC (permalink / raw)
To: Josef Bacik, Johannes Thumshirn
Cc: Chris Mason, David Sterba, linux-btrfs@vger.kernel.org,
linux-kernel@vger.kernel.org
On 01.07.24 16:07, Josef Bacik wrote:
> On Mon, Jul 01, 2024 at 12:25:16PM +0200, Johannes Thumshirn wrote:
>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> The current RAID stripe code assumes, that we will always remove a
>> whole stripe entry.
>>
>> But if we're only removing a part of a RAID stripe we're hitting the
>> ASSERT()ion checking for this condition.
>>
>> Instead of assuming the complete deletion of a RAID stripe, split the
>> stripe if we need to.
>>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> ---
>> fs/btrfs/ctree.c | 1 +
>> fs/btrfs/raid-stripe-tree.c | 100 +++++++++++++++++++++++++++++++++-----------
>> 2 files changed, 77 insertions(+), 24 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
>> index e33f9f5a228d..16f9cf6360a4 100644
>> --- a/fs/btrfs/ctree.c
>> +++ b/fs/btrfs/ctree.c
>> @@ -3863,6 +3863,7 @@ static noinline int setup_leaf_for_split(struct btrfs_trans_handle *trans,
>> btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
>>
>> BUG_ON(key.type != BTRFS_EXTENT_DATA_KEY &&
>> + key.type != BTRFS_RAID_STRIPE_KEY &&
>> key.type != BTRFS_EXTENT_CSUM_KEY);
>>
>> if (btrfs_leaf_free_space(leaf) >= ins_len)
>> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
>> index 3020820dd6e2..64e36b46cbab 100644
>> --- a/fs/btrfs/raid-stripe-tree.c
>> +++ b/fs/btrfs/raid-stripe-tree.c
>> @@ -33,42 +33,94 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
>> if (!path)
>> return -ENOMEM;
>>
>> - while (1) {
>> - key.objectid = start;
>> - key.type = BTRFS_RAID_STRIPE_KEY;
>> - key.offset = length;
>> +again:
>> + key.objectid = start;
>> + key.type = BTRFS_RAID_STRIPE_KEY;
>> + key.offset = length;
>>
>> - ret = btrfs_search_slot(trans, stripe_root, &key, path, -1, 1);
>> - if (ret < 0)
>> - break;
>> - if (ret > 0) {
>> - ret = 0;
>> - if (path->slots[0] == 0)
>> - break;
>> - path->slots[0]--;
>> - }
>> + ret = btrfs_search_slot(trans, stripe_root, &key, path, -1, 1);
>> + if (ret < 0)
>> + goto out;
>> + if (ret > 0) {
>> + ret = 0;
>> + if (path->slots[0] == 0)
>> + goto out;
>> + path->slots[0]--;
>> + }
>> +
>> + leaf = path->nodes[0];
>> + slot = path->slots[0];
>> + btrfs_item_key_to_cpu(leaf, &key, slot);
>> + found_start = key.objectid;
>> + found_end = found_start + key.offset;
>> +
>> + /* That stripe ends before we start, we're done. */
>> + if (found_end <= start)
>> + goto out;
>> +
>> + trace_btrfs_raid_extent_delete(fs_info, start, end,
>> + found_start, found_end);
>> +
>> + if (found_start < start) {
>> + u64 diff = start - found_start;
>> + struct btrfs_key new_key;
>> + int num_stripes;
>> + struct btrfs_stripe_extent *stripe_extent;
>> +
>> + new_key.objectid = start;
>> + new_key.type = BTRFS_RAID_STRIPE_KEY;
>> + new_key.offset = length - diff;
>> +
>> + ret = btrfs_duplicate_item(trans, stripe_root, path,
>> + &new_key);
>> + if (ret)
>> + goto out;
>>
>> leaf = path->nodes[0];
>> slot = path->slots[0];
>> - btrfs_item_key_to_cpu(leaf, &key, slot);
>> - found_start = key.objectid;
>> - found_end = found_start + key.offset;
>>
>> - /* That stripe ends before we start, we're done. */
>> - if (found_end <= start)
>> - break;
>> + num_stripes =
>> + btrfs_num_raid_stripes(btrfs_item_size(leaf, slot));
>> + stripe_extent =
>> + btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
>> +
>> + for (int i = 0; i < num_stripes; i++) {
>> + struct btrfs_raid_stride *raid_stride =
>> + &stripe_extent->strides[i];
>> + u64 physical =
>> + btrfs_raid_stride_physical(leaf, raid_stride);
>> +
>> + btrfs_set_raid_stride_physical(leaf, raid_stride,
>> + physical + diff);
>> + }
>> +
>> + btrfs_mark_buffer_dirty(trans, leaf);
>> + btrfs_release_path(path);
>> + goto again;
>> + }
>> +
>> + if (found_end > end) {
>> + u64 diff = found_end - end;
>> + struct btrfs_key new_key;
>>
>> - trace_btrfs_raid_extent_delete(fs_info, start, end,
>> - found_start, found_end);
>> + new_key.objectid = found_start;
>> + new_key.type = BTRFS_RAID_STRIPE_KEY;
>> + new_key.offset = length - diff;
>>
>> - ASSERT(found_start >= start && found_end <= end);
>> - ret = btrfs_del_item(trans, stripe_root, path);
>> + ret = btrfs_duplicate_item(trans, stripe_root, path,
>> + &new_key);
>
> This seems incorrect to me. If we have [0, 1MiB) and we're deleting [0,512KiB)
> then the tree at this point is
>
> [0, BTRFS_RAID_STRIPE_KEY, 512KiB]
> [0, BTRFS_RAID_STRIPE_KEY, 1MiB]
>
> which is valid as far as key ordering goes, but is a violation of the raid
> stripe tree design correct? And then you do goto again, and then you'll delete
>
> [0, BTRFS_RAID_STRIPE_KEY, 512KiB]
>
> but leave the old one in place, correct? Thanks,
Oh right, jumping back to again removes the wrong one :/
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2024-07-03 15:47 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-01 10:25 [PATCH v3 0/5] btrfs: rst: updates for RAID stripe tree Johannes Thumshirn
2024-07-01 10:25 ` [PATCH v3 1/5] btrfs: replace stripe extents Johannes Thumshirn
2024-07-01 13:57 ` Josef Bacik
2024-07-01 15:08 ` Johannes Thumshirn
2024-07-01 20:34 ` Josef Bacik
2024-07-01 20:37 ` Josef Bacik
2024-07-02 5:41 ` Johannes Thumshirn
2024-07-01 10:25 ` [PATCH v3 2/5] btrfs: split RAID stripes on deletion Johannes Thumshirn
2024-07-01 14:07 ` Josef Bacik
2024-07-03 15:47 ` Johannes Thumshirn
2024-07-01 10:25 ` [PATCH v3 3/5] btrfs: stripe-tree: add selftests Johannes Thumshirn
2024-07-01 14:08 ` Josef Bacik
2024-07-01 15:09 ` Johannes Thumshirn
2024-07-01 10:25 ` [PATCH v3 4/5] btrfs: don't hold dev_replace rwsem over whole of btrfs_map_block Johannes Thumshirn
2024-07-01 14:13 ` Josef Bacik
2024-07-01 10:25 ` [PATCH v3 5/5] btrfs: rst: don't print tree dump in case lookup fails Johannes Thumshirn
2024-07-01 14:12 ` Josef Bacik
2024-07-01 15:03 ` Johannes Thumshirn
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox