* [PATCH v8 00/11] btrfs: introduce RAID stripe tree
@ 2023-09-11 12:52 Johannes Thumshirn
2023-09-11 12:52 ` [PATCH v8 01/11] btrfs: add raid stripe tree definitions Johannes Thumshirn
` (10 more replies)
0 siblings, 11 replies; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-11 12:52 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Johannes Thumshirn, Christoph Hellwig, Naohiro Aota, Qu Wenruo,
Damien Le Moal, linux-btrfs, linux-kernel, Anand Jain
Updates of the raid-stripe-tree are done at ordered extent write time to save
on bandwidth, while for reading we do the stripe-tree lookup at bio mapping
time, i.e. when the logical to physical translation happens for regular btrfs
RAID as well.
The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes, and
its contents are the respective physical device id and position.
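To illustrate the mapping, below is a minimal userspace sketch (not part of
the patchset; the types and the resolve() helper are simplified stand-ins for
the real on-disk structures) of how an entry keyed by (disk_bytenr,
disk_num_bytes) resolves a logical address to a per-device physical position:

/* Simplified, illustrative model of a RAID stripe tree entry; the real
 * on-disk structures live in include/uapi/linux/btrfs_tree.h. */
#include <stdint.h>
#include <stdio.h>

struct stride { uint64_t devid; uint64_t physical; };

struct stripe_entry {
        uint64_t logical;       /* key.objectid == extent disk_bytenr */
        uint64_t length;        /* key.offset   == extent disk_num_bytes */
        int nr_strides;
        struct stride strides[2];
};

/* Resolve a logical address to the physical position on a given device. */
static int resolve(const struct stripe_entry *e, uint64_t logical,
                   uint64_t devid, uint64_t *physical)
{
        if (logical < e->logical || logical >= e->logical + e->length)
                return -1;
        for (int i = 0; i < e->nr_strides; i++) {
                if (e->strides[i].devid != devid)
                        continue;
                *physical = e->strides[i].physical + (logical - e->logical);
                return 0;
        }
        return -1;
}

int main(void)
{
        /* Mirrors the RAID1 dump below: one 128K extent on two devices. */
        struct stripe_entry e = {
                .logical = 939524096, .length = 131072, .nr_strides = 2,
                .strides = { { 1, 939524096 }, { 2, 536870912 } },
        };
        uint64_t phys;

        if (!resolve(&e, e.logical + 4096, 2, &phys))
                printf("devid 2 physical %llu\n", (unsigned long long)phys);
        return 0;
}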
For example, a 1M write (split into 126K segments due to zone append):
rapido2:/home/johannes/src/fstests# xfs_io -fdc "pwrite -b 1M 0 1M" -c fsync /mnt/test/test
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0065 sec (151.538 MiB/sec and 151.5381 ops/sec)
The tree then looks as follows (both dumps show a 128K buffered write to a ZNS drive):
RAID0 case:
bash-5.2# btrfs inspect-internal dump-tree -t raid_stripe /dev/nvme0n1
btrfs-progs v6.3
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 805535744 items 1 free space 16218 generation 8 owner RAID_STRIPE_TREE
leaf 805535744 flags 0x1(WRITTEN) backref revision 1
checksum stored 2d2d2262
checksum calced 2d2d2262
fs uuid ab05cfc6-9859-404e-970d-3999b1cb5438
chunk uuid c9470ba2-49ac-4d46-8856-438a18e6bd23
item 0 key (1073741824 RAID_STRIPE_KEY 131072) itemoff 16243 itemsize 40
encoding: RAID0
stripe 0 devid 1 offset 805306368
stripe 1 devid 2 offset 536870912
total bytes 42949672960
bytes used 294912
uuid ab05cfc6-9859-404e-970d-3999b1cb5438
RAID1 case:
bash-5.2# btrfs inspect-internal dump-tree -t raid_stripe /dev/nvme0n1
btrfs-progs v6.3
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 805535744 items 1 free space 16218 generation 8 owner RAID_STRIPE_TREE
leaf 805535744 flags 0x1(WRITTEN) backref revision 1
checksum stored 56199539
checksum calced 56199539
fs uuid 9e693a37-fbd1-4891-aed2-e7fe64605045
chunk uuid 691874fc-1b9c-469b-bd7f-05e0e6ba88c4
item 0 key (939524096 RAID_STRIPE_KEY 131072) itemoff 16243 itemsize 40
encoding: RAID1
stripe 0 devid 1 offset 939524096
stripe 1 devid 2 offset 536870912
total bytes 42949672960
bytes used 294912
uuid 9e693a37-fbd1-4891-aed2-e7fe64605045
A design document can be found here:
https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true
The user-space part of this series can be found here:
https://lore.kernel.org/linux-btrfs/20230215143109.2721722-1-johannes.thumshirn@wdc.com
Changes to v7:
- Huge rewrite
v7 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1677750131.git.johannes.thumshirn@wdc.com/
Changes to v6:
- Fix degraded RAID1 mounts
- Fix RAID0/10 mounts
v6 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1676470614.git.johannes.thumshirn@wdc.com
Changes to v5:
- Incorporated review comments from Josef and Christoph
- Rebased onto misc-next
v5 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1675853489.git.johannes.thumshirn@wdc.com
Changes to v4:
- Added patch to check for RST feature in sysfs
- Added RST lookups for scrubbing
- Fixed the error handling bug Josef pointed out
- Only check if we need to write out a RST once per delayed_ref head
- Added support for zoned data DUP with RST
Changes to v3:
- Rebased onto 20221120124734.18634-1-hch@lst.de
- Incorporated Josef's review
- Merged related patches
v3 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1666007330.git.johannes.thumshirn@wdc.com
Changes to v2:
- Bug fixes
- Rebased onto 20220901074216.1849941-1-hch@lst.de
- Added tracepoints
- Added leak checker
- Added RAID0 and RAID10
v2 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1656513330.git.johannes.thumshirn@wdc.com
Changes to v1:
- Write the stripe-tree at delayed-ref time (Qu)
- Add a different write path for preallocation
v1 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1652711187.git.johannes.thumshirn@wdc.com/
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
Johannes Thumshirn (11):
btrfs: add raid stripe tree definitions
btrfs: read raid-stripe-tree from disk
btrfs: add support for inserting raid stripe extents
btrfs: delete stripe extent on extent deletion
btrfs: lookup physical address from stripe extent
btrfs: implement RST version of scrub
btrfs: zoned: allow zoned RAID
btrfs: add raid stripe tree pretty printer
btrfs: announce presence of raid-stripe-tree in sysfs
btrfs: add trace events for RST
btrfs: add raid-stripe-tree to features enabled with debug
fs/btrfs/Makefile | 2 +-
fs/btrfs/accessors.h | 10 +
fs/btrfs/bio.c | 23 ++
fs/btrfs/block-rsv.c | 6 +
fs/btrfs/disk-io.c | 18 ++
fs/btrfs/disk-io.h | 5 +
fs/btrfs/extent-tree.c | 7 +
fs/btrfs/fs.h | 4 +-
fs/btrfs/inode.c | 8 +-
fs/btrfs/locking.c | 5 +-
fs/btrfs/ordered-data.c | 1 +
fs/btrfs/ordered-data.h | 2 +
fs/btrfs/print-tree.c | 49 ++++
fs/btrfs/raid-stripe-tree.c | 493 ++++++++++++++++++++++++++++++++++++++++
fs/btrfs/raid-stripe-tree.h | 52 +++++
fs/btrfs/scrub.c | 56 +++++
fs/btrfs/sysfs.c | 3 +
fs/btrfs/volumes.c | 43 +++-
fs/btrfs/volumes.h | 15 +-
fs/btrfs/zoned.c | 113 ++++++++-
include/trace/events/btrfs.h | 75 ++++++
include/uapi/linux/btrfs.h | 1 +
include/uapi/linux/btrfs_tree.h | 33 ++-
23 files changed, 999 insertions(+), 25 deletions(-)
---
base-commit: 133da717263112d81bb95b5535ceb2c1eeddd4e7
change-id: 20230613-raid-stripe-tree-e330c9a45cc3
Best regards,
--
Johannes Thumshirn <johannes.thumshirn@wdc.com>
* [PATCH v8 01/11] btrfs: add raid stripe tree definitions
2023-09-11 12:52 [PATCH v8 00/11] btrfs: introduce RAID stripe tree Johannes Thumshirn
@ 2023-09-11 12:52 ` Johannes Thumshirn
2023-09-11 21:00 ` Damien Le Moal
2023-09-12 20:32 ` David Sterba
2023-09-11 12:52 ` [PATCH v8 02/11] btrfs: read raid-stripe-tree from disk Johannes Thumshirn
` (9 subsequent siblings)
10 siblings, 2 replies; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-11 12:52 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Johannes Thumshirn, Christoph Hellwig, Naohiro Aota, Qu Wenruo,
Damien Le Moal, linux-btrfs, linux-kernel
Add definitions for the raid stripe tree. This tree will hold information
about the on-disk layout of the stripes in a RAID set.
Each stripe extent has a 1:1 relationship with an on-disk extent item and
provides the logical to per-drive physical address translation for the
extent item in question.
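As a rough illustration of the layout these definitions describe, the
following standalone sketch (plain C stand-ins, not the UAPI types
themselves) shows how the item size of a stripe extent relates to the number
of strides it carries:

/* Illustrative only: relation between stripe extent item size and the
 * number of raid strides it holds. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct raid_stride {
        uint64_t devid;
        uint64_t physical;
        uint64_t length;
};

struct stripe_extent {
        uint8_t encoding;
        uint8_t reserved[7];
        struct raid_stride strides[];   /* flexible array of strides */
};

int main(void)
{
        /* Item size for a two-device (e.g. RAID1) stripe extent. */
        size_t nr = 2;
        size_t item_size = offsetof(struct stripe_extent, strides) +
                           nr * sizeof(struct raid_stride);

        printf("item size: %zu bytes\n", item_size);    /* 8 + 2 * 24 = 56 */

        /* And back again: recover the stride count from the item size. */
        size_t strides = (item_size - offsetof(struct stripe_extent, strides)) /
                         sizeof(struct raid_stride);
        printf("strides: %zu\n", strides);
        return 0;
}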
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/accessors.h | 10 ++++++++++
fs/btrfs/locking.c | 5 +++--
include/uapi/linux/btrfs_tree.h | 33 +++++++++++++++++++++++++++++++--
3 files changed, 44 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index f958eccff477..977ff160a024 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -306,6 +306,16 @@ BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
BTRFS_SETGET_STACK_FUNCS(stack_timespec_nsec, struct btrfs_timespec, nsec, 32);
+BTRFS_SETGET_FUNCS(stripe_extent_encoding, struct btrfs_stripe_extent, encoding, 8);
+BTRFS_SETGET_FUNCS(raid_stride_devid, struct btrfs_raid_stride, devid, 64);
+BTRFS_SETGET_FUNCS(raid_stride_physical, struct btrfs_raid_stride, physical, 64);
+BTRFS_SETGET_FUNCS(raid_stride_length, struct btrfs_raid_stride, length, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_stripe_extent_encoding,
+ struct btrfs_stripe_extent, encoding, 8);
+BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_devid, struct btrfs_raid_stride, devid, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_physical, struct btrfs_raid_stride, physical, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_length, struct btrfs_raid_stride, length, 64);
+
/* struct btrfs_dev_extent */
BTRFS_SETGET_FUNCS(dev_extent_chunk_tree, struct btrfs_dev_extent, chunk_tree, 64);
BTRFS_SETGET_FUNCS(dev_extent_chunk_objectid, struct btrfs_dev_extent,
diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index 6ac4fd8cc8dc..e7760d40feab 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -58,8 +58,8 @@
static struct btrfs_lockdep_keyset {
u64 id; /* root objectid */
- /* Longest entry: btrfs-block-group-00 */
- char names[BTRFS_MAX_LEVEL][24];
+ /* Longest entry: btrfs-raid-stripe-tree-00 */
+ char names[BTRFS_MAX_LEVEL][25];
struct lock_class_key keys[BTRFS_MAX_LEVEL];
} btrfs_lockdep_keysets[] = {
{ .id = BTRFS_ROOT_TREE_OBJECTID, DEFINE_NAME("root") },
@@ -74,6 +74,7 @@ static struct btrfs_lockdep_keyset {
{ .id = BTRFS_UUID_TREE_OBJECTID, DEFINE_NAME("uuid") },
{ .id = BTRFS_FREE_SPACE_TREE_OBJECTID, DEFINE_NAME("free-space") },
{ .id = BTRFS_BLOCK_GROUP_TREE_OBJECTID, DEFINE_NAME("block-group") },
+ { .id = BTRFS_RAID_STRIPE_TREE_OBJECTID,DEFINE_NAME("raid-stripe-tree") },
{ .id = 0, DEFINE_NAME("tree") },
};
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index fc3c32186d7e..3fb758ce3ac0 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -4,9 +4,8 @@
#include <linux/btrfs.h>
#include <linux/types.h>
-#ifdef __KERNEL__
#include <linux/stddef.h>
-#else
+#ifndef __KERNEL__
#include <stddef.h>
#endif
@@ -73,6 +72,9 @@
/* Holds the block group items for extent tree v2. */
#define BTRFS_BLOCK_GROUP_TREE_OBJECTID 11ULL
+/* tracks RAID stripes in block groups. */
+#define BTRFS_RAID_STRIPE_TREE_OBJECTID 12ULL
+
/* device stats in the device tree */
#define BTRFS_DEV_STATS_OBJECTID 0ULL
@@ -285,6 +287,8 @@
*/
#define BTRFS_QGROUP_RELATION_KEY 246
+#define BTRFS_RAID_STRIPE_KEY 247
+
/*
* Obsolete name, see BTRFS_TEMPORARY_ITEM_KEY.
*/
@@ -719,6 +723,31 @@ struct btrfs_free_space_header {
__le64 num_bitmaps;
} __attribute__ ((__packed__));
+struct btrfs_raid_stride {
+ /* btrfs device-id this raid extent lives on */
+ __le64 devid;
+ /* physical location on disk */
+ __le64 physical;
+ /* length of stride on this disk */
+ __le64 length;
+};
+
+#define BTRFS_STRIPE_DUP 0
+#define BTRFS_STRIPE_RAID0 1
+#define BTRFS_STRIPE_RAID1 2
+#define BTRFS_STRIPE_RAID1C3 3
+#define BTRFS_STRIPE_RAID1C4 4
+#define BTRFS_STRIPE_RAID5 5
+#define BTRFS_STRIPE_RAID6 6
+#define BTRFS_STRIPE_RAID10 7
+
+struct btrfs_stripe_extent {
+ __u8 encoding;
+ __u8 reserved[7];
+ /* array of raid strides this stripe is composed of */
+ __DECLARE_FLEX_ARRAY(struct btrfs_raid_stride, strides);
+};
+
#define BTRFS_HEADER_FLAG_WRITTEN (1ULL << 0)
#define BTRFS_HEADER_FLAG_RELOC (1ULL << 1)
--
2.41.0
* [PATCH v8 02/11] btrfs: read raid-stripe-tree from disk
2023-09-11 12:52 [PATCH v8 00/11] btrfs: introduce RAID stripe tree Johannes Thumshirn
2023-09-11 12:52 ` [PATCH v8 01/11] btrfs: add raid stripe tree definitions Johannes Thumshirn
@ 2023-09-11 12:52 ` Johannes Thumshirn
2023-09-14 9:27 ` Qu Wenruo
2023-09-11 12:52 ` [PATCH v8 03/11] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
` (8 subsequent siblings)
10 siblings, 1 reply; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-11 12:52 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Johannes Thumshirn, Christoph Hellwig, Naohiro Aota, Qu Wenruo,
Damien Le Moal, linux-btrfs, linux-kernel, Anand Jain
If we find a raid-stripe-tree on mount, read it from disk.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/block-rsv.c | 6 ++++++
fs/btrfs/disk-io.c | 18 ++++++++++++++++++
fs/btrfs/disk-io.h | 5 +++++
fs/btrfs/fs.h | 1 +
include/uapi/linux/btrfs.h | 1 +
5 files changed, 31 insertions(+)
diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
index 77684c5e0c8b..4e55e5f30f7f 100644
--- a/fs/btrfs/block-rsv.c
+++ b/fs/btrfs/block-rsv.c
@@ -354,6 +354,11 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
min_items++;
}
+ if (btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE)) {
+ num_bytes += btrfs_root_used(&fs_info->stripe_root->root_item);
+ min_items++;
+ }
+
/*
* But we also want to reserve enough space so we can do the fallback
* global reserve for an unlink, which is an additional
@@ -405,6 +410,7 @@ void btrfs_init_root_block_rsv(struct btrfs_root *root)
case BTRFS_EXTENT_TREE_OBJECTID:
case BTRFS_FREE_SPACE_TREE_OBJECTID:
case BTRFS_BLOCK_GROUP_TREE_OBJECTID:
+ case BTRFS_RAID_STRIPE_TREE_OBJECTID:
root->block_rsv = &fs_info->delayed_refs_rsv;
break;
case BTRFS_ROOT_TREE_OBJECTID:
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4c5d71065ea8..1ecebcfc1c17 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1179,6 +1179,8 @@ static struct btrfs_root *btrfs_get_global_root(struct btrfs_fs_info *fs_info,
return btrfs_grab_root(fs_info->block_group_root);
case BTRFS_FREE_SPACE_TREE_OBJECTID:
return btrfs_grab_root(btrfs_global_root(fs_info, &key));
+ case BTRFS_RAID_STRIPE_TREE_OBJECTID:
+ return btrfs_grab_root(fs_info->stripe_root);
default:
return NULL;
}
@@ -1259,6 +1261,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
btrfs_put_root(fs_info->fs_root);
btrfs_put_root(fs_info->data_reloc_root);
btrfs_put_root(fs_info->block_group_root);
+ btrfs_put_root(fs_info->stripe_root);
btrfs_check_leaked_roots(fs_info);
btrfs_extent_buffer_leak_debug_check(fs_info);
kfree(fs_info->super_copy);
@@ -1804,6 +1807,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, bool free_chunk_root)
free_root_extent_buffers(info->fs_root);
free_root_extent_buffers(info->data_reloc_root);
free_root_extent_buffers(info->block_group_root);
+ free_root_extent_buffers(info->stripe_root);
if (free_chunk_root)
free_root_extent_buffers(info->chunk_root);
}
@@ -2280,6 +2284,20 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
fs_info->uuid_root = root;
}
+ if (btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE)) {
+ location.objectid = BTRFS_RAID_STRIPE_TREE_OBJECTID;
+ root = btrfs_read_tree_root(tree_root, &location);
+ if (IS_ERR(root)) {
+ if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) {
+ ret = PTR_ERR(root);
+ goto out;
+ }
+ } else {
+ set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+ fs_info->stripe_root = root;
+ }
+ }
+
return 0;
out:
btrfs_warn(fs_info, "failed to read root (objectid=%llu): %d",
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 02b645744a82..8b7f01a01c44 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -103,6 +103,11 @@ static inline struct btrfs_root *btrfs_grab_root(struct btrfs_root *root)
return NULL;
}
+static inline struct btrfs_root *btrfs_stripe_tree_root(struct btrfs_fs_info *fs_info)
+{
+ return fs_info->stripe_root;
+}
+
void btrfs_put_root(struct btrfs_root *root);
void btrfs_mark_buffer_dirty(struct extent_buffer *buf);
int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index d84a390336fc..5c7778e8b5ed 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -367,6 +367,7 @@ struct btrfs_fs_info {
struct btrfs_root *uuid_root;
struct btrfs_root *data_reloc_root;
struct btrfs_root *block_group_root;
+ struct btrfs_root *stripe_root;
/* The log root tree is a directory of all the other log roots */
struct btrfs_root *log_root_tree;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dbb8b96da50d..b9a1d9af8ae8 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -333,6 +333,7 @@ struct btrfs_ioctl_fs_info_args {
#define BTRFS_FEATURE_INCOMPAT_RAID1C34 (1ULL << 11)
#define BTRFS_FEATURE_INCOMPAT_ZONED (1ULL << 12)
#define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2 (1ULL << 13)
+#define BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE (1ULL << 14)
struct btrfs_ioctl_feature_flags {
__u64 compat_flags;
--
2.41.0
* [PATCH v8 03/11] btrfs: add support for inserting raid stripe extents
2023-09-11 12:52 [PATCH v8 00/11] btrfs: introduce RAID stripe tree Johannes Thumshirn
2023-09-11 12:52 ` [PATCH v8 01/11] btrfs: add raid stripe tree definitions Johannes Thumshirn
2023-09-11 12:52 ` [PATCH v8 02/11] btrfs: read raid-stripe-tree from disk Johannes Thumshirn
@ 2023-09-11 12:52 ` Johannes Thumshirn
2023-09-13 16:50 ` David Sterba
` (2 more replies)
2023-09-11 12:52 ` [PATCH v8 04/11] btrfs: delete stripe extent on extent deletion Johannes Thumshirn
` (7 subsequent siblings)
10 siblings, 3 replies; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-11 12:52 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Johannes Thumshirn, Christoph Hellwig, Naohiro Aota, Qu Wenruo,
Damien Le Moal, linux-btrfs, linux-kernel
Add support for inserting stripe extents into the raid stripe tree on
completion of every write that needs an extra logical-to-physical
translation when using RAID.
Inserting the stripe extents happens after the data I/O has completed;
this is done to a) support zone append and b) rule out the possibility of
a RAID write hole.
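A rough userspace sketch of the write-side flow described above follows
(hypothetical helper names, not the kernel code): once all stripes of an
ordered extent have completed, one BTRFS_RAID_STRIPE_KEY item is built from
the per-device physical locations reported at completion time:

/* Illustrative only: with zone append the on-disk location is only known
 * at I/O completion, so the stripe extent is built after the data write
 * has finished. */
#include <stdint.h>
#include <stdio.h>

struct completed_stripe { uint64_t devid; uint64_t physical; };

struct completed_write {
        uint64_t logical;       /* becomes key.objectid */
        uint64_t size;          /* becomes key.offset */
        int nr_stripes;
        struct completed_stripe stripes[2];
};

/* Stand-in for inserting one BTRFS_RAID_STRIPE_KEY item. */
static void insert_raid_extent(const struct completed_write *w)
{
        printf("item (%llu RAID_STRIPE_KEY %llu)\n",
               (unsigned long long)w->logical, (unsigned long long)w->size);
        for (int i = 0; i < w->nr_stripes; i++)
                printf("  stride %d devid %llu physical %llu\n", i,
                       (unsigned long long)w->stripes[i].devid,
                       (unsigned long long)w->stripes[i].physical);
}

int main(void)
{
        struct completed_write w = {
                .logical = 1073741824, .size = 131072, .nr_stripes = 2,
                .stripes = { { 1, 805306368 }, { 2, 536870912 } },
        };

        insert_raid_extent(&w);
        return 0;
}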
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/Makefile | 2 +-
fs/btrfs/bio.c | 23 ++++
fs/btrfs/extent-tree.c | 1 +
fs/btrfs/inode.c | 8 +-
fs/btrfs/ordered-data.c | 1 +
fs/btrfs/ordered-data.h | 2 +
fs/btrfs/raid-stripe-tree.c | 266 ++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/raid-stripe-tree.h | 34 ++++++
fs/btrfs/volumes.c | 4 +-
fs/btrfs/volumes.h | 15 ++-
10 files changed, 347 insertions(+), 9 deletions(-)
diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index c57d80729d4f..525af975f61c 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -33,7 +33,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
subpage.o tree-mod-log.o extent-io-tree.o fs.o messages.o bio.o \
- lru_cache.o
+ lru_cache.o raid-stripe-tree.o
btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 31ff36990404..ddbe6f8d4ea2 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -14,6 +14,7 @@
#include "rcu-string.h"
#include "zoned.h"
#include "file-item.h"
+#include "raid-stripe-tree.h"
static struct bio_set btrfs_bioset;
static struct bio_set btrfs_clone_bioset;
@@ -415,6 +416,9 @@ static void btrfs_orig_write_end_io(struct bio *bio)
else
bio->bi_status = BLK_STS_OK;
+ if (bio_op(bio) == REQ_OP_ZONE_APPEND && !bio->bi_status)
+ stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
+
btrfs_orig_bbio_end_io(bbio);
btrfs_put_bioc(bioc);
}
@@ -426,6 +430,8 @@ static void btrfs_clone_write_end_io(struct bio *bio)
if (bio->bi_status) {
atomic_inc(&stripe->bioc->error);
btrfs_log_dev_io_error(bio, stripe->dev);
+ } else if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+ stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
}
/* Pass on control to the original bio this one was cloned from */
@@ -487,6 +493,7 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
bio->bi_private = &bioc->stripes[dev_nr];
bio->bi_iter.bi_sector = bioc->stripes[dev_nr].physical >> SECTOR_SHIFT;
bioc->stripes[dev_nr].bioc = bioc;
+ bioc->size = bio->bi_iter.bi_size;
btrfs_submit_dev_bio(bioc->stripes[dev_nr].dev, bio);
}
@@ -496,6 +503,8 @@ static void __btrfs_submit_bio(struct bio *bio, struct btrfs_io_context *bioc,
if (!bioc) {
/* Single mirror read/write fast path. */
btrfs_bio(bio)->mirror_num = mirror_num;
+ if (bio_op(bio) != REQ_OP_READ)
+ btrfs_bio(bio)->orig_physical = smap->physical;
bio->bi_iter.bi_sector = smap->physical >> SECTOR_SHIFT;
if (bio_op(bio) != REQ_OP_READ)
btrfs_bio(bio)->orig_physical = smap->physical;
@@ -688,6 +697,20 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
bio->bi_opf |= REQ_OP_ZONE_APPEND;
}
+ if (is_data_bbio(bbio) && bioc &&
+ btrfs_need_stripe_tree_update(bioc->fs_info,
+ bioc->map_type)) {
+ /*
+ * No locking for the list update, as we only add to
+ * the list in the I/O submission path, and list
+ * iteration only happens in the completion path,
+ * which can't happen until after the last submission.
+ */
+ btrfs_get_bioc(bioc);
+ list_add_tail(&bioc->ordered_entry,
+ &bbio->ordered->bioc_list);
+ }
+
/*
* Csum items for reloc roots have already been cloned at this
* point, so they are handled as part of the no-checksum case.
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6f6838226fe7..2e11a699ab77 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -42,6 +42,7 @@
#include "file-item.h"
#include "orphan.h"
#include "tree-checker.h"
+#include "raid-stripe-tree.h"
#undef SCRAMBLE_DELAYED_REFS
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bafca05940d7..6f71630248da 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -71,6 +71,7 @@
#include "super.h"
#include "orphan.h"
#include "backref.h"
+#include "raid-stripe-tree.h"
struct btrfs_iget_args {
u64 ino;
@@ -3091,6 +3092,10 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
trans->block_rsv = &inode->block_rsv;
+ ret = btrfs_insert_raid_extent(trans, ordered_extent);
+ if (ret)
+ goto out;
+
if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered_extent->flags))
compress_type = ordered_extent->compress_type;
if (test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags)) {
@@ -3224,7 +3229,8 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered)
{
if (btrfs_is_zoned(btrfs_sb(ordered->inode->i_sb)) &&
- !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags))
+ !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags) &&
+ list_empty(&ordered->bioc_list))
btrfs_finish_ordered_zoned(ordered);
return btrfs_finish_one_ordered(ordered);
}
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 345c449d588c..55c7d5543265 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -191,6 +191,7 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
INIT_LIST_HEAD(&entry->log_list);
INIT_LIST_HEAD(&entry->root_extent_list);
INIT_LIST_HEAD(&entry->work_list);
+ INIT_LIST_HEAD(&entry->bioc_list);
init_completion(&entry->completion);
/*
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 173bd5c5df26..1c51ac57e5df 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -151,6 +151,8 @@ struct btrfs_ordered_extent {
struct completion completion;
struct btrfs_work flush_work;
struct list_head work_list;
+
+ struct list_head bioc_list;
};
static inline void
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
new file mode 100644
index 000000000000..2415698a8fef
--- /dev/null
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -0,0 +1,266 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2023 Western Digital Corporation or its affiliates.
+ */
+
+#include <linux/btrfs_tree.h>
+
+#include "ctree.h"
+#include "fs.h"
+#include "accessors.h"
+#include "transaction.h"
+#include "disk-io.h"
+#include "raid-stripe-tree.h"
+#include "volumes.h"
+#include "misc.h"
+#include "print-tree.h"
+
+static u8 btrfs_bg_type_to_raid_encoding(u64 map_type)
+{
+ switch (map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
+ case BTRFS_BLOCK_GROUP_DUP:
+ return BTRFS_STRIPE_DUP;
+ case BTRFS_BLOCK_GROUP_RAID0:
+ return BTRFS_STRIPE_RAID0;
+ case BTRFS_BLOCK_GROUP_RAID1:
+ return BTRFS_STRIPE_RAID1;
+ case BTRFS_BLOCK_GROUP_RAID1C3:
+ return BTRFS_STRIPE_RAID1C3;
+ case BTRFS_BLOCK_GROUP_RAID1C4:
+ return BTRFS_STRIPE_RAID1C4;
+ case BTRFS_BLOCK_GROUP_RAID5:
+ return BTRFS_STRIPE_RAID5;
+ case BTRFS_BLOCK_GROUP_RAID6:
+ return BTRFS_STRIPE_RAID6;
+ case BTRFS_BLOCK_GROUP_RAID10:
+ return BTRFS_STRIPE_RAID10;
+ default:
+ ASSERT(0);
+ }
+}
+
+static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
+ int num_stripes,
+ struct btrfs_io_context *bioc)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct btrfs_key stripe_key;
+ struct btrfs_root *stripe_root = btrfs_stripe_tree_root(fs_info);
+ u8 encoding = btrfs_bg_type_to_raid_encoding(bioc->map_type);
+ struct btrfs_stripe_extent *stripe_extent;
+ size_t item_size;
+ int ret;
+
+ item_size = struct_size(stripe_extent, strides, num_stripes);
+
+ stripe_extent = kzalloc(item_size, GFP_NOFS);
+ if (!stripe_extent) {
+ btrfs_abort_transaction(trans, -ENOMEM);
+ btrfs_end_transaction(trans);
+ return -ENOMEM;
+ }
+
+ btrfs_set_stack_stripe_extent_encoding(stripe_extent, encoding);
+ for (int i = 0; i < num_stripes; i++) {
+ u64 devid = bioc->stripes[i].dev->devid;
+ u64 physical = bioc->stripes[i].physical;
+ u64 length = bioc->stripes[i].length;
+ struct btrfs_raid_stride *raid_stride =
+ &stripe_extent->strides[i];
+
+ if (length == 0)
+ length = bioc->size;
+
+ btrfs_set_stack_raid_stride_devid(raid_stride, devid);
+ btrfs_set_stack_raid_stride_physical(raid_stride, physical);
+ btrfs_set_stack_raid_stride_length(raid_stride, length);
+ }
+
+ stripe_key.objectid = bioc->logical;
+ stripe_key.type = BTRFS_RAID_STRIPE_KEY;
+ stripe_key.offset = bioc->size;
+
+ ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
+ item_size);
+ if (ret)
+ btrfs_abort_transaction(trans, ret);
+
+ kfree(stripe_extent);
+
+ return ret;
+}
+
+static int btrfs_insert_mirrored_raid_extents(struct btrfs_trans_handle *trans,
+ struct btrfs_ordered_extent *ordered,
+ u64 map_type)
+{
+ int num_stripes = btrfs_bg_type_to_factor(map_type);
+ struct btrfs_io_context *bioc;
+ int ret;
+
+ list_for_each_entry(bioc, &ordered->bioc_list, ordered_entry) {
+ ret = btrfs_insert_one_raid_extent(trans, num_stripes, bioc);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+static int btrfs_insert_striped_mirrored_raid_extents(
+ struct btrfs_trans_handle *trans,
+ struct btrfs_ordered_extent *ordered,
+ u64 map_type)
+{
+ struct btrfs_io_context *bioc;
+ struct btrfs_io_context *rbioc;
+ const int nstripes = list_count_nodes(&ordered->bioc_list);
+ const int index = btrfs_bg_flags_to_raid_index(map_type);
+ const int substripes = btrfs_raid_array[index].sub_stripes;
+ const int max_stripes = trans->fs_info->fs_devices->rw_devices / 2;
+ int left = nstripes;
+ int stripe = 0, j = 0;
+ int i = 0;
+ int ret = 0;
+ u64 stripe_end;
+ u64 prev_end;
+
+ if (nstripes == 1)
+ return btrfs_insert_mirrored_raid_extents(trans, ordered, map_type);
+
+ rbioc = kzalloc(struct_size(rbioc, stripes, nstripes * substripes),
+ GFP_KERNEL);
+ if (!rbioc)
+ return -ENOMEM;
+
+ rbioc->map_type = map_type;
+ rbioc->logical = list_first_entry(&ordered->bioc_list, typeof(*rbioc),
+ ordered_entry)->logical;
+
+ stripe_end = rbioc->logical;
+ prev_end = stripe_end;
+ list_for_each_entry(bioc, &ordered->bioc_list, ordered_entry) {
+
+ rbioc->size += bioc->size;
+ for (j = 0; j < substripes; j++) {
+ stripe = i + j;
+ rbioc->stripes[stripe].dev = bioc->stripes[j].dev;
+ rbioc->stripes[stripe].physical = bioc->stripes[j].physical;
+ rbioc->stripes[stripe].length = bioc->size;
+ }
+
+ stripe_end += rbioc->size;
+ if (i >= nstripes ||
+ (stripe_end - prev_end >= max_stripes * BTRFS_STRIPE_LEN)) {
+ ret = btrfs_insert_one_raid_extent(trans,
+ nstripes * substripes,
+ rbioc);
+ if (ret)
+ goto out;
+
+ left -= nstripes;
+ i = 0;
+ rbioc->logical += rbioc->size;
+ rbioc->size = 0;
+ } else {
+ i += substripes;
+ prev_end = stripe_end;
+ }
+ }
+
+ if (left) {
+ bioc = list_prev_entry(bioc, ordered_entry);
+ ret = btrfs_insert_one_raid_extent(trans, substripes, bioc);
+ }
+
+out:
+ kfree(rbioc);
+ return ret;
+}
+
+static int btrfs_insert_striped_raid_extents(struct btrfs_trans_handle *trans,
+ struct btrfs_ordered_extent *ordered,
+ u64 map_type)
+{
+ struct btrfs_io_context *bioc;
+ struct btrfs_io_context *rbioc;
+ const int nstripes = list_count_nodes(&ordered->bioc_list);
+ int i = 0;
+ int ret = 0;
+
+ rbioc = kzalloc(struct_size(rbioc, stripes, nstripes), GFP_KERNEL);
+ if (!rbioc)
+ return -ENOMEM;
+ rbioc->map_type = map_type;
+ rbioc->logical = list_first_entry(&ordered->bioc_list, typeof(*rbioc),
+ ordered_entry)->logical;
+
+ list_for_each_entry(bioc, &ordered->bioc_list, ordered_entry) {
+ rbioc->size += bioc->size;
+ rbioc->stripes[i].dev = bioc->stripes[0].dev;
+ rbioc->stripes[i].physical = bioc->stripes[0].physical;
+ rbioc->stripes[i].length = bioc->size;
+
+ if (i == nstripes - 1) {
+ ret = btrfs_insert_one_raid_extent(trans, nstripes, rbioc);
+ if (ret)
+ goto out;
+
+ i = 0;
+ rbioc->logical += rbioc->size;
+ rbioc->size = 0;
+ } else {
+ i++;
+ }
+ }
+
+ if (i && i < nstripes - 1)
+ ret = btrfs_insert_one_raid_extent(trans, i, rbioc);
+
+out:
+ kfree(rbioc);
+ return ret;
+}
+
+int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
+ struct btrfs_ordered_extent *ordered_extent)
+{
+ struct btrfs_io_context *bioc;
+ u64 map_type;
+ int ret;
+
+ if (!trans->fs_info->stripe_root)
+ return 0;
+
+ map_type = list_first_entry(&ordered_extent->bioc_list, typeof(*bioc),
+ ordered_entry)->map_type;
+
+ switch (map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
+ case BTRFS_BLOCK_GROUP_DUP:
+ case BTRFS_BLOCK_GROUP_RAID1:
+ case BTRFS_BLOCK_GROUP_RAID1C3:
+ case BTRFS_BLOCK_GROUP_RAID1C4:
+ ret = btrfs_insert_mirrored_raid_extents(trans, ordered_extent,
+ map_type);
+ break;
+ case BTRFS_BLOCK_GROUP_RAID0:
+ ret = btrfs_insert_striped_raid_extents(trans, ordered_extent,
+ map_type);
+ break;
+ case BTRFS_BLOCK_GROUP_RAID10:
+ ret = btrfs_insert_striped_mirrored_raid_extents(trans, ordered_extent, map_type);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ while (!list_empty(&ordered_extent->bioc_list)) {
+ bioc = list_first_entry(&ordered_extent->bioc_list,
+ typeof(*bioc), ordered_entry);
+ list_del(&bioc->ordered_entry);
+ btrfs_put_bioc(bioc);
+ }
+
+ return ret;
+}
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
new file mode 100644
index 000000000000..f36e4c2d46b0
--- /dev/null
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2023 Western Digital Corporation or its affiliates.
+ */
+
+#ifndef BTRFS_RAID_STRIPE_TREE_H
+#define BTRFS_RAID_STRIPE_TREE_H
+
+#include "disk-io.h"
+
+struct btrfs_io_context;
+struct btrfs_io_stripe;
+
+int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
+ struct btrfs_ordered_extent *ordered_extent);
+
+static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
+ u64 map_type)
+{
+ u64 type = map_type & BTRFS_BLOCK_GROUP_TYPE_MASK;
+ u64 profile = map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
+
+ if (!btrfs_stripe_tree_root(fs_info))
+ return false;
+
+ if (type != BTRFS_BLOCK_GROUP_DATA)
+ return false;
+
+ if (profile & BTRFS_BLOCK_GROUP_RAID1_MASK)
+ return true;
+
+ return false;
+}
+#endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 871a55d36e32..0c0fd4eb4848 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5881,6 +5881,7 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
}
static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
+ u64 logical,
u16 total_stripes)
{
struct btrfs_io_context *bioc;
@@ -5900,6 +5901,7 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
bioc->fs_info = fs_info;
bioc->replace_stripe_src = -1;
bioc->full_stripe_logical = (u64)-1;
+ bioc->logical = logical;
return bioc;
}
@@ -6434,7 +6436,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
goto out;
}
- bioc = alloc_btrfs_io_context(fs_info, num_alloc_stripes);
+ bioc = alloc_btrfs_io_context(fs_info, logical, num_alloc_stripes);
if (!bioc) {
ret = -ENOMEM;
goto out;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 576bfcb5b764..8604bfbbf510 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -390,12 +390,11 @@ struct btrfs_fs_devices {
struct btrfs_io_stripe {
struct btrfs_device *dev;
- union {
- /* Block mapping */
- u64 physical;
- /* For the endio handler */
- struct btrfs_io_context *bioc;
- };
+ /* Block mapping */
+ u64 physical;
+ u64 length;
+ /* For the endio handler */
+ struct btrfs_io_context *bioc;
};
struct btrfs_discard_stripe {
@@ -428,6 +427,10 @@ struct btrfs_io_context {
atomic_t error;
u16 max_errors;
+ u64 logical;
+ u64 size;
+ struct list_head ordered_entry;
+
/*
* The total number of stripes, including the extra duplicated
* stripe for replace.
--
2.41.0
* [PATCH v8 04/11] btrfs: delete stripe extent on extent deletion
2023-09-11 12:52 [PATCH v8 00/11] btrfs: introduce RAID stripe tree Johannes Thumshirn
` (2 preceding siblings ...)
2023-09-11 12:52 ` [PATCH v8 03/11] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
@ 2023-09-11 12:52 ` Johannes Thumshirn
2023-09-11 12:52 ` [PATCH v8 05/11] btrfs: lookup physical address from stripe extent Johannes Thumshirn
` (6 subsequent siblings)
10 siblings, 0 replies; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-11 12:52 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Johannes Thumshirn, Christoph Hellwig, Naohiro Aota, Qu Wenruo,
Damien Le Moal, linux-btrfs, linux-kernel
As each stripe extent is tied to an extent item, delete the stripe extent
once the corresponding extent item is deleted.
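For illustration only, a simplified in-memory model of that deletion rule
(the patch itself walks the stripe tree via btrfs_search_slot() and
btrfs_del_item()):

/* Every stripe extent fully contained in the freed range
 * [start, start + length) is removed; nothing else is touched. */
#include <stdint.h>
#include <stdio.h>

struct entry { uint64_t logical; uint64_t length; int live; };

static void delete_raid_extent(struct entry *tbl, int n,
                               uint64_t start, uint64_t length)
{
        uint64_t end = start + length;

        for (int i = 0; i < n; i++) {
                uint64_t e_end = tbl[i].logical + tbl[i].length;

                if (tbl[i].logical >= start && e_end <= end)
                        tbl[i].live = 0;
        }
}

int main(void)
{
        struct entry tbl[] = { { 0, 65536, 1 }, { 65536, 65536, 1 } };

        /* Free the extent at 64K: only the second stripe extent goes away. */
        delete_raid_extent(tbl, 2, 65536, 65536);
        for (int i = 0; i < 2; i++)
                printf("entry %d live=%d\n", i, tbl[i].live);
        return 0;
}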
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/extent-tree.c | 6 +++++
fs/btrfs/raid-stripe-tree.c | 60 +++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/raid-stripe-tree.h | 2 ++
3 files changed, 68 insertions(+)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2e11a699ab77..c64dd3fd4463 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2857,6 +2857,12 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
btrfs_abort_transaction(trans, ret);
return ret;
}
+
+ ret = btrfs_delete_raid_extent(trans, bytenr, num_bytes);
+ if (ret) {
+ btrfs_abort_transaction(trans, ret);
+ return ret;
+ }
}
ret = add_to_free_space_tree(trans, bytenr, num_bytes);
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 2415698a8fef..5b12f40877b5 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -15,6 +15,66 @@
#include "misc.h"
#include "print-tree.h"
+int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
+ u64 length)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct btrfs_root *stripe_root = btrfs_stripe_tree_root(fs_info);
+ struct btrfs_path *path;
+ struct btrfs_key key;
+ struct extent_buffer *leaf;
+ u64 found_start;
+ u64 found_end;
+ u64 end = start + length;
+ int slot;
+ int ret;
+
+ if (!stripe_root)
+ return 0;
+
+ path = btrfs_alloc_path();
+ if (!path)
+ return -ENOMEM;
+
+ while (1) {
+
+ key.objectid = start;
+ key.type = BTRFS_RAID_STRIPE_KEY;
+ key.offset = length;
+
+ ret = btrfs_search_slot(trans, stripe_root, &key, path, -1, 1);
+ if (ret < 0)
+ break;
+ if (ret > 0) {
+ ret = 0;
+ if (path->slots[0] == 0)
+ break;
+ path->slots[0]--;
+ }
+
+ leaf = path->nodes[0];
+ slot = path->slots[0];
+ btrfs_item_key_to_cpu(leaf, &key, slot);
+ found_start = key.objectid;
+ found_end = found_start + key.offset;
+
+ /* That stripe ends before we start, we're done */
+ if (found_end <= start)
+ break;
+
+ ASSERT(found_start >= start && found_end <= end);
+ ret = btrfs_del_item(trans, stripe_root, path);
+ if (ret)
+ break;
+
+ btrfs_release_path(path);
+ }
+
+ btrfs_free_path(path);
+ return ret;
+
+}
+
static u8 btrfs_bg_type_to_raid_encoding(u64 map_type)
{
switch (map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index f36e4c2d46b0..7560dc501a65 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -11,6 +11,8 @@
struct btrfs_io_context;
struct btrfs_io_stripe;
+int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
+ u64 length);
int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
struct btrfs_ordered_extent *ordered_extent);
--
2.41.0
* [PATCH v8 05/11] btrfs: lookup physical address from stripe extent
2023-09-11 12:52 [PATCH v8 00/11] btrfs: introduce RAID stripe tree Johannes Thumshirn
` (3 preceding siblings ...)
2023-09-11 12:52 ` [PATCH v8 04/11] btrfs: delete stripe extent on extent deletion Johannes Thumshirn
@ 2023-09-11 12:52 ` Johannes Thumshirn
2023-09-14 9:18 ` Qu Wenruo
2023-09-11 12:52 ` [PATCH v8 06/11] btrfs: implement RST version of scrub Johannes Thumshirn
` (5 subsequent siblings)
10 siblings, 1 reply; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-11 12:52 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Johannes Thumshirn, Christoph Hellwig, Naohiro Aota, Qu Wenruo,
Damien Le Moal, linux-btrfs, linux-kernel
Look up the physical address from the raid stripe tree when a read on a
RAID volume formatted with the raid stripe tree is attempted.
If the requested logical address is not found in the stripe tree, it may
still be in the in-memory ordered stripe tree, so fall back to searching
the ordered stripe tree in this case.
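One detail worth calling out is the length clamping for a logically
contiguous range that is not physically contiguous; a small illustrative
sketch of that calculation (userspace stand-in, not the kernel function):

/* If the requested range crosses the end of the stripe extent that was
 * found, shorten *length so the caller can split the bio at that point. */
#include <stdint.h>
#include <stdio.h>

static void clamp_to_stripe(uint64_t logical, uint64_t *length,
                            uint64_t found_logical, uint64_t found_length)
{
        uint64_t end = logical + *length;
        uint64_t found_end = found_logical + found_length;

        if (end > found_end)
                *length -= end - found_end;
}

int main(void)
{
        /* A 192K read starting 64K into a 128K stripe extent: only the
         * remaining 64K can be mapped by this entry. */
        uint64_t length = 196608;

        clamp_to_stripe(1073807360, &length, 1073741824, 131072);
        printf("mappable length: %llu\n", (unsigned long long)length);
        return 0;
}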
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/raid-stripe-tree.c | 159 ++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/raid-stripe-tree.h | 11 +++
fs/btrfs/volumes.c | 37 ++++++++---
3 files changed, 198 insertions(+), 9 deletions(-)
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 5b12f40877b5..7ed02e4b79ec 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -324,3 +324,162 @@ int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
return ret;
}
+
+static bool btrfs_check_for_extent(struct btrfs_fs_info *fs_info, u64 logical,
+ u64 length, struct btrfs_path *path)
+{
+ struct btrfs_root *extent_root = btrfs_extent_root(fs_info, logical);
+ struct btrfs_key key;
+ int ret;
+
+ btrfs_release_path(path);
+
+ key.objectid = logical;
+ key.type = BTRFS_EXTENT_ITEM_KEY;
+ key.offset = length;
+
+ ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
+
+ return ret;
+}
+
+int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
+ u64 logical, u64 *length, u64 map_type,
+ u32 stripe_index,
+ struct btrfs_io_stripe *stripe)
+{
+ struct btrfs_root *stripe_root = btrfs_stripe_tree_root(fs_info);
+ struct btrfs_stripe_extent *stripe_extent;
+ struct btrfs_key stripe_key;
+ struct btrfs_key found_key;
+ struct btrfs_path *path;
+ struct extent_buffer *leaf;
+ int num_stripes;
+ u8 encoding;
+ u64 offset;
+ u64 found_logical;
+ u64 found_length;
+ u64 end;
+ u64 found_end;
+ int slot;
+ int ret;
+ int i;
+
+ stripe_key.objectid = logical;
+ stripe_key.type = BTRFS_RAID_STRIPE_KEY;
+ stripe_key.offset = 0;
+
+ path = btrfs_alloc_path();
+ if (!path)
+ return -ENOMEM;
+
+ ret = btrfs_search_slot(NULL, stripe_root, &stripe_key, path, 0, 0);
+ if (ret < 0)
+ goto free_path;
+ if (ret) {
+ if (path->slots[0] != 0)
+ path->slots[0]--;
+ }
+
+ end = logical + *length;
+
+ while (1) {
+ leaf = path->nodes[0];
+ slot = path->slots[0];
+
+ btrfs_item_key_to_cpu(leaf, &found_key, slot);
+ found_logical = found_key.objectid;
+ found_length = found_key.offset;
+ found_end = found_logical + found_length;
+
+ if (found_logical > end) {
+ ret = -ENOENT;
+ goto out;
+ }
+
+ if (in_range(logical, found_logical, found_length))
+ break;
+
+ ret = btrfs_next_item(stripe_root, path);
+ if (ret)
+ goto out;
+ }
+
+ offset = logical - found_logical;
+
+ /*
+ * If we have a logically contiguous, but physically noncontinuous
+ * range, we need to split the bio. Record the length after which we
+ * must split the bio.
+ */
+ if (end > found_end)
+ *length -= end - found_end;
+
+ num_stripes = btrfs_num_raid_stripes(btrfs_item_size(leaf, slot));
+ stripe_extent = btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
+ encoding = btrfs_stripe_extent_encoding(leaf, stripe_extent);
+
+ if (encoding != btrfs_bg_type_to_raid_encoding(map_type)) {
+ ret = -ENOENT;
+ goto out;
+ }
+
+ for (i = 0; i < num_stripes; i++) {
+ struct btrfs_raid_stride *stride = &stripe_extent->strides[i];
+ u64 devid = btrfs_raid_stride_devid(leaf, stride);
+ u64 len = btrfs_raid_stride_length(leaf, stride);
+ u64 physical = btrfs_raid_stride_physical(leaf, stride);
+
+ if (offset >= len) {
+ offset -= len;
+
+ if (offset >= BTRFS_STRIPE_LEN)
+ continue;
+ }
+
+ if (devid != stripe->dev->devid)
+ continue;
+
+ if ((map_type & BTRFS_BLOCK_GROUP_DUP) && stripe_index != i)
+ continue;
+
+ stripe->physical = physical + offset;
+
+ ret = 0;
+ goto free_path;
+ }
+
+ /*
+ * If we're here, we haven't found the requested devid in the stripe.
+ */
+ ret = -ENOENT;
+out:
+ if (ret > 0)
+ ret = -ENOENT;
+ if (ret && ret != -EIO) {
+ /*
+ * Check if the range we're looking for is actually backed by
+ * an extent. This can happen, e.g. when scrub is running on a
+ * block-group and the extent it is trying to scrub get's
+ * deleted in the meantime. Although scrub is setting the
+ * block-group to read-only, deletion of extents are still
+ * allowed. If the extent is gone, simply return ENOENT and be
+ * good.
+ */
+ if (btrfs_check_for_extent(fs_info, logical, *length, path)) {
+ ret = -ENOENT;
+ goto free_path;
+ }
+
+ if (IS_ENABLED(CONFIG_BTRFS_DEBUG))
+ btrfs_print_tree(leaf, 1);
+ btrfs_err(fs_info,
+ "cannot find raid-stripe for logical [%llu, %llu] devid %llu, profile %s",
+ logical, logical + *length, stripe->dev->devid,
+ btrfs_bg_type_to_raid_name(map_type));
+ }
+free_path:
+ btrfs_free_path(path);
+
+ return ret;
+}
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index 7560dc501a65..40aa553ae8aa 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -13,6 +13,10 @@ struct btrfs_io_stripe;
int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
u64 length);
+int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
+ u64 logical, u64 *length, u64 map_type,
+ u32 stripe_index,
+ struct btrfs_io_stripe *stripe);
int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
struct btrfs_ordered_extent *ordered_extent);
@@ -33,4 +37,11 @@ static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
return false;
}
+
+static inline int btrfs_num_raid_stripes(u32 item_size)
+{
+ return (item_size - offsetof(struct btrfs_stripe_extent, strides)) /
+ sizeof(struct btrfs_raid_stride);
+}
+
#endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0c0fd4eb4848..7c25f5c77788 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -35,6 +35,7 @@
#include "relocation.h"
#include "scrub.h"
#include "super.h"
+#include "raid-stripe-tree.h"
#define BTRFS_BLOCK_GROUP_STRIPE_MASK (BTRFS_BLOCK_GROUP_RAID0 | \
BTRFS_BLOCK_GROUP_RAID10 | \
@@ -6206,12 +6207,22 @@ static u64 btrfs_max_io_len(struct map_lookup *map, enum btrfs_map_op op,
return U64_MAX;
}
-static void set_io_stripe(struct btrfs_io_stripe *dst, const struct map_lookup *map,
- u32 stripe_index, u64 stripe_offset, u32 stripe_nr)
+static int set_io_stripe(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
+ u64 logical, u64 *length, struct btrfs_io_stripe *dst,
+ struct map_lookup *map, u32 stripe_index,
+ u64 stripe_offset, u64 stripe_nr)
{
dst->dev = map->stripes[stripe_index].dev;
+
+ if (op == BTRFS_MAP_READ &&
+ btrfs_need_stripe_tree_update(fs_info, map->type))
+ return btrfs_get_raid_extent_offset(fs_info, logical, length,
+ map->type, stripe_index,
+ dst);
+
dst->physical = map->stripes[stripe_index].physical +
stripe_offset + btrfs_stripe_nr_to_offset(stripe_nr);
+ return 0;
}
/*
@@ -6428,11 +6439,11 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
*/
if (smap && num_alloc_stripes == 1 &&
!((map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) && mirror_num > 1)) {
- set_io_stripe(smap, map, stripe_index, stripe_offset, stripe_nr);
+ ret = set_io_stripe(fs_info, op, logical, length, smap, map,
+ stripe_index, stripe_offset, stripe_nr);
if (mirror_num_ret)
*mirror_num_ret = mirror_num;
*bioc_ret = NULL;
- ret = 0;
goto out;
}
@@ -6463,21 +6474,29 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
bioc->full_stripe_logical = em->start +
btrfs_stripe_nr_to_offset(stripe_nr * data_stripes);
for (i = 0; i < num_stripes; i++)
- set_io_stripe(&bioc->stripes[i], map,
- (i + stripe_nr) % num_stripes,
- stripe_offset, stripe_nr);
+ ret = set_io_stripe(fs_info, op, logical, length,
+ &bioc->stripes[i], map,
+ (i + stripe_nr) % num_stripes,
+ stripe_offset, stripe_nr);
} else {
/*
* For all other non-RAID56 profiles, just copy the target
* stripe into the bioc.
*/
for (i = 0; i < num_stripes; i++) {
- set_io_stripe(&bioc->stripes[i], map, stripe_index,
- stripe_offset, stripe_nr);
+ ret = set_io_stripe(fs_info, op, logical, length,
+ &bioc->stripes[i], map, stripe_index,
+ stripe_offset, stripe_nr);
stripe_index++;
}
}
+ if (ret) {
+ *bioc_ret = NULL;
+ btrfs_put_bioc(bioc);
+ goto out;
+ }
+
if (op != BTRFS_MAP_READ)
max_errors = btrfs_chunk_max_errors(map);
--
2.41.0
* [PATCH v8 06/11] btrfs: implement RST version of scrub
2023-09-11 12:52 [PATCH v8 00/11] btrfs: introduce RAID stripe tree Johannes Thumshirn
` (4 preceding siblings ...)
2023-09-11 12:52 ` [PATCH v8 05/11] btrfs: lookup physical address from stripe extent Johannes Thumshirn
@ 2023-09-11 12:52 ` Johannes Thumshirn
2023-09-13 9:51 ` Qu Wenruo
2023-09-13 16:59 ` David Sterba
2023-09-11 12:52 ` [PATCH v8 07/11] btrfs: zoned: allow zoned RAID Johannes Thumshirn
` (4 subsequent siblings)
10 siblings, 2 replies; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-11 12:52 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Johannes Thumshirn, Christoph Hellwig, Naohiro Aota, Qu Wenruo,
Damien Le Moal, linux-btrfs, linux-kernel
A filesystem that uses the RAID stripe tree for logical to physical
address translation can't use the regular scrub path, which reads all
stripes and then checks afterwards if a sector is unused.
When using the RAID stripe tree, this will result in lookup errors, as the
stripe tree doesn't know the requested logical addresses.
Instead, only look up and read the sectors that are backed by extents, as
recorded in the extent sector bitmap.
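A minimal sketch of the resulting read pattern (illustrative userspace code,
not the scrub implementation): walk the extent sector bitmap and issue one
read per contiguous run of used sectors:

/* Illustrative only: read just the sectors marked used in the bitmap,
 * batching neighbouring sectors into a single read. */
#include <stdint.h>
#include <stdio.h>

#define NR_SECTORS 16

static void submit_read(int first, int count)
{
        printf("read sectors [%d, %d)\n", first, first + count);
}

static void scrub_used_sectors(uint32_t bitmap)
{
        int run_start = -1;

        for (int i = 0; i <= NR_SECTORS; i++) {
                int used = i < NR_SECTORS && (bitmap & (1u << i));

                if (used && run_start < 0)
                        run_start = i;          /* a run of used sectors begins */
                else if (!used && run_start >= 0) {
                        submit_read(run_start, i - run_start);
                        run_start = -1;         /* gap: submit and reset */
                }
        }
}

int main(void)
{
        /* Sectors 0-3 and 8-11 are backed by extents, the rest are holes. */
        scrub_used_sectors(0x0f0f);
        return 0;
}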
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/scrub.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+)
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index f16220ce5fba..5101e0a3f83e 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -23,6 +23,7 @@
#include "accessors.h"
#include "file-item.h"
#include "scrub.h"
+#include "raid-stripe-tree.h"
/*
* This is only the first step towards a full-features scrub. It reads all
@@ -1634,6 +1635,56 @@ static void scrub_reset_stripe(struct scrub_stripe *stripe)
}
}
+static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
+ struct scrub_stripe *stripe)
+{
+ struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
+ struct btrfs_bio *bbio = NULL;
+ int mirror = stripe->mirror_num;
+ int i;
+
+ atomic_inc(&stripe->pending_io);
+
+ for_each_set_bit(i, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
+ struct page *page;
+ int pgoff;
+
+ page = scrub_stripe_get_page(stripe, i);
+ pgoff = scrub_stripe_get_page_offset(stripe, i);
+
+ /* The current sector cannot be merged, submit the bio. */
+ if (bbio &&
+ ((i > 0 && !test_bit(i - 1, &stripe->extent_sector_bitmap)) ||
+ bbio->bio.bi_iter.bi_size >= BTRFS_STRIPE_LEN)) {
+ ASSERT(bbio->bio.bi_iter.bi_size);
+ atomic_inc(&stripe->pending_io);
+ btrfs_submit_bio(bbio, mirror);
+ bbio = NULL;
+ }
+
+ if (!bbio) {
+ bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
+ fs_info, scrub_read_endio, stripe);
+ bbio->bio.bi_iter.bi_sector = (stripe->logical +
+ (i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT;
+ }
+
+ __bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff);
+ }
+
+ if (bbio) {
+ ASSERT(bbio->bio.bi_iter.bi_size);
+ atomic_inc(&stripe->pending_io);
+ btrfs_submit_bio(bbio, mirror);
+ }
+
+ if (atomic_dec_and_test(&stripe->pending_io)) {
+ wake_up(&stripe->io_wait);
+ INIT_WORK(&stripe->work, scrub_stripe_read_repair_worker);
+ queue_work(stripe->bg->fs_info->scrub_workers, &stripe->work);
+ }
+}
+
static void scrub_submit_initial_read(struct scrub_ctx *sctx,
struct scrub_stripe *stripe)
{
@@ -1645,6 +1696,11 @@ static void scrub_submit_initial_read(struct scrub_ctx *sctx,
ASSERT(stripe->mirror_num > 0);
ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state));
+ if (btrfs_need_stripe_tree_update(fs_info, stripe->bg->flags)) {
+ scrub_submit_extent_sector_read(sctx, stripe);
+ return;
+ }
+
bbio = btrfs_bio_alloc(SCRUB_STRIPE_PAGES, REQ_OP_READ, fs_info,
scrub_read_endio, stripe);
--
2.41.0
* [PATCH v8 07/11] btrfs: zoned: allow zoned RAID
2023-09-11 12:52 [PATCH v8 00/11] btrfs: introduce RAID stripe tree Johannes Thumshirn
` (5 preceding siblings ...)
2023-09-11 12:52 ` [PATCH v8 06/11] btrfs: implement RST version of scrub Johannes Thumshirn
@ 2023-09-11 12:52 ` Johannes Thumshirn
2023-09-12 20:49 ` David Sterba
2023-09-11 12:52 ` [PATCH v8 08/11] btrfs: add raid stripe tree pretty printer Johannes Thumshirn
` (3 subsequent siblings)
10 siblings, 1 reply; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-11 12:52 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Johannes Thumshirn, Christoph Hellwig, Naohiro Aota, Qu Wenruo,
Damien Le Moal, linux-btrfs, linux-kernel
When we have a raid-stripe-tree, we can do RAID0/1/10 on zoned devices for
data block-groups. For metadata block-groups, we don't actually need
anything special, as all metadata I/O is already protected by
btrfs_zoned_meta_io_lock().
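As a rough sketch of the consistency check this adds for mirrored zoned
block groups (simplified, ignoring the degraded and missing-device handling
the real code has):

/* All present mirrors must share the same write pointer offset,
 * otherwise the block group cannot be used. */
#include <stdint.h>
#include <stdio.h>

#define WP_MISSING UINT64_MAX

static int check_mirror_write_pointers(const uint64_t *wp, int n)
{
        for (int i = 1; i < n; i++) {
                if (wp[i] == WP_MISSING)
                        continue;
                if (wp[i] != wp[0])
                        return -1;      /* mismatch: refuse the block group */
        }
        return 0;
}

int main(void)
{
        uint64_t good[] = { 262144, 262144 };
        uint64_t bad[]  = { 262144, 131072 };

        printf("good: %d\n", check_mirror_write_pointers(good, 2));
        printf("bad:  %d\n", check_mirror_write_pointers(bad, 2));
        return 0;
}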
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/raid-stripe-tree.h | 7 ++-
fs/btrfs/volumes.c | 2 +
fs/btrfs/zoned.c | 113 +++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 119 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index 40aa553ae8aa..30c7d5981890 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -8,6 +8,11 @@
#include "disk-io.h"
+#define BTRFS_RST_SUPP_BLOCK_GROUP_MASK (BTRFS_BLOCK_GROUP_DUP |\
+ BTRFS_BLOCK_GROUP_RAID1_MASK |\
+ BTRFS_BLOCK_GROUP_RAID0 |\
+ BTRFS_BLOCK_GROUP_RAID10)
+
struct btrfs_io_context;
struct btrfs_io_stripe;
@@ -32,7 +37,7 @@ static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
if (type != BTRFS_BLOCK_GROUP_DATA)
return false;
- if (profile & BTRFS_BLOCK_GROUP_RAID1_MASK)
+ if (profile & BTRFS_RST_SUPP_BLOCK_GROUP_MASK)
return true;
return false;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7c25f5c77788..9f17e5f290f4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6438,6 +6438,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
* I/O context structure.
*/
if (smap && num_alloc_stripes == 1 &&
+ !(btrfs_need_stripe_tree_update(fs_info, map->type) &&
+ op != BTRFS_MAP_READ) &&
!((map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) && mirror_num > 1)) {
ret = set_io_stripe(fs_info, op, logical, length, smap, map,
stripe_index, stripe_offset, stripe_nr);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index c6eedf4bfba9..4ca36875058c 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1481,8 +1481,9 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &cache->runtime_flags);
break;
case BTRFS_BLOCK_GROUP_DUP:
- if (map->type & BTRFS_BLOCK_GROUP_DATA) {
- btrfs_err(fs_info, "zoned: profile DUP not yet supported on data bg");
+ if (map->type & BTRFS_BLOCK_GROUP_DATA &&
+ !btrfs_stripe_tree_root(fs_info)) {
+ btrfs_err(fs_info, "zoned: data DUP profile needs stripe_root");
ret = -EINVAL;
goto out;
}
@@ -1520,8 +1521,116 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
cache->zone_capacity = min(caps[0], caps[1]);
break;
case BTRFS_BLOCK_GROUP_RAID1:
+ case BTRFS_BLOCK_GROUP_RAID1C3:
+ case BTRFS_BLOCK_GROUP_RAID1C4:
+ if (map->type & BTRFS_BLOCK_GROUP_DATA &&
+ !btrfs_stripe_tree_root(fs_info)) {
+ btrfs_err(fs_info,
+ "zoned: data %s needs stripe_root",
+ btrfs_bg_type_to_raid_name(map->type));
+ ret = -EIO;
+ goto out;
+
+ }
+
+ for (i = 0; i < map->num_stripes; i++) {
+ if (alloc_offsets[i] == WP_MISSING_DEV ||
+ alloc_offsets[i] == WP_CONVENTIONAL)
+ continue;
+
+ if ((alloc_offsets[0] != alloc_offsets[i]) &&
+ !btrfs_test_opt(fs_info, DEGRADED)) {
+ btrfs_err(fs_info,
+ "zoned: write pointer offset mismatch of zones in %s profile",
+ btrfs_bg_type_to_raid_name(map->type));
+ ret = -EIO;
+ goto out;
+ }
+ if (test_bit(0, active) != test_bit(i, active)) {
+ if (!btrfs_test_opt(fs_info, DEGRADED) &&
+ !btrfs_zone_activate(cache)) {
+ ret = -EIO;
+ goto out;
+ }
+ } else {
+ if (test_bit(0, active))
+ set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
+ &cache->runtime_flags);
+ }
+ /*
+ * In case a device is missing we have a cap of 0, so don't
+ * use it.
+ */
+ cache->zone_capacity = min_not_zero(caps[0], caps[i]);
+ }
+
+ if (alloc_offsets[0] != WP_MISSING_DEV)
+ cache->alloc_offset = alloc_offsets[0];
+ else
+ cache->alloc_offset = alloc_offsets[i - 1];
+ break;
case BTRFS_BLOCK_GROUP_RAID0:
+ if (map->type & BTRFS_BLOCK_GROUP_DATA &&
+ !btrfs_stripe_tree_root(fs_info)) {
+ btrfs_err(fs_info,
+ "zoned: data %s needs stripe_root",
+ btrfs_bg_type_to_raid_name(map->type));
+ ret = -EIO;
+ goto out;
+
+ }
+ for (i = 0; i < map->num_stripes; i++) {
+ if (alloc_offsets[i] == WP_MISSING_DEV ||
+ alloc_offsets[i] == WP_CONVENTIONAL)
+ continue;
+
+ if (test_bit(0, active) != test_bit(i, active)) {
+ if (!btrfs_zone_activate(cache)) {
+ ret = -EIO;
+ goto out;
+ }
+ } else {
+ if (test_bit(0, active))
+ set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
+ &cache->runtime_flags);
+ }
+ cache->zone_capacity += caps[i];
+ cache->alloc_offset += alloc_offsets[i];
+
+ }
+ break;
case BTRFS_BLOCK_GROUP_RAID10:
+ if (map->type & BTRFS_BLOCK_GROUP_DATA &&
+ !btrfs_stripe_tree_root(fs_info)) {
+ btrfs_err(fs_info,
+ "zoned: data %s needs stripe_root",
+ btrfs_bg_type_to_raid_name(map->type));
+ ret = -EIO;
+ goto out;
+
+ }
+ for (i = 0; i < map->num_stripes; i++) {
+ if (alloc_offsets[i] == WP_MISSING_DEV ||
+ alloc_offsets[i] == WP_CONVENTIONAL)
+ continue;
+
+ if (test_bit(0, active) != test_bit(i, active)) {
+ if (!btrfs_zone_activate(cache)) {
+ ret = -EIO;
+ goto out;
+ }
+ } else {
+ if (test_bit(0, active))
+ set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
+ &cache->runtime_flags);
+ }
+ if ((i % map->sub_stripes) == 0) {
+ cache->zone_capacity += caps[i];
+ cache->alloc_offset += alloc_offsets[i];
+ }
+
+ }
+ break;
case BTRFS_BLOCK_GROUP_RAID5:
case BTRFS_BLOCK_GROUP_RAID6:
/* non-single profiles are not supported yet */
--
2.41.0
* [PATCH v8 08/11] btrfs: add raid stripe tree pretty printer
2023-09-11 12:52 [PATCH v8 00/11] btrfs: introduce RAID stripe tree Johannes Thumshirn
` (6 preceding siblings ...)
2023-09-11 12:52 ` [PATCH v8 07/11] btrfs: zoned: allow zoned RAID Johannes Thumshirn
@ 2023-09-11 12:52 ` Johannes Thumshirn
2023-09-12 20:42 ` David Sterba
2023-09-11 12:52 ` [PATCH v8 09/11] btrfs: announce presence of raid-stripe-tree in sysfs Johannes Thumshirn
` (2 subsequent siblings)
10 siblings, 1 reply; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-11 12:52 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Johannes Thumshirn, Christoph Hellwig, Naohiro Aota, Qu Wenruo,
Damien Le Moal, linux-btrfs, linux-kernel
Decode raid-stripe-tree entries on btrfs_print_tree().
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/print-tree.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 49 insertions(+)
diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
index 0c93439e929f..f01919e4bb37 100644
--- a/fs/btrfs/print-tree.c
+++ b/fs/btrfs/print-tree.c
@@ -9,6 +9,7 @@
#include "print-tree.h"
#include "accessors.h"
#include "tree-checker.h"
+#include "raid-stripe-tree.h"
struct root_name_map {
u64 id;
@@ -28,6 +29,7 @@ static const struct root_name_map root_map[] = {
{ BTRFS_FREE_SPACE_TREE_OBJECTID, "FREE_SPACE_TREE" },
{ BTRFS_BLOCK_GROUP_TREE_OBJECTID, "BLOCK_GROUP_TREE" },
{ BTRFS_DATA_RELOC_TREE_OBJECTID, "DATA_RELOC_TREE" },
+ { BTRFS_RAID_STRIPE_TREE_OBJECTID, "RAID_STRIPE_TREE" },
};
const char *btrfs_root_name(const struct btrfs_key *key, char *buf)
@@ -189,6 +191,48 @@ static void print_uuid_item(const struct extent_buffer *l, unsigned long offset,
}
}
+struct raid_encoding_map {
+ u8 encoding;
+ char name[16];
+};
+
+static const struct raid_encoding_map raid_map[] = {
+ { BTRFS_STRIPE_DUP, "DUP" },
+ { BTRFS_STRIPE_RAID0, "RAID0" },
+ { BTRFS_STRIPE_RAID1, "RAID1" },
+ { BTRFS_STRIPE_RAID1C3, "RAID1C3" },
+ { BTRFS_STRIPE_RAID1C4, "RAID1C4" },
+ { BTRFS_STRIPE_RAID5, "RAID5" },
+ { BTRFS_STRIPE_RAID6, "RAID6" },
+ { BTRFS_STRIPE_RAID10, "RAID10" }
+};
+
+static const char *stripe_encoding_name(u8 encoding)
+{
+ for (int i = 0; i < ARRAY_SIZE(raid_map); i++) {
+ if (raid_map[i].encoding == encoding)
+ return raid_map[i].name;
+ }
+
+ return "UNKNOWN";
+}
+
+static void print_raid_stripe_key(const struct extent_buffer *eb, u32 item_size,
+ struct btrfs_stripe_extent *stripe)
+{
+ int num_stripes = btrfs_num_raid_stripes(item_size);
+ u8 encoding = btrfs_stripe_extent_encoding(eb, stripe);
+ int i;
+
+ pr_info("\t\t\tencoding: %s\n", stripe_encoding_name(encoding));
+
+ for (i = 0; i < num_stripes; i++)
+ pr_info("\t\t\tstride %d devid %llu physical %llu length %llu\n",
+ i, btrfs_raid_stride_devid(eb, &stripe->strides[i]),
+ btrfs_raid_stride_physical(eb, &stripe->strides[i]),
+ btrfs_raid_stride_length(eb, &stripe->strides[i]));
+}
+
/*
* Helper to output refs and locking status of extent buffer. Useful to debug
* race condition related problems.
@@ -349,6 +393,11 @@ void btrfs_print_leaf(const struct extent_buffer *l)
print_uuid_item(l, btrfs_item_ptr_offset(l, i),
btrfs_item_size(l, i));
break;
+ case BTRFS_RAID_STRIPE_KEY:
+ print_raid_stripe_key(l, btrfs_item_size(l, i),
+ btrfs_item_ptr(l, i,
+ struct btrfs_stripe_extent));
+ break;
}
}
}
--
2.41.0
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH v8 09/11] btrfs: announce presence of raid-stripe-tree in sysfs
2023-09-11 12:52 [PATCH v8 00/11] btrfs: introduce RAID stripe tree Johannes Thumshirn
` (7 preceding siblings ...)
2023-09-11 12:52 ` [PATCH v8 08/11] btrfs: add raid stripe tree pretty printer Johannes Thumshirn
@ 2023-09-11 12:52 ` Johannes Thumshirn
2023-09-11 12:52 ` [PATCH v8 10/11] btrfs: add trace events for RST Johannes Thumshirn
2023-09-11 12:52 ` [PATCH v8 11/11] btrfs: add raid-stripe-tree to features enabled with debug Johannes Thumshirn
10 siblings, 0 replies; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-11 12:52 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Johannes Thumshirn, Christoph Hellwig, Naohiro Aota, Qu Wenruo,
Damien Le Moal, linux-btrfs, linux-kernel
If a filesystem with a raid-stripe-tree is mounted, show the RST feature
in sysfs.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/sysfs.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index b1d1ac25237b..1bab3d7d251e 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -297,6 +297,8 @@ BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
#ifdef CONFIG_BTRFS_DEBUG
/* Remove once support for extent tree v2 is feature complete */
BTRFS_FEAT_ATTR_INCOMPAT(extent_tree_v2, EXTENT_TREE_V2);
+/* Remove once support for raid stripe tree is feature complete */
+BTRFS_FEAT_ATTR_INCOMPAT(raid_stripe_tree, RAID_STRIPE_TREE);
#endif
#ifdef CONFIG_FS_VERITY
BTRFS_FEAT_ATTR_COMPAT_RO(verity, VERITY);
@@ -327,6 +329,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
#endif
#ifdef CONFIG_BTRFS_DEBUG
BTRFS_FEAT_ATTR_PTR(extent_tree_v2),
+ BTRFS_FEAT_ATTR_PTR(raid_stripe_tree),
#endif
#ifdef CONFIG_FS_VERITY
BTRFS_FEAT_ATTR_PTR(verity),
--
2.41.0
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH v8 10/11] btrfs: add trace events for RST
2023-09-11 12:52 [PATCH v8 00/11] btrfs: introduce RAID stripe tree Johannes Thumshirn
` (8 preceding siblings ...)
2023-09-11 12:52 ` [PATCH v8 09/11] btrfs: announce presence of raid-stripe-tree in sysfs Johannes Thumshirn
@ 2023-09-11 12:52 ` Johannes Thumshirn
2023-09-12 20:46 ` David Sterba
2023-09-11 12:52 ` [PATCH v8 11/11] btrfs: add raid-stripe-tree to features enabled with debug Johannes Thumshirn
10 siblings, 1 reply; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-11 12:52 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Johannes Thumshirn, Christoph Hellwig, Naohiro Aota, Qu Wenruo,
Damien Le Moal, linux-btrfs, linux-kernel
Add trace events for raid-stripe-tree operations.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/raid-stripe-tree.c | 8 +++++
include/trace/events/btrfs.h | 75 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 83 insertions(+)
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 7ed02e4b79ec..5a9952cf557c 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -62,6 +62,9 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
if (found_end <= start)
break;
+ trace_btrfs_raid_extent_delete(fs_info, start, end,
+ found_start, found_end);
+
ASSERT(found_start >= start && found_end <= end);
ret = btrfs_del_item(trans, stripe_root, path);
if (ret)
@@ -120,6 +123,8 @@ static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
return -ENOMEM;
}
+ trace_btrfs_insert_one_raid_extent(fs_info, bioc->logical, bioc->size,
+ num_stripes);
btrfs_set_stack_stripe_extent_encoding(stripe_extent, encoding);
for (int i = 0; i < num_stripes; i++) {
u64 devid = bioc->stripes[i].dev->devid;
@@ -445,6 +450,9 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
stripe->physical = physical + offset;
+ trace_btrfs_get_raid_extent_offset(fs_info, logical, *length,
+ stripe->physical, devid);
+
ret = 0;
goto free_path;
}
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index b2db2c2f1c57..e2c6f1199212 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2497,6 +2497,81 @@ DEFINE_EVENT(btrfs_raid56_bio, raid56_write,
TP_ARGS(rbio, bio, trace_info)
);
+TRACE_EVENT(btrfs_insert_one_raid_extent,
+
+ TP_PROTO(struct btrfs_fs_info *fs_info, u64 logical, u64 length,
+ int num_stripes),
+
+ TP_ARGS(fs_info, logical, length, num_stripes),
+
+ TP_STRUCT__entry_btrfs(
+ __field( u64, logical )
+ __field( u64, length )
+ __field( int, num_stripes )
+ ),
+
+ TP_fast_assign_btrfs(fs_info,
+ __entry->logical = logical;
+ __entry->length = length;
+ __entry->num_stripes = num_stripes;
+ ),
+
+ TP_printk_btrfs("logical=%llu, length=%llu, num_stripes=%d",
+ __entry->logical, __entry->length,
+ __entry->num_stripes)
+);
+
+TRACE_EVENT(btrfs_raid_extent_delete,
+
+ TP_PROTO(struct btrfs_fs_info *fs_info, u64 start, u64 end,
+ u64 found_start, u64 found_end),
+
+ TP_ARGS(fs_info, start, end, found_start, found_end),
+
+ TP_STRUCT__entry_btrfs(
+ __field( u64, start )
+ __field( u64, end )
+ __field( u64, found_start )
+ __field( u64, found_end )
+ ),
+
+ TP_fast_assign_btrfs(fs_info,
+ __entry->start = start;
+ __entry->end = end;
+ __entry->found_start = found_start;
+ __entry->found_end = found_end;
+ ),
+
+ TP_printk_btrfs("start=%llu, end=%llu, found_start=%llu, found_end=%llu",
+ __entry->start, __entry->end, __entry->found_start,
+ __entry->found_end)
+);
+
+TRACE_EVENT(btrfs_get_raid_extent_offset,
+
+ TP_PROTO(struct btrfs_fs_info *fs_info, u64 logical, u64 length,
+ u64 physical, u64 devid),
+
+ TP_ARGS(fs_info, logical, length, physical, devid),
+
+ TP_STRUCT__entry_btrfs(
+ __field( u64, logical )
+ __field( u64, length )
+ __field( u64, physical )
+ __field( u64, devid )
+ ),
+
+ TP_fast_assign_btrfs(fs_info,
+ __entry->logical = logical;
+ __entry->length = length;
+ __entry->physical = physical;
+ __entry->devid = devid;
+ ),
+
+ TP_printk_btrfs("logical=%llu, length=%llu, physical=%llu, devid=%llu",
+ __entry->logical, __entry->length, __entry->physical,
+ __entry->devid)
+);
#endif /* _TRACE_BTRFS_H */
/* This part must be outside protection */
--
2.41.0
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH v8 11/11] btrfs: add raid-stripe-tree to features enabled with debug
2023-09-11 12:52 [PATCH v8 00/11] btrfs: introduce RAID stripe tree Johannes Thumshirn
` (9 preceding siblings ...)
2023-09-11 12:52 ` [PATCH v8 10/11] btrfs: add trace events for RST Johannes Thumshirn
@ 2023-09-11 12:52 ` Johannes Thumshirn
10 siblings, 0 replies; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-11 12:52 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Johannes Thumshirn, Christoph Hellwig, Naohiro Aota, Qu Wenruo,
Damien Le Moal, linux-btrfs, linux-kernel
Until the RAID stripe tree code is well enough tested and feature
complete, "hide" it behind CONFIG_BTRFS_DEBUG so only people who
want to use it are actually using it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/fs.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 5c7778e8b5ed..0f5894e2bdeb 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -223,7 +223,8 @@ enum {
*/
#define BTRFS_FEATURE_INCOMPAT_SUPP \
(BTRFS_FEATURE_INCOMPAT_SUPP_STABLE | \
- BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2)
+ BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2 |\
+ BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE)
#else
--
2.41.0
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [PATCH v8 01/11] btrfs: add raid stripe tree definitions
2023-09-11 12:52 ` [PATCH v8 01/11] btrfs: add raid stripe tree definitions Johannes Thumshirn
@ 2023-09-11 21:00 ` Damien Le Moal
2023-09-12 6:09 ` Johannes Thumshirn
2023-09-12 20:32 ` David Sterba
1 sibling, 1 reply; 39+ messages in thread
From: Damien Le Moal @ 2023-09-11 21:00 UTC (permalink / raw)
To: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, linux-btrfs,
linux-kernel
On 9/11/23 21:52, Johannes Thumshirn wrote:
> Add definitions for the raid stripe tree. This tree will hold information
> about the on-disk layout of the stripes in a RAID set.
>
> Each stripe extent has a 1:1 relationship with an on-disk extent item and
> is doing the logical to per-drive physical address translation for the
> extent item in question.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/accessors.h | 10 ++++++++++
> fs/btrfs/locking.c | 5 +++--
> include/uapi/linux/btrfs_tree.h | 33 +++++++++++++++++++++++++++++++--
> 3 files changed, 44 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
> index f958eccff477..977ff160a024 100644
> --- a/fs/btrfs/accessors.h
> +++ b/fs/btrfs/accessors.h
> @@ -306,6 +306,16 @@ BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
> BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
> BTRFS_SETGET_STACK_FUNCS(stack_timespec_nsec, struct btrfs_timespec, nsec, 32);
>
> +BTRFS_SETGET_FUNCS(stripe_extent_encoding, struct btrfs_stripe_extent, encoding, 8);
> +BTRFS_SETGET_FUNCS(raid_stride_devid, struct btrfs_raid_stride, devid, 64);
> +BTRFS_SETGET_FUNCS(raid_stride_physical, struct btrfs_raid_stride, physical, 64);
> +BTRFS_SETGET_FUNCS(raid_stride_length, struct btrfs_raid_stride, length, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_stripe_extent_encoding,
> + struct btrfs_stripe_extent, encoding, 8);
> +BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_devid, struct btrfs_raid_stride, devid, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_physical, struct btrfs_raid_stride, physical, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_length, struct btrfs_raid_stride, length, 64);
> +
> /* struct btrfs_dev_extent */
> BTRFS_SETGET_FUNCS(dev_extent_chunk_tree, struct btrfs_dev_extent, chunk_tree, 64);
> BTRFS_SETGET_FUNCS(dev_extent_chunk_objectid, struct btrfs_dev_extent,
> diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
> index 6ac4fd8cc8dc..e7760d40feab 100644
> --- a/fs/btrfs/locking.c
> +++ b/fs/btrfs/locking.c
> @@ -58,8 +58,8 @@
>
> static struct btrfs_lockdep_keyset {
> u64 id; /* root objectid */
> - /* Longest entry: btrfs-block-group-00 */
> - char names[BTRFS_MAX_LEVEL][24];
> + /* Longest entry: btrfs-raid-stripe-tree-00 */
> + char names[BTRFS_MAX_LEVEL][25];
> struct lock_class_key keys[BTRFS_MAX_LEVEL];
> } btrfs_lockdep_keysets[] = {
> { .id = BTRFS_ROOT_TREE_OBJECTID, DEFINE_NAME("root") },
> @@ -74,6 +74,7 @@ static struct btrfs_lockdep_keyset {
> { .id = BTRFS_UUID_TREE_OBJECTID, DEFINE_NAME("uuid") },
> { .id = BTRFS_FREE_SPACE_TREE_OBJECTID, DEFINE_NAME("free-space") },
> { .id = BTRFS_BLOCK_GROUP_TREE_OBJECTID, DEFINE_NAME("block-group") },
> + { .id = BTRFS_RAID_STRIPE_TREE_OBJECTID,DEFINE_NAME("raid-stripe-tree") },
> { .id = 0, DEFINE_NAME("tree") },
> };
>
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index fc3c32186d7e..3fb758ce3ac0 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -4,9 +4,8 @@
>
> #include <linux/btrfs.h>
> #include <linux/types.h>
> -#ifdef __KERNEL__
> #include <linux/stddef.h>
> -#else
> +#ifndef __KERNEL__
> #include <stddef.h>
> #endif
This change seems unrelated to the RAID stripe tree. Should this be a patch on
its own?
>
> @@ -73,6 +72,9 @@
> /* Holds the block group items for extent tree v2. */
> #define BTRFS_BLOCK_GROUP_TREE_OBJECTID 11ULL
>
> +/* tracks RAID stripes in block groups. */
> +#define BTRFS_RAID_STRIPE_TREE_OBJECTID 12ULL
> +
> /* device stats in the device tree */
> #define BTRFS_DEV_STATS_OBJECTID 0ULL
>
> @@ -285,6 +287,8 @@
> */
> #define BTRFS_QGROUP_RELATION_KEY 246
>
> +#define BTRFS_RAID_STRIPE_KEY 247
> +
> /*
> * Obsolete name, see BTRFS_TEMPORARY_ITEM_KEY.
> */
> @@ -719,6 +723,31 @@ struct btrfs_free_space_header {
> __le64 num_bitmaps;
> } __attribute__ ((__packed__));
>
> +struct btrfs_raid_stride {
> + /* btrfs device-id this raid extent lives on */
> + __le64 devid;
> + /* physical location on disk */
> + __le64 physical;
> + /* length of stride on this disk */
> + __le64 length;
> +};
> +
> +#define BTRFS_STRIPE_DUP 0
> +#define BTRFS_STRIPE_RAID0 1
> +#define BTRFS_STRIPE_RAID1 2
> +#define BTRFS_STRIPE_RAID1C3 3
> +#define BTRFS_STRIPE_RAID1C4 4
> +#define BTRFS_STRIPE_RAID5 5
> +#define BTRFS_STRIPE_RAID6 6
> +#define BTRFS_STRIPE_RAID10 7
> +
> +struct btrfs_stripe_extent {
> + __u8 encoding;
> + __u8 reserved[7];
> + /* array of raid strides this stripe is composed of */
> + __DECLARE_FLEX_ARRAY(struct btrfs_raid_stride, strides);
> +};
> +
> #define BTRFS_HEADER_FLAG_WRITTEN (1ULL << 0)
> #define BTRFS_HEADER_FLAG_RELOC (1ULL << 1)
>
>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v8 01/11] btrfs: add raid stripe tree definitions
2023-09-11 21:00 ` Damien Le Moal
@ 2023-09-12 6:09 ` Johannes Thumshirn
0 siblings, 0 replies; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-12 6:09 UTC (permalink / raw)
To: Damien Le Moal, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 11.09.23 23:01, Damien Le Moal wrote:
>> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
>> index fc3c32186d7e..3fb758ce3ac0 100644
>> --- a/include/uapi/linux/btrfs_tree.h
>> +++ b/include/uapi/linux/btrfs_tree.h
>> @@ -4,9 +4,8 @@
>>
>> #include <linux/btrfs.h>
>> #include <linux/types.h>
>> -#ifdef __KERNEL__
>> #include <linux/stddef.h>
>> -#else
>> +#ifndef __KERNEL__
>> #include <stddef.h>
>> #endif
>
> This change seems unrelated to the RAID stripe tree. Should this be a patch on
> its own ?
Nope, it isn't. This patch introduces a user of __DECLARE_FLEX_ARRAY(),
and without the moved ifdef userspace can't find the definition of it.
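A minimal userspace sketch (a hypothetical test program, assuming uapi
headers installed from a kernel carrying this series) showing the
dependency: struct btrfs_stripe_extent uses __DECLARE_FLEX_ARRAY(), and
userspace only gets that macro from <linux/stddef.h>, not from the C
library's <stddef.h>.

/* test_rst_uapi.c: compiles only if btrfs_tree.h includes <linux/stddef.h> */
#include <linux/btrfs_tree.h>
#include <stdio.h>

int main(void)
{
	/* The flexible strides[] array contributes no size of its own. */
	printf("%zu\n", sizeof(struct btrfs_stripe_extent));
	return 0;
}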
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v8 01/11] btrfs: add raid stripe tree definitions
2023-09-11 12:52 ` [PATCH v8 01/11] btrfs: add raid stripe tree definitions Johannes Thumshirn
2023-09-11 21:00 ` Damien Le Moal
@ 2023-09-12 20:32 ` David Sterba
2023-09-13 6:02 ` Johannes Thumshirn
1 sibling, 1 reply; 39+ messages in thread
From: David Sterba @ 2023-09-12 20:32 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Chris Mason, Josef Bacik, David Sterba, Christoph Hellwig,
Naohiro Aota, Qu Wenruo, Damien Le Moal, linux-btrfs,
linux-kernel
On Mon, Sep 11, 2023 at 05:52:02AM -0700, Johannes Thumshirn wrote:
> Add definitions for the raid stripe tree. This tree will hold information
> about the on-disk layout of the stripes in a RAID set.
>
> Each stripe extent has a 1:1 relationship with an on-disk extent item and
> is doing the logical to per-drive physical address translation for the
> extent item in question.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/accessors.h | 10 ++++++++++
> fs/btrfs/locking.c | 5 +++--
> include/uapi/linux/btrfs_tree.h | 33 +++++++++++++++++++++++++++++++--
> 3 files changed, 44 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
> index f958eccff477..977ff160a024 100644
> --- a/fs/btrfs/accessors.h
> +++ b/fs/btrfs/accessors.h
> @@ -306,6 +306,16 @@ BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
> BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
> BTRFS_SETGET_STACK_FUNCS(stack_timespec_nsec, struct btrfs_timespec, nsec, 32);
>
> +BTRFS_SETGET_FUNCS(stripe_extent_encoding, struct btrfs_stripe_extent, encoding, 8);
What is encoding referring to?
> +BTRFS_SETGET_FUNCS(raid_stride_devid, struct btrfs_raid_stride, devid, 64);
> +BTRFS_SETGET_FUNCS(raid_stride_physical, struct btrfs_raid_stride, physical, 64);
> +BTRFS_SETGET_FUNCS(raid_stride_length, struct btrfs_raid_stride, length, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_stripe_extent_encoding,
> + struct btrfs_stripe_extent, encoding, 8);
> +BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_devid, struct btrfs_raid_stride, devid, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_physical, struct btrfs_raid_stride, physical, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_length, struct btrfs_raid_stride, length, 64);
> +
> /* struct btrfs_dev_extent */
> BTRFS_SETGET_FUNCS(dev_extent_chunk_tree, struct btrfs_dev_extent, chunk_tree, 64);
> BTRFS_SETGET_FUNCS(dev_extent_chunk_objectid, struct btrfs_dev_extent,
> diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
> index 6ac4fd8cc8dc..e7760d40feab 100644
> --- a/fs/btrfs/locking.c
> +++ b/fs/btrfs/locking.c
> @@ -58,8 +58,8 @@
>
> static struct btrfs_lockdep_keyset {
> u64 id; /* root objectid */
> - /* Longest entry: btrfs-block-group-00 */
> - char names[BTRFS_MAX_LEVEL][24];
> + /* Longest entry: btrfs-raid-stripe-tree-00 */
> + char names[BTRFS_MAX_LEVEL][25];
Length of "btrfs-raid-stripe-tree-00" is 25, there should be +1 for the
NUL, also length aligned to at least 4 is better.
> struct lock_class_key keys[BTRFS_MAX_LEVEL];
> } btrfs_lockdep_keysets[] = {
> { .id = BTRFS_ROOT_TREE_OBJECTID, DEFINE_NAME("root") },
> @@ -74,6 +74,7 @@ static struct btrfs_lockdep_keyset {
> { .id = BTRFS_UUID_TREE_OBJECTID, DEFINE_NAME("uuid") },
> { .id = BTRFS_FREE_SPACE_TREE_OBJECTID, DEFINE_NAME("free-space") },
> { .id = BTRFS_BLOCK_GROUP_TREE_OBJECTID, DEFINE_NAME("block-group") },
> + { .id = BTRFS_RAID_STRIPE_TREE_OBJECTID,DEFINE_NAME("raid-stripe-tree") },
The naming is without the "tree"
> { .id = 0, DEFINE_NAME("tree") },
> };
>
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index fc3c32186d7e..3fb758ce3ac0 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -4,9 +4,8 @@
>
> #include <linux/btrfs.h>
> #include <linux/types.h>
> -#ifdef __KERNEL__
> #include <linux/stddef.h>
> -#else
> +#ifndef __KERNEL__
> #include <stddef.h>
> #endif
>
> @@ -73,6 +72,9 @@
> /* Holds the block group items for extent tree v2. */
> #define BTRFS_BLOCK_GROUP_TREE_OBJECTID 11ULL
>
> +/* tracks RAID stripes in block groups. */
Tracks ...
> +#define BTRFS_RAID_STRIPE_TREE_OBJECTID 12ULL
> +
> /* device stats in the device tree */
> #define BTRFS_DEV_STATS_OBJECTID 0ULL
>
> @@ -285,6 +287,8 @@
> */
> #define BTRFS_QGROUP_RELATION_KEY 246
>
> +#define BTRFS_RAID_STRIPE_KEY 247
Any particular reason you chose 247 for the key number? It does not
leave any gap after BTRFS_QGROUP_RELATION_KEY and before
BTRFS_BALANCE_ITEM_KEY. If this is related to extents then please find
a more suitable group of keys to put it in.
> +
> /*
> * Obsolete name, see BTRFS_TEMPORARY_ITEM_KEY.
> */
> @@ -719,6 +723,31 @@ struct btrfs_free_space_header {
> __le64 num_bitmaps;
> } __attribute__ ((__packed__));
>
> +struct btrfs_raid_stride {
> + /* btrfs device-id this raid extent lives on */
Comments should be full sentences.
> + __le64 devid;
> + /* physical location on disk */
> + __le64 physical;
> + /* length of stride on this disk */
> + __le64 length;
> +};
__attribute__ ((__packed__));
> +
> +#define BTRFS_STRIPE_DUP 0
> +#define BTRFS_STRIPE_RAID0 1
> +#define BTRFS_STRIPE_RAID1 2
> +#define BTRFS_STRIPE_RAID1C3 3
> +#define BTRFS_STRIPE_RAID1C4 4
> +#define BTRFS_STRIPE_RAID5 5
> +#define BTRFS_STRIPE_RAID6 6
> +#define BTRFS_STRIPE_RAID10 7
This is probably defining the on-disk format, so some consistency is
desired. There are already the BTRFS_BLOCK_GROUP_* types, from which the
BTRFS_RAID_* values are derived, so the BTRFS_STRIPE_* values should
match that order and ideally the values themselves, if possible.
> +
> +struct btrfs_stripe_extent {
> + __u8 encoding;
> + __u8 reserved[7];
> + /* array of raid strides this stripe is composed of */
> + __DECLARE_FLEX_ARRAY(struct btrfs_raid_stride, strides);
Do we really want to declare that as __DECLARE_FLEX_ARRAY? It's not a
standard macro and obscures the definition.
> +};
> +
> #define BTRFS_HEADER_FLAG_WRITTEN (1ULL << 0)
> #define BTRFS_HEADER_FLAG_RELOC (1ULL << 1)
>
>
> --
> 2.41.0
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v8 08/11] btrfs: add raid stripe tree pretty printer
2023-09-11 12:52 ` [PATCH v8 08/11] btrfs: add raid stripe tree pretty printer Johannes Thumshirn
@ 2023-09-12 20:42 ` David Sterba
2023-09-13 5:34 ` Johannes Thumshirn
0 siblings, 1 reply; 39+ messages in thread
From: David Sterba @ 2023-09-12 20:42 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Chris Mason, Josef Bacik, David Sterba, Christoph Hellwig,
Naohiro Aota, Qu Wenruo, Damien Le Moal, linux-btrfs,
linux-kernel
On Mon, Sep 11, 2023 at 05:52:09AM -0700, Johannes Thumshirn wrote:
> Decode raid-stripe-tree entries on btrfs_print_tree().
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/print-tree.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 49 insertions(+)
>
> diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
> index 0c93439e929f..f01919e4bb37 100644
> --- a/fs/btrfs/print-tree.c
> +++ b/fs/btrfs/print-tree.c
> @@ -9,6 +9,7 @@
> #include "print-tree.h"
> #include "accessors.h"
> #include "tree-checker.h"
> +#include "raid-stripe-tree.h"
>
> struct root_name_map {
> u64 id;
> @@ -28,6 +29,7 @@ static const struct root_name_map root_map[] = {
> { BTRFS_FREE_SPACE_TREE_OBJECTID, "FREE_SPACE_TREE" },
> { BTRFS_BLOCK_GROUP_TREE_OBJECTID, "BLOCK_GROUP_TREE" },
> { BTRFS_DATA_RELOC_TREE_OBJECTID, "DATA_RELOC_TREE" },
> + { BTRFS_RAID_STRIPE_TREE_OBJECTID, "RAID_STRIPE_TREE" },
> };
>
> const char *btrfs_root_name(const struct btrfs_key *key, char *buf)
> @@ -189,6 +191,48 @@ static void print_uuid_item(const struct extent_buffer *l, unsigned long offset,
> }
> }
>
> +struct raid_encoding_map {
> + u8 encoding;
> + char name[16];
> +};
> +
> +static const struct raid_encoding_map raid_map[] = {
> + { BTRFS_STRIPE_DUP, "DUP" },
> + { BTRFS_STRIPE_RAID0, "RAID0" },
> + { BTRFS_STRIPE_RAID1, "RAID1" },
> + { BTRFS_STRIPE_RAID1C3, "RAID1C3" },
> + { BTRFS_STRIPE_RAID1C4, "RAID1C4" },
> + { BTRFS_STRIPE_RAID5, "RAID5" },
> + { BTRFS_STRIPE_RAID6, "RAID6" },
> + { BTRFS_STRIPE_RAID10, "RAID10" }
> +};
Instead of another table translating constants to raid names, can you
somehow utilize the btrfs_raid_array table? If the STRIPE values match
the RAID (the indexes to the table) you could add a simple wrapper.
> +
> +static const char *stripe_encoding_name(u8 encoding)
> +{
> + for (int i = 0; i < ARRAY_SIZE(raid_map); i++) {
> + if (raid_map[i].encoding == encoding)
> + return raid_map[i].name;
> + }
> +
> + return "UNKNOWN";
> +}
> +
> +static void print_raid_stripe_key(const struct extent_buffer *eb, u32 item_size,
> + struct btrfs_stripe_extent *stripe)
> +{
> + int num_stripes = btrfs_num_raid_stripes(item_size);
> + u8 encoding = btrfs_stripe_extent_encoding(eb, stripe);
> + int i;
> +
> + pr_info("\t\t\tencoding: %s\n", stripe_encoding_name(encoding));
> +
> + for (i = 0; i < num_stripes; i++)
> + pr_info("\t\t\tstride %d devid %llu physical %llu length %llu\n",
> + i, btrfs_raid_stride_devid(eb, &stripe->strides[i]),
> + btrfs_raid_stride_physical(eb, &stripe->strides[i]),
> + btrfs_raid_stride_length(eb, &stripe->strides[i]));
> +}
> +
> /*
> * Helper to output refs and locking status of extent buffer. Useful to debug
> * race condition related problems.
> @@ -349,6 +393,11 @@ void btrfs_print_leaf(const struct extent_buffer *l)
> print_uuid_item(l, btrfs_item_ptr_offset(l, i),
> btrfs_item_size(l, i));
> break;
> + case BTRFS_RAID_STRIPE_KEY:
> + print_raid_stripe_key(l, btrfs_item_size(l, i),
> + btrfs_item_ptr(l, i,
> + struct btrfs_stripe_extent));
> + break;
> }
> }
> }
>
> --
> 2.41.0
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v8 10/11] btrfs: add trace events for RST
2023-09-11 12:52 ` [PATCH v8 10/11] btrfs: add trace events for RST Johannes Thumshirn
@ 2023-09-12 20:46 ` David Sterba
0 siblings, 0 replies; 39+ messages in thread
From: David Sterba @ 2023-09-12 20:46 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Chris Mason, Josef Bacik, David Sterba, Christoph Hellwig,
Naohiro Aota, Qu Wenruo, Damien Le Moal, linux-btrfs,
linux-kernel
On Mon, Sep 11, 2023 at 05:52:11AM -0700, Johannes Thumshirn wrote:
> Add trace events for raid-stripe-tree operations.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/raid-stripe-tree.c | 8 +++++
> include/trace/events/btrfs.h | 75 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 83 insertions(+)
>
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index 7ed02e4b79ec..5a9952cf557c 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -62,6 +62,9 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
> if (found_end <= start)
> break;
>
> + trace_btrfs_raid_extent_delete(fs_info, start, end,
> + found_start, found_end);
> +
> ASSERT(found_start >= start && found_end <= end);
> ret = btrfs_del_item(trans, stripe_root, path);
> if (ret)
> @@ -120,6 +123,8 @@ static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
> return -ENOMEM;
> }
>
> + trace_btrfs_insert_one_raid_extent(fs_info, bioc->logical, bioc->size,
> + num_stripes);
> btrfs_set_stack_stripe_extent_encoding(stripe_extent, encoding);
> for (int i = 0; i < num_stripes; i++) {
> u64 devid = bioc->stripes[i].dev->devid;
> @@ -445,6 +450,9 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
>
> stripe->physical = physical + offset;
>
> + trace_btrfs_get_raid_extent_offset(fs_info, logical, *length,
> + stripe->physical, devid);
> +
> ret = 0;
> goto free_path;
> }
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index b2db2c2f1c57..e2c6f1199212 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -2497,6 +2497,81 @@ DEFINE_EVENT(btrfs_raid56_bio, raid56_write,
> TP_ARGS(rbio, bio, trace_info)
> );
>
> +TRACE_EVENT(btrfs_insert_one_raid_extent,
> +
> + TP_PROTO(struct btrfs_fs_info *fs_info, u64 logical, u64 length,
const struct fs_info
> + int num_stripes),
> +
> + TP_ARGS(fs_info, logical, length, num_stripes),
> +
> + TP_STRUCT__entry_btrfs(
> + __field( u64, logical )
> + __field( u64, length )
> + __field( int, num_stripes )
> + ),
> +
> + TP_fast_assign_btrfs(fs_info,
> + __entry->logical = logical;
> + __entry->length = length;
> + __entry->num_stripes = num_stripes;
> + ),
> +
> + TP_printk_btrfs("logical=%llu, length=%llu, num_stripes=%d",
> + __entry->logical, __entry->length,
> + __entry->num_stripes)
Tracepoint messages should follow the formatting guidelines
https://btrfs.readthedocs.io/en/latest/dev/Development-notes.html#tracepoints
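Assuming the relevant guideline is the space-separated key=value style
used by the existing btrfs tracepoints (no commas in the format string),
the message would become something like this sketch:

	TP_printk_btrfs("logical=%llu length=%llu num_stripes=%d",
			__entry->logical, __entry->length,
			__entry->num_stripes)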
> +);
> +
> +TRACE_EVENT(btrfs_raid_extent_delete,
> +
> + TP_PROTO(struct btrfs_fs_info *fs_info, u64 start, u64 end,
> + u64 found_start, u64 found_end),
> +
> + TP_ARGS(fs_info, start, end, found_start, found_end),
> +
> + TP_STRUCT__entry_btrfs(
> + __field( u64, start )
> + __field( u64, end )
> + __field( u64, found_start )
> + __field( u64, found_end )
> + ),
> +
> + TP_fast_assign_btrfs(fs_info,
> + __entry->start = start;
> + __entry->end = end;
> + __entry->found_start = found_start;
> + __entry->found_end = found_end;
Tracepoints follow the fancy spacing and alignment in the assign blocks.
> + ),
> +
> + TP_printk_btrfs("start=%llu, end=%llu, found_start=%llu, found_end=%llu",
> + __entry->start, __entry->end, __entry->found_start,
> + __entry->found_end)
> +);
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v8 07/11] btrfs: zoned: allow zoned RAID
2023-09-11 12:52 ` [PATCH v8 07/11] btrfs: zoned: allow zoned RAID Johannes Thumshirn
@ 2023-09-12 20:49 ` David Sterba
2023-09-13 5:41 ` Johannes Thumshirn
0 siblings, 1 reply; 39+ messages in thread
From: David Sterba @ 2023-09-12 20:49 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Chris Mason, Josef Bacik, David Sterba, Christoph Hellwig,
Naohiro Aota, Qu Wenruo, Damien Le Moal, linux-btrfs,
linux-kernel
On Mon, Sep 11, 2023 at 05:52:08AM -0700, Johannes Thumshirn wrote:
> When we have a raid-stripe-tree, we can do RAID0/1/10 on zoned devices for
> data block-groups. For meta-data block-groups, we don't actually need
> anything special, as all meta-data I/O is protected by the
> btrfs_zoned_meta_io_lock() already.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/raid-stripe-tree.h | 7 ++-
> fs/btrfs/volumes.c | 2 +
> fs/btrfs/zoned.c | 113 +++++++++++++++++++++++++++++++++++++++++++-
> 3 files changed, 119 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
> index 40aa553ae8aa..30c7d5981890 100644
> --- a/fs/btrfs/raid-stripe-tree.h
> +++ b/fs/btrfs/raid-stripe-tree.h
> @@ -8,6 +8,11 @@
>
> #include "disk-io.h"
>
> +#define BTRFS_RST_SUPP_BLOCK_GROUP_MASK (BTRFS_BLOCK_GROUP_DUP |\
> + BTRFS_BLOCK_GROUP_RAID1_MASK |\
> + BTRFS_BLOCK_GROUP_RAID0 |\
> + BTRFS_BLOCK_GROUP_RAID10)
> +
> struct btrfs_io_context;
> struct btrfs_io_stripe;
>
> @@ -32,7 +37,7 @@ static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
> if (type != BTRFS_BLOCK_GROUP_DATA)
> return false;
>
> - if (profile & BTRFS_BLOCK_GROUP_RAID1_MASK)
> + if (profile & BTRFS_RST_SUPP_BLOCK_GROUP_MASK)
> return true;
>
> return false;
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 7c25f5c77788..9f17e5f290f4 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6438,6 +6438,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> * I/O context structure.
> */
> if (smap && num_alloc_stripes == 1 &&
> + !(btrfs_need_stripe_tree_update(fs_info, map->type) &&
> + op != BTRFS_MAP_READ) &&
> !((map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) && mirror_num > 1)) {
> ret = set_io_stripe(fs_info, op, logical, length, smap, map,
> stripe_index, stripe_offset, stripe_nr);
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index c6eedf4bfba9..4ca36875058c 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -1481,8 +1481,9 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
> set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &cache->runtime_flags);
> break;
> case BTRFS_BLOCK_GROUP_DUP:
> - if (map->type & BTRFS_BLOCK_GROUP_DATA) {
> - btrfs_err(fs_info, "zoned: profile DUP not yet supported on data bg");
> + if (map->type & BTRFS_BLOCK_GROUP_DATA &&
> + !btrfs_stripe_tree_root(fs_info)) {
> + btrfs_err(fs_info, "zoned: data DUP profile needs stripe_root");
> ret = -EINVAL;
> goto out;
> }
> @@ -1520,8 +1521,116 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
> cache->zone_capacity = min(caps[0], caps[1]);
> break;
> case BTRFS_BLOCK_GROUP_RAID1:
> + case BTRFS_BLOCK_GROUP_RAID1C3:
> + case BTRFS_BLOCK_GROUP_RAID1C4:
This
> + if (map->type & BTRFS_BLOCK_GROUP_DATA &&
> + !btrfs_stripe_tree_root(fs_info)) {
> + btrfs_err(fs_info,
> + "zoned: data %s needs stripe_root",
> + btrfs_bg_type_to_raid_name(map->type));
> + ret = -EIO;
> + goto out;
> +
> + }
> +
> + for (i = 0; i < map->num_stripes; i++) {
> + if (alloc_offsets[i] == WP_MISSING_DEV ||
> + alloc_offsets[i] == WP_CONVENTIONAL)
> + continue;
> +
> + if ((alloc_offsets[0] != alloc_offsets[i]) &&
> + !btrfs_test_opt(fs_info, DEGRADED)) {
> + btrfs_err(fs_info,
> + "zoned: write pointer offset mismatch of zones in %s profile",
> + btrfs_bg_type_to_raid_name(map->type));
> + ret = -EIO;
> + goto out;
> + }
> + if (test_bit(0, active) != test_bit(i, active)) {
> + if (!btrfs_test_opt(fs_info, DEGRADED) &&
> + !btrfs_zone_activate(cache)) {
> + ret = -EIO;
> + goto out;
> + }
> + } else {
> + if (test_bit(0, active))
> + set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
> + &cache->runtime_flags);
> + }
> + /*
> + * In case a device is missing we have a cap of 0, so don't
> + * use it.
> + */
> + cache->zone_capacity = min_not_zero(caps[0], caps[i]);
> + }
> +
> + if (alloc_offsets[0] != WP_MISSING_DEV)
> + cache->alloc_offset = alloc_offsets[0];
> + else
> + cache->alloc_offset = alloc_offsets[i - 1];
whole block
> + break;
> case BTRFS_BLOCK_GROUP_RAID0:
and
> + if (map->type & BTRFS_BLOCK_GROUP_DATA &&
> + !btrfs_stripe_tree_root(fs_info)) {
> + btrfs_err(fs_info,
> + "zoned: data %s needs stripe_root",
> + btrfs_bg_type_to_raid_name(map->type));
> + ret = -EIO;
> + goto out;
> +
> + }
> + for (i = 0; i < map->num_stripes; i++) {
> + if (alloc_offsets[i] == WP_MISSING_DEV ||
> + alloc_offsets[i] == WP_CONVENTIONAL)
> + continue;
> +
> + if (test_bit(0, active) != test_bit(i, active)) {
> + if (!btrfs_zone_activate(cache)) {
> + ret = -EIO;
> + goto out;
> + }
> + } else {
> + if (test_bit(0, active))
> + set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
> + &cache->runtime_flags);
> + }
> + cache->zone_capacity += caps[i];
> + cache->alloc_offset += alloc_offsets[i];
> +
> + }
> + break;
> case BTRFS_BLOCK_GROUP_RAID10:
> + if (map->type & BTRFS_BLOCK_GROUP_DATA &&
> + !btrfs_stripe_tree_root(fs_info)) {
> + btrfs_err(fs_info,
> + "zoned: data %s needs stripe_root",
> + btrfs_bg_type_to_raid_name(map->type));
> + ret = -EIO;
> + goto out;
> +
> + }
> + for (i = 0; i < map->num_stripes; i++) {
> + if (alloc_offsets[i] == WP_MISSING_DEV ||
> + alloc_offsets[i] == WP_CONVENTIONAL)
> + continue;
> +
> + if (test_bit(0, active) != test_bit(i, active)) {
> + if (!btrfs_zone_activate(cache)) {
> + ret = -EIO;
> + goto out;
> + }
> + } else {
> + if (test_bit(0, active))
> + set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
> + &cache->runtime_flags);
> + }
> + if ((i % map->sub_stripes) == 0) {
> + cache->zone_capacity += caps[i];
> + cache->alloc_offset += alloc_offsets[i];
> + }
> +
> + }
> + break;
These seem quite long and nested for a single case statement, can they be
factored out into helpers?
> case BTRFS_BLOCK_GROUP_RAID5:
> case BTRFS_BLOCK_GROUP_RAID6:
> /* non-single profiles are not supported yet */
>
> --
> 2.41.0
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v8 08/11] btrfs: add raid stripe tree pretty printer
2023-09-12 20:42 ` David Sterba
@ 2023-09-13 5:34 ` Johannes Thumshirn
0 siblings, 0 replies; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-13 5:34 UTC (permalink / raw)
To: dsterba@suse.cz
Cc: Chris Mason, Josef Bacik, David Sterba, Christoph Hellwig,
Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 12.09.23 22:42, David Sterba wrote:
>> +struct raid_encoding_map {
>> + u8 encoding;
>> + char name[16];
>> +};
>> +
>> +static const struct raid_encoding_map raid_map[] = {
>> + { BTRFS_STRIPE_DUP, "DUP" },
>> + { BTRFS_STRIPE_RAID0, "RAID0" },
>> + { BTRFS_STRIPE_RAID1, "RAID1" },
>> + { BTRFS_STRIPE_RAID1C3, "RAID1C3" },
>> + { BTRFS_STRIPE_RAID1C4, "RAID1C4" },
>> + { BTRFS_STRIPE_RAID5, "RAID5" },
>> + { BTRFS_STRIPE_RAID6, "RAID6" },
>> + { BTRFS_STRIPE_RAID10, "RAID10" }
>> +};
>
> Instead of another table tranlating constants to raid names, can you
> somehow utilize the btrfs_raid_array table? If the STRIPE values match
> the RAID (the indexes to the table) you could add a simple wrapper.
Sure.
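As a minimal sketch of what that wrapper could look like (assuming the
BTRFS_STRIPE_* values end up matching the btrfs_raid_types indexes that
btrfs_raid_array is keyed by, which is not yet the case in v8):

static const char *stripe_encoding_name(u8 encoding)
{
	/* Only valid once the on-disk encoding mirrors enum btrfs_raid_types. */
	if (encoding >= BTRFS_NR_RAID_TYPES)
		return "UNKNOWN";

	return btrfs_raid_array[encoding].raid_name;
}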
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v8 07/11] btrfs: zoned: allow zoned RAID
2023-09-12 20:49 ` David Sterba
@ 2023-09-13 5:41 ` Johannes Thumshirn
2023-09-13 14:52 ` David Sterba
0 siblings, 1 reply; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-13 5:41 UTC (permalink / raw)
To: dsterba@suse.cz
Cc: Chris Mason, Josef Bacik, David Sterba, Christoph Hellwig,
Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 12.09.23 22:49, David Sterba wrote:
> On Mon, Sep 11, 2023 at 05:52:08AM -0700, Johannes Thumshirn wrote:
>> When we have a raid-stripe-tree, we can do RAID0/1/10 on zoned devices for
>> data block-groups. For meta-data block-groups, we don't actually need
>> anything special, as all meta-data I/O is protected by the
>> btrfs_zoned_meta_io_lock() already.
>>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> ---
>> fs/btrfs/raid-stripe-tree.h | 7 ++-
>> fs/btrfs/volumes.c | 2 +
>> fs/btrfs/zoned.c | 113 +++++++++++++++++++++++++++++++++++++++++++-
>> 3 files changed, 119 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
>> index 40aa553ae8aa..30c7d5981890 100644
>> --- a/fs/btrfs/raid-stripe-tree.h
>> +++ b/fs/btrfs/raid-stripe-tree.h
>> @@ -8,6 +8,11 @@
>>
>> #include "disk-io.h"
>>
>> +#define BTRFS_RST_SUPP_BLOCK_GROUP_MASK (BTRFS_BLOCK_GROUP_DUP |\
>> + BTRFS_BLOCK_GROUP_RAID1_MASK |\
>> + BTRFS_BLOCK_GROUP_RAID0 |\
>> + BTRFS_BLOCK_GROUP_RAID10)
>> +
>> struct btrfs_io_context;
>> struct btrfs_io_stripe;
>>
>> @@ -32,7 +37,7 @@ static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
>> if (type != BTRFS_BLOCK_GROUP_DATA)
>> return false;
>>
>> - if (profile & BTRFS_BLOCK_GROUP_RAID1_MASK)
>> + if (profile & BTRFS_RST_SUPP_BLOCK_GROUP_MASK)
>> return true;
>>
>> return false;
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 7c25f5c77788..9f17e5f290f4 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -6438,6 +6438,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>> * I/O context structure.
>> */
>> if (smap && num_alloc_stripes == 1 &&
>> + !(btrfs_need_stripe_tree_update(fs_info, map->type) &&
>> + op != BTRFS_MAP_READ) &&
>> !((map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) && mirror_num > 1)) {
>> ret = set_io_stripe(fs_info, op, logical, length, smap, map,
>> stripe_index, stripe_offset, stripe_nr);
>> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
>> index c6eedf4bfba9..4ca36875058c 100644
>> --- a/fs/btrfs/zoned.c
>> +++ b/fs/btrfs/zoned.c
>> @@ -1481,8 +1481,9 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
>> set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &cache->runtime_flags);
>> break;
>> case BTRFS_BLOCK_GROUP_DUP:
>> - if (map->type & BTRFS_BLOCK_GROUP_DATA) {
>> - btrfs_err(fs_info, "zoned: profile DUP not yet supported on data bg");
>> + if (map->type & BTRFS_BLOCK_GROUP_DATA &&
>> + !btrfs_stripe_tree_root(fs_info)) {
>> + btrfs_err(fs_info, "zoned: data DUP profile needs stripe_root");
>> ret = -EINVAL;
>> goto out;
>> }
>> @@ -1520,8 +1521,116 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
>> cache->zone_capacity = min(caps[0], caps[1]);
>> break;
>> case BTRFS_BLOCK_GROUP_RAID1:
>> + case BTRFS_BLOCK_GROUP_RAID1C3:
>> + case BTRFS_BLOCK_GROUP_RAID1C4:
>
> This
>
>> + if (map->type & BTRFS_BLOCK_GROUP_DATA &&
>> + !btrfs_stripe_tree_root(fs_info)) {
>> + btrfs_err(fs_info,
>> + "zoned: data %s needs stripe_root",
>> + btrfs_bg_type_to_raid_name(map->type));
>> + ret = -EIO;
>> + goto out;
>> +
>> + }
>> +
>> + for (i = 0; i < map->num_stripes; i++) {
>> + if (alloc_offsets[i] == WP_MISSING_DEV ||
>> + alloc_offsets[i] == WP_CONVENTIONAL)
>> + continue;
>> +
>> + if ((alloc_offsets[0] != alloc_offsets[i]) &&
>> + !btrfs_test_opt(fs_info, DEGRADED)) {
>> + btrfs_err(fs_info,
>> + "zoned: write pointer offset mismatch of zones in %s profile",
>> + btrfs_bg_type_to_raid_name(map->type));
>> + ret = -EIO;
>> + goto out;
>> + }
>> + if (test_bit(0, active) != test_bit(i, active)) {
>> + if (!btrfs_test_opt(fs_info, DEGRADED) &&
>> + !btrfs_zone_activate(cache)) {
>> + ret = -EIO;
>> + goto out;
>> + }
>> + } else {
>> + if (test_bit(0, active))
>> + set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
>> + &cache->runtime_flags);
>> + }
>> + /*
>> + * In case a device is missing we have a cap of 0, so don't
>> + * use it.
>> + */
>> + cache->zone_capacity = min_not_zero(caps[0], caps[i]);
>> + }
>> +
>> + if (alloc_offsets[0] != WP_MISSING_DEV)
>> + cache->alloc_offset = alloc_offsets[0];
>> + else
>> + cache->alloc_offset = alloc_offsets[i - 1];
>
> whole block
>
>> + break;
>> case BTRFS_BLOCK_GROUP_RAID0:
>
> and
>
>> + if (map->type & BTRFS_BLOCK_GROUP_DATA &&
>> + !btrfs_stripe_tree_root(fs_info)) {
>> + btrfs_err(fs_info,
>> + "zoned: data %s needs stripe_root",
>> + btrfs_bg_type_to_raid_name(map->type));
>> + ret = -EIO;
>> + goto out;
>> +
>> + }
>> + for (i = 0; i < map->num_stripes; i++) {
>> + if (alloc_offsets[i] == WP_MISSING_DEV ||
>> + alloc_offsets[i] == WP_CONVENTIONAL)
>> + continue;
>> +
>> + if (test_bit(0, active) != test_bit(i, active)) {
>> + if (!btrfs_zone_activate(cache)) {
>> + ret = -EIO;
>> + goto out;
>> + }
>> + } else {
>> + if (test_bit(0, active))
>> + set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
>> + &cache->runtime_flags);
>> + }
>> + cache->zone_capacity += caps[i];
>> + cache->alloc_offset += alloc_offsets[i];
>> +
>> + }
>> + break;
>> case BTRFS_BLOCK_GROUP_RAID10:
>> + if (map->type & BTRFS_BLOCK_GROUP_DATA &&
>> + !btrfs_stripe_tree_root(fs_info)) {
>> + btrfs_err(fs_info,
>> + "zoned: data %s needs stripe_root",
>> + btrfs_bg_type_to_raid_name(map->type));
>> + ret = -EIO;
>> + goto out;
>> +
>> + }
>> + for (i = 0; i < map->num_stripes; i++) {
>> + if (alloc_offsets[i] == WP_MISSING_DEV ||
>> + alloc_offsets[i] == WP_CONVENTIONAL)
>> + continue;
>> +
>> + if (test_bit(0, active) != test_bit(i, active)) {
>> + if (!btrfs_zone_activate(cache)) {
>> + ret = -EIO;
>> + goto out;
>> + }
>> + } else {
>> + if (test_bit(0, active))
>> + set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
>> + &cache->runtime_flags);
>> + }
>> + if ((i % map->sub_stripes) == 0) {
>> + cache->zone_capacity += caps[i];
>> + cache->alloc_offset += alloc_offsets[i];
>> + }
>> +
>> + }
>> + break;
>
> Seem to be quite long and nested for a case, can they be factored to
> helpers?
Sure, but I'd love to have
https://lore.kernel.org/all/20230605085108.580976-1-hch@lst.de/
pulled in first. This patchset handles (among other things) the DUP and
single cases as well.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v8 01/11] btrfs: add raid stripe tree definitions
2023-09-12 20:32 ` David Sterba
@ 2023-09-13 6:02 ` Johannes Thumshirn
2023-09-13 14:49 ` David Sterba
0 siblings, 1 reply; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-13 6:02 UTC (permalink / raw)
To: dsterba@suse.cz
Cc: Chris Mason, Josef Bacik, David Sterba, Christoph Hellwig,
Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 12.09.23 22:32, David Sterba wrote:
>> @@ -306,6 +306,16 @@ BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
>> BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
>> BTRFS_SETGET_STACK_FUNCS(stack_timespec_nsec, struct btrfs_timespec, nsec, 32);
>>
>> +BTRFS_SETGET_FUNCS(stripe_extent_encoding, struct btrfs_stripe_extent, encoding, 8);
>
> What is encoding referring to?
At the moment (only) the RAID type. But in the future it can be expanded
to all kinds of encodings, like Reed-Solomon, Butterfly-Codes, etc...
>> static struct btrfs_lockdep_keyset {
>> u64 id; /* root objectid */
>> - /* Longest entry: btrfs-block-group-00 */
>> - char names[BTRFS_MAX_LEVEL][24];
>> + /* Longest entry: btrfs-raid-stripe-tree-00 */
>> + char names[BTRFS_MAX_LEVEL][25];
>
> Length of "btrfs-raid-stripe-tree-00" is 25, there should be +1 for the
> NUL, also length aligned to at least 4 is better.
>
OK.
>> struct lock_class_key keys[BTRFS_MAX_LEVEL];
>> } btrfs_lockdep_keysets[] = {
>> { .id = BTRFS_ROOT_TREE_OBJECTID, DEFINE_NAME("root") },
>> @@ -74,6 +74,7 @@ static struct btrfs_lockdep_keyset {
>> { .id = BTRFS_UUID_TREE_OBJECTID, DEFINE_NAME("uuid") },
>> { .id = BTRFS_FREE_SPACE_TREE_OBJECTID, DEFINE_NAME("free-space") },
>> { .id = BTRFS_BLOCK_GROUP_TREE_OBJECTID, DEFINE_NAME("block-group") },
>> + { .id = BTRFS_RAID_STRIPE_TREE_OBJECTID,DEFINE_NAME("raid-stripe-tree") },
>
> The naming is without the "tree"
OK
>> @@ -73,6 +72,9 @@
>> /* Holds the block group items for extent tree v2. */
>> #define BTRFS_BLOCK_GROUP_TREE_OBJECTID 11ULL
>>
>> +/* tracks RAID stripes in block groups. */
>
> Tracks ...
>
OK
>> +#define BTRFS_RAID_STRIPE_TREE_OBJECTID 12ULL
>> +
>> /* device stats in the device tree */
>> #define BTRFS_DEV_STATS_OBJECTID 0ULL
>>
>> @@ -285,6 +287,8 @@
>> */
>> #define BTRFS_QGROUP_RELATION_KEY 246
>>
>> +#define BTRFS_RAID_STRIPE_KEY 247
>
> Any particular reason you chose 247 for the key number? It does not
> leave any gap after BTRFS_QGROUP_RELATION_KEY and before
> BTRFS_BALANCE_ITEM_KEY. If this is related to extents then please find
> more suitable group of keys where to put it.
Nope, it was just the last free spot.
>
>> +
>> /*
>> * Obsolete name, see BTRFS_TEMPORARY_ITEM_KEY.
>> */
>> @@ -719,6 +723,31 @@ struct btrfs_free_space_header {
>> __le64 num_bitmaps;
>> } __attribute__ ((__packed__));
>>
>> +struct btrfs_raid_stride {
>> + /* btrfs device-id this raid extent lives on */
>
> Comments should be full sentences.
OK
>
>> + __le64 devid;
>> + /* physical location on disk */
>> + __le64 physical;
>> + /* length of stride on this disk */
>> + __le64 length;
>> +};
>
> __attribute__ ((__packed__));
The structure doesn't have any holes in it, so packed is not needed.
I might be misinformed, but doesn't packed potentially lead to bad code
generation on some platforms? I've always been under the impression that
packed forces the compiler to do byte-wise loads and stores.
>
>> +
>> +#define BTRFS_STRIPE_DUP 0
>> +#define BTRFS_STRIPE_RAID0 1
>> +#define BTRFS_STRIPE_RAID1 2
>> +#define BTRFS_STRIPE_RAID1C3 3
>> +#define BTRFS_STRIPE_RAID1C4 4
>> +#define BTRFS_STRIPE_RAID5 5
>> +#define BTRFS_STRIPE_RAID6 6
>> +#define BTRFS_STRIPE_RAID10 7
>
> This is probably defining the on-disk format so some consistency is
> desired, there are already the BTRFS_BLOCK_GROUP_* types, from which the
> BTRFS_RAID_* are derive, so the BTRFS_STRIPE_* values should match the
> order and ideally the values themselves if possible.
>
>> +
>> +struct btrfs_stripe_extent {
>> + __u8 encoding;
>> + __u8 reserved[7];
>> + /* array of raid strides this stripe is composed of */
>> + __DECLARE_FLEX_ARRAY(struct btrfs_raid_stride, strides);
>
> Do we really whant to declare that as __DECLARE_FLEX_ARRAY? It's not a
> standard macro and obscures the definition.
>
Indeed we do not anymore, as this version does introduce another u64
before the strides array! I'll gladly get rid of it.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v8 06/11] btrfs: implement RST version of scrub
2023-09-11 12:52 ` [PATCH v8 06/11] btrfs: implement RST version of scrub Johannes Thumshirn
@ 2023-09-13 9:51 ` Qu Wenruo
2023-09-13 16:59 ` David Sterba
1 sibling, 0 replies; 39+ messages in thread
From: Qu Wenruo @ 2023-09-13 9:51 UTC (permalink / raw)
To: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel
On 2023/9/11 22:22, Johannes Thumshirn wrote:
> A filesystem that uses the RAID stripe tree for logical to physical
> address translation can't use the regular scrub path, that reads all
> stripes and then checks if a sector is unused afterwards.
>
> When using the RAID stripe tree, this will result in lookup errors, as the
> stripe tree doesn't know the requested logical addresses.
>
> Instead, look up stripes that are backed by the extent bitmap.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/scrub.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 56 insertions(+)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index f16220ce5fba..5101e0a3f83e 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -23,6 +23,7 @@
> #include "accessors.h"
> #include "file-item.h"
> #include "scrub.h"
> +#include "raid-stripe-tree.h"
>
> /*
> * This is only the first step towards a full-features scrub. It reads all
> @@ -1634,6 +1635,56 @@ static void scrub_reset_stripe(struct scrub_stripe *stripe)
> }
> }
>
> +static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
> + struct scrub_stripe *stripe)
> +{
> + struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
> + struct btrfs_bio *bbio = NULL;
> + int mirror = stripe->mirror_num;
> + int i;
> +
> + atomic_inc(&stripe->pending_io);
> +
> + for_each_set_bit(i, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
> + struct page *page;
> + int pgoff;
> +
> + page = scrub_stripe_get_page(stripe, i);
> + pgoff = scrub_stripe_get_page_offset(stripe, i);
> +
> + /* The current sector cannot be merged, submit the bio. */
> + if (bbio &&
> + ((i > 0 && !test_bit(i - 1, &stripe->extent_sector_bitmap)) ||
> + bbio->bio.bi_iter.bi_size >= BTRFS_STRIPE_LEN)) {
> + ASSERT(bbio->bio.bi_iter.bi_size);
> + atomic_inc(&stripe->pending_io);
> + btrfs_submit_bio(bbio, mirror);
> + bbio = NULL;
> + }
> +
> + if (!bbio) {
> + bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
> + fs_info, scrub_read_endio, stripe);
> + bbio->bio.bi_iter.bi_sector = (stripe->logical +
> + (i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT;
> + }
> +
> + __bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff);
> + }
> +
> + if (bbio) {
> + ASSERT(bbio->bio.bi_iter.bi_size);
> + atomic_inc(&stripe->pending_io);
> + btrfs_submit_bio(bbio, mirror);
The RST is looked up during btrfs_submit_bio() (to be more accurate, in
set_io_stripe()), and I just checked that there is nothing making btrfs
do that lookup using the commit root.
This means we can have a problem where the extent items and the RST are
out of sync.
For scrub, all the extent items are searched using the commit root, but
btrfs_get_raid_extent_offset() is only using the current root.
Thus you would get problems when running fsstress and scrub together.
We need some way to distinguish scrub bbios from regular ones (which is
a completely new requirement).
For now only scrub doesn't initialize bbio->inode, thus that can be used
to tell them apart (at least for now).
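A very rough sketch of the commit-root lookup described here, in the path
setup of btrfs_get_raid_extent_offset(); the for_scrub flag is
hypothetical, and how the scrub origin gets plumbed down to this point is
exactly the open question:

	struct btrfs_path *path;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;

	/* Hypothetical: derived e.g. from bbio->inode being NULL for scrub. */
	if (for_scrub) {
		/* Match scrub's commit-root view of the extent items. */
		path->search_commit_root = 1;
		path->skip_locking = 1;
	}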
Thanks,
Qu
> + }
> +
> + if (atomic_dec_and_test(&stripe->pending_io)) {
> + wake_up(&stripe->io_wait);
> + INIT_WORK(&stripe->work, scrub_stripe_read_repair_worker);
> + queue_work(stripe->bg->fs_info->scrub_workers, &stripe->work);
> + }
> +}
> +
> static void scrub_submit_initial_read(struct scrub_ctx *sctx,
> struct scrub_stripe *stripe)
> {
> @@ -1645,6 +1696,11 @@ static void scrub_submit_initial_read(struct scrub_ctx *sctx,
> ASSERT(stripe->mirror_num > 0);
> ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state));
>
> + if (btrfs_need_stripe_tree_update(fs_info, stripe->bg->flags)) {
> + scrub_submit_extent_sector_read(sctx, stripe);
> + return;
> + }
> +
> bbio = btrfs_bio_alloc(SCRUB_STRIPE_PAGES, REQ_OP_READ, fs_info,
> scrub_read_endio, stripe);
>
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v8 01/11] btrfs: add raid stripe tree definitions
2023-09-13 6:02 ` Johannes Thumshirn
@ 2023-09-13 14:49 ` David Sterba
2023-09-13 14:57 ` Johannes Thumshirn
0 siblings, 1 reply; 39+ messages in thread
From: David Sterba @ 2023-09-13 14:49 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: dsterba@suse.cz, Chris Mason, Josef Bacik, David Sterba,
Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On Wed, Sep 13, 2023 at 06:02:09AM +0000, Johannes Thumshirn wrote:
> On 12.09.23 22:32, David Sterba wrote:
> >> @@ -306,6 +306,16 @@ BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
> >> BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
> >> BTRFS_SETGET_STACK_FUNCS(stack_timespec_nsec, struct btrfs_timespec, nsec, 32);
> >>
> >> +BTRFS_SETGET_FUNCS(stripe_extent_encoding, struct btrfs_stripe_extent, encoding, 8);
> >
> > What is encoding referring to?
>
> At the moment (only) the RAID type. But in the future it can be expanded
> to all kinds of encodings, like Reed-Solomon, Butterfly-Codes, etc...
I see, could it be better called ECC? Like stripe_extent_ecc, that would
make it clear it's for the correction; "encoding" sounds too generic.
> >> + __le64 devid;
> >> + /* physical location on disk */
> >> + __le64 physical;
> >> + /* length of stride on this disk */
> >> + __le64 length;
> >> +};
> >
> > __attribute__ ((__packed__));
>
> The structure doesn't have any holes in it so packed is not needed.
>
> I might also be misinformed, but doesn't packed potentially lead to bad
> code generation on some platforms? I've always been under the
> impression that packed forces the compiler to do byte-wise loads and
> stores. But as I've said, I might be misinformed.
All on-disk structures have the packed attribute, so for consistency and
future safety it should be here too, even if it technically does not
need it due to alignment. In addition, structures that need padding
would also be problematic, e.g. a u64 followed by a u32 needs 4 bytes of
padding, but the next item on disk is placed right after the u32.
It's right that on some platforms unaligned access takes more code, but
for the same reason we can't let the compiler decide the layout on such
platforms when the structure is directly mapped onto the on-disk blocks.
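To illustrate the padding point with a hypothetical (non-btrfs)
structure:

struct example_item {
	__le64 big;
	__le32 small;
} __attribute__ ((__packed__));

/*
 * Without the packed attribute, sizeof(struct example_item) is 16: the
 * compiler adds 4 bytes of trailing padding to keep 8-byte alignment.
 * With it, the size is 12 and matches the on-disk byte stream, where
 * the next item starts right after the __le32.
 */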
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v8 07/11] btrfs: zoned: allow zoned RAID
2023-09-13 5:41 ` Johannes Thumshirn
@ 2023-09-13 14:52 ` David Sterba
2023-09-13 14:59 ` Johannes Thumshirn
0 siblings, 1 reply; 39+ messages in thread
From: David Sterba @ 2023-09-13 14:52 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: dsterba@suse.cz, Chris Mason, Josef Bacik, David Sterba,
Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On Wed, Sep 13, 2023 at 05:41:52AM +0000, Johannes Thumshirn wrote:
> On 12.09.23 22:49, David Sterba wrote:
> > On Mon, Sep 11, 2023 at 05:52:08AM -0700, Johannes Thumshirn wrote:
> >> When we have a raid-stripe-tree, we can do RAID0/1/10 on zoned devices for
> >> data block-groups. For meta-data block-groups, we don't actually need
> >> anything special, as all meta-data I/O is protected by the
> >> btrfs_zoned_meta_io_lock() already.
> >>
> >> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> >> ---
> >> fs/btrfs/raid-stripe-tree.h | 7 ++-
> >> fs/btrfs/volumes.c | 2 +
> >> fs/btrfs/zoned.c | 113 +++++++++++++++++++++++++++++++++++++++++++-
> >> 3 files changed, 119 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
> >> index 40aa553ae8aa..30c7d5981890 100644
> >> --- a/fs/btrfs/raid-stripe-tree.h
> >> +++ b/fs/btrfs/raid-stripe-tree.h
> >> @@ -8,6 +8,11 @@
> >>
> >> #include "disk-io.h"
> >>
> >> +#define BTRFS_RST_SUPP_BLOCK_GROUP_MASK (BTRFS_BLOCK_GROUP_DUP |\
> >> + BTRFS_BLOCK_GROUP_RAID1_MASK |\
> >> + BTRFS_BLOCK_GROUP_RAID0 |\
> >> + BTRFS_BLOCK_GROUP_RAID10)
> >> +
> >> struct btrfs_io_context;
> >> struct btrfs_io_stripe;
> >>
> >> @@ -32,7 +37,7 @@ static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
> >> if (type != BTRFS_BLOCK_GROUP_DATA)
> >> return false;
> >>
> >> - if (profile & BTRFS_BLOCK_GROUP_RAID1_MASK)
> >> + if (profile & BTRFS_RST_SUPP_BLOCK_GROUP_MASK)
> >> return true;
> >>
> >> return false;
> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> >> index 7c25f5c77788..9f17e5f290f4 100644
> >> --- a/fs/btrfs/volumes.c
> >> +++ b/fs/btrfs/volumes.c
> >> @@ -6438,6 +6438,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> >> * I/O context structure.
> >> */
> >> if (smap && num_alloc_stripes == 1 &&
> >> + !(btrfs_need_stripe_tree_update(fs_info, map->type) &&
> >> + op != BTRFS_MAP_READ) &&
> >> !((map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) && mirror_num > 1)) {
> >> ret = set_io_stripe(fs_info, op, logical, length, smap, map,
> >> stripe_index, stripe_offset, stripe_nr);
> >> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> >> index c6eedf4bfba9..4ca36875058c 100644
> >> --- a/fs/btrfs/zoned.c
> >> +++ b/fs/btrfs/zoned.c
> >> @@ -1481,8 +1481,9 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
> >> set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &cache->runtime_flags);
> >> break;
> >> case BTRFS_BLOCK_GROUP_DUP:
> >> - if (map->type & BTRFS_BLOCK_GROUP_DATA) {
> >> - btrfs_err(fs_info, "zoned: profile DUP not yet supported on data bg");
> >> + if (map->type & BTRFS_BLOCK_GROUP_DATA &&
> >> + !btrfs_stripe_tree_root(fs_info)) {
> >> + btrfs_err(fs_info, "zoned: data DUP profile needs stripe_root");
> >> ret = -EINVAL;
> >> goto out;
> >> }
> >> @@ -1520,8 +1521,116 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
> >> cache->zone_capacity = min(caps[0], caps[1]);
> >> break;
> >> case BTRFS_BLOCK_GROUP_RAID1:
> >> + case BTRFS_BLOCK_GROUP_RAID1C3:
> >> + case BTRFS_BLOCK_GROUP_RAID1C4:
> >
> > This
> >
> >> + if (map->type & BTRFS_BLOCK_GROUP_DATA &&
> >> + !btrfs_stripe_tree_root(fs_info)) {
> >> + btrfs_err(fs_info,
> >> + "zoned: data %s needs stripe_root",
> >> + btrfs_bg_type_to_raid_name(map->type));
> >> + ret = -EIO;
> >> + goto out;
> >> +
> >> + }
> >> +
> >> + for (i = 0; i < map->num_stripes; i++) {
> >> + if (alloc_offsets[i] == WP_MISSING_DEV ||
> >> + alloc_offsets[i] == WP_CONVENTIONAL)
> >> + continue;
> >> +
> >> + if ((alloc_offsets[0] != alloc_offsets[i]) &&
> >> + !btrfs_test_opt(fs_info, DEGRADED)) {
> >> + btrfs_err(fs_info,
> >> + "zoned: write pointer offset mismatch of zones in %s profile",
> >> + btrfs_bg_type_to_raid_name(map->type));
> >> + ret = -EIO;
> >> + goto out;
> >> + }
> >> + if (test_bit(0, active) != test_bit(i, active)) {
> >> + if (!btrfs_test_opt(fs_info, DEGRADED) &&
> >> + !btrfs_zone_activate(cache)) {
> >> + ret = -EIO;
> >> + goto out;
> >> + }
> >> + } else {
> >> + if (test_bit(0, active))
> >> + set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
> >> + &cache->runtime_flags);
> >> + }
> >> + /*
> >> + * In case a device is missing we have a cap of 0, so don't
> >> + * use it.
> >> + */
> >> + cache->zone_capacity = min_not_zero(caps[0], caps[i]);
> >> + }
> >> +
> >> + if (alloc_offsets[0] != WP_MISSING_DEV)
> >> + cache->alloc_offset = alloc_offsets[0];
> >> + else
> >> + cache->alloc_offset = alloc_offsets[i - 1];
> >
> > whole block
> >
> >> + break;
> >> case BTRFS_BLOCK_GROUP_RAID0:
> >
> > and
> >
> >> + if (map->type & BTRFS_BLOCK_GROUP_DATA &&
> >> + !btrfs_stripe_tree_root(fs_info)) {
> >> + btrfs_err(fs_info,
> >> + "zoned: data %s needs stripe_root",
> >> + btrfs_bg_type_to_raid_name(map->type));
> >> + ret = -EIO;
> >> + goto out;
> >> +
> >> + }
> >> + for (i = 0; i < map->num_stripes; i++) {
> >> + if (alloc_offsets[i] == WP_MISSING_DEV ||
> >> + alloc_offsets[i] == WP_CONVENTIONAL)
> >> + continue;
> >> +
> >> + if (test_bit(0, active) != test_bit(i, active)) {
> >> + if (!btrfs_zone_activate(cache)) {
> >> + ret = -EIO;
> >> + goto out;
> >> + }
> >> + } else {
> >> + if (test_bit(0, active))
> >> + set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
> >> + &cache->runtime_flags);
> >> + }
> >> + cache->zone_capacity += caps[i];
> >> + cache->alloc_offset += alloc_offsets[i];
> >> +
> >> + }
> >> + break;
> >> case BTRFS_BLOCK_GROUP_RAID10:
> >> + if (map->type & BTRFS_BLOCK_GROUP_DATA &&
> >> + !btrfs_stripe_tree_root(fs_info)) {
> >> + btrfs_err(fs_info,
> >> + "zoned: data %s needs stripe_root",
> >> + btrfs_bg_type_to_raid_name(map->type));
> >> + ret = -EIO;
> >> + goto out;
> >> +
> >> + }
> >> + for (i = 0; i < map->num_stripes; i++) {
> >> + if (alloc_offsets[i] == WP_MISSING_DEV ||
> >> + alloc_offsets[i] == WP_CONVENTIONAL)
> >> + continue;
> >> +
> >> + if (test_bit(0, active) != test_bit(i, active)) {
> >> + if (!btrfs_zone_activate(cache)) {
> >> + ret = -EIO;
> >> + goto out;
> >> + }
> >> + } else {
> >> + if (test_bit(0, active))
> >> + set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
> >> + &cache->runtime_flags);
> >> + }
> >> + if ((i % map->sub_stripes) == 0) {
> >> + cache->zone_capacity += caps[i];
> >> + cache->alloc_offset += alloc_offsets[i];
> >> + }
> >> +
> >> + }
> >> + break;
> >
> > Seem to be quite long and nested for a case, can they be factored to
> > helpers?
>
> Sure, but I'd love to have
> https://lore.kernel.org/all/20230605085108.580976-1-hch@lst.de/
> pulled in first. This patchset handles (among other things) the DUP and
> single cases as well.
I see, the patches still apply cleanly so I'll add them to misc-next.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v8 01/11] btrfs: add raid stripe tree definitions
2023-09-13 14:49 ` David Sterba
@ 2023-09-13 14:57 ` Johannes Thumshirn
2023-09-13 16:06 ` David Sterba
0 siblings, 1 reply; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-13 14:57 UTC (permalink / raw)
To: dsterba@suse.cz
Cc: Chris Mason, Josef Bacik, David Sterba, Christoph Hellwig,
Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 13.09.23 16:50, David Sterba wrote:
> On Wed, Sep 13, 2023 at 06:02:09AM +0000, Johannes Thumshirn wrote:
>> On 12.09.23 22:32, David Sterba wrote:
>>>> @@ -306,6 +306,16 @@ BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
>>>> BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
>>>> BTRFS_SETGET_STACK_FUNCS(stack_timespec_nsec, struct btrfs_timespec, nsec, 32);
>>>>
>>>> +BTRFS_SETGET_FUNCS(stripe_extent_encoding, struct btrfs_stripe_extent, encoding, 8);
>>>
>>> What is encoding referring to?
>>
>> At the moment (only) the RAID type. But in the future it can be expanded
>> to all kinds of encodings, like Reed-Solomon, Butterfly-Codes, etc...
>
> I see, could it be better called ECC? Like stripe_extent_ecc, that would
> be clear that it's for the correction, encoding sounds is too generic.
Hmm, but for RAID0 there is no correction, so that doesn't quite fit
either. I'd suggest 'type', but I /think/ for RAID5/6 we'll need type=data
and type=parity (and future ECC as well).
Maybe 'level', as in RAID level? I know it is currently redundant, as we
can derive it from the block-group.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v8 07/11] btrfs: zoned: allow zoned RAID
2023-09-13 14:52 ` David Sterba
@ 2023-09-13 14:59 ` Johannes Thumshirn
0 siblings, 0 replies; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-13 14:59 UTC (permalink / raw)
To: dsterba@suse.cz
Cc: Chris Mason, Josef Bacik, David Sterba, Christoph Hellwig,
Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 13.09.23 16:52, David Sterba wrote:
>> Sure, but I'd love to have
>> https://lore.kernel.org/all/20230605085108.580976-1-hch@lst.de/
>> pulled in first. This patchset handles (among other things) the DUP and
>> single cases as well.
>
> I see, the patches still apply cleanly so I'll add them to misc-next.
>
Thanks
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v8 01/11] btrfs: add raid stripe tree definitions
2023-09-13 14:57 ` Johannes Thumshirn
@ 2023-09-13 16:06 ` David Sterba
0 siblings, 0 replies; 39+ messages in thread
From: David Sterba @ 2023-09-13 16:06 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Chris Mason, Josef Bacik, David Sterba, Christoph Hellwig,
Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On Wed, Sep 13, 2023 at 02:57:50PM +0000, Johannes Thumshirn wrote:
> On 13.09.23 16:50, David Sterba wrote:
> > On Wed, Sep 13, 2023 at 06:02:09AM +0000, Johannes Thumshirn wrote:
> >> On 12.09.23 22:32, David Sterba wrote:
> >>>> @@ -306,6 +306,16 @@ BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
> >>>> BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
> >>>> BTRFS_SETGET_STACK_FUNCS(stack_timespec_nsec, struct btrfs_timespec, nsec, 32);
> >>>>
> >>>> +BTRFS_SETGET_FUNCS(stripe_extent_encoding, struct btrfs_stripe_extent, encoding, 8);
> >>>
> >>> What is encoding referring to?
> >>
> >> At the moment (only) the RAID type. But in the future it can be expanded
> >> to all kinds of encodings, like Reed-Solomon, Butterfly-Codes, etc...
> >
> > I see, could it be better called ECC? Like stripe_extent_ecc, that would
> > be clear that it's for the correction, encoding sounds is too generic.
>
> Hmm but for RAID0 there is no correction, so not really as well. I'd
> suggest 'type', but I /think/ for RAID5/6 we'll need type=data and
> type=parity (and future ECC as well).
>
> Maybe level, as in RAID level? I know currently it is redundant, as we
> can derive it from the block-group.
Ok, let's keep encoding, we might actually need the generic meaning; what
I was missing was the context.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v8 03/11] btrfs: add support for inserting raid stripe extents
2023-09-11 12:52 ` [PATCH v8 03/11] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
@ 2023-09-13 16:50 ` David Sterba
2023-09-13 16:57 ` David Sterba
2023-09-14 9:25 ` Qu Wenruo
2 siblings, 0 replies; 39+ messages in thread
From: David Sterba @ 2023-09-13 16:50 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Chris Mason, Josef Bacik, David Sterba, Christoph Hellwig,
Naohiro Aota, Qu Wenruo, Damien Le Moal, linux-btrfs,
linux-kernel
On Mon, Sep 11, 2023 at 05:52:04AM -0700, Johannes Thumshirn wrote:
> Add support for inserting stripe extents into the raid stripe tree on
> completion of every write that needs an extra logical-to-physical
> translation when using RAID.
>
> Inserting the stripe extents happens after the data I/O has completed,
> this is done to a) support zone-append and b) rule out the possibility of
> a RAID-write-hole.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/Makefile | 2 +-
> fs/btrfs/bio.c | 23 ++++
> fs/btrfs/extent-tree.c | 1 +
> fs/btrfs/inode.c | 8 +-
> fs/btrfs/ordered-data.c | 1 +
> fs/btrfs/ordered-data.h | 2 +
> fs/btrfs/raid-stripe-tree.c | 266 ++++++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/raid-stripe-tree.h | 34 ++++++
> fs/btrfs/volumes.c | 4 +-
> fs/btrfs/volumes.h | 15 ++-
> 10 files changed, 347 insertions(+), 9 deletions(-)
>
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index c57d80729d4f..525af975f61c 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -33,7 +33,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
> uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
> block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
> subpage.o tree-mod-log.o extent-io-tree.o fs.o messages.o bio.o \
> - lru_cache.o
> + lru_cache.o raid-stripe-tree.o
>
> btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
> btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 31ff36990404..ddbe6f8d4ea2 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -14,6 +14,7 @@
> #include "rcu-string.h"
> #include "zoned.h"
> #include "file-item.h"
> +#include "raid-stripe-tree.h"
>
> static struct bio_set btrfs_bioset;
> static struct bio_set btrfs_clone_bioset;
> @@ -415,6 +416,9 @@ static void btrfs_orig_write_end_io(struct bio *bio)
> else
> bio->bi_status = BLK_STS_OK;
>
> + if (bio_op(bio) == REQ_OP_ZONE_APPEND && !bio->bi_status)
> + stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> +
> btrfs_orig_bbio_end_io(bbio);
> btrfs_put_bioc(bioc);
> }
> @@ -426,6 +430,8 @@ static void btrfs_clone_write_end_io(struct bio *bio)
> if (bio->bi_status) {
> atomic_inc(&stripe->bioc->error);
> btrfs_log_dev_io_error(bio, stripe->dev);
> + } else if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> + stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> }
>
> /* Pass on control to the original bio this one was cloned from */
> @@ -487,6 +493,7 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
> bio->bi_private = &bioc->stripes[dev_nr];
> bio->bi_iter.bi_sector = bioc->stripes[dev_nr].physical >> SECTOR_SHIFT;
> bioc->stripes[dev_nr].bioc = bioc;
> + bioc->size = bio->bi_iter.bi_size;
> btrfs_submit_dev_bio(bioc->stripes[dev_nr].dev, bio);
> }
>
> @@ -496,6 +503,8 @@ static void __btrfs_submit_bio(struct bio *bio, struct btrfs_io_context *bioc,
> if (!bioc) {
> /* Single mirror read/write fast path. */
> btrfs_bio(bio)->mirror_num = mirror_num;
> + if (bio_op(bio) != REQ_OP_READ)
> + btrfs_bio(bio)->orig_physical = smap->physical;
> bio->bi_iter.bi_sector = smap->physical >> SECTOR_SHIFT;
> if (bio_op(bio) != REQ_OP_READ)
> btrfs_bio(bio)->orig_physical = smap->physical;
> @@ -688,6 +697,20 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
> bio->bi_opf |= REQ_OP_ZONE_APPEND;
> }
>
> + if (is_data_bbio(bbio) && bioc &&
> + btrfs_need_stripe_tree_update(bioc->fs_info,
> + bioc->map_type)) {
> + /*
> + * No locking for the list update, as we only add to
> + * the list in the I/O submission path, and list
> + * iteration only happens in the completion path,
> + * which can't happen until after the last submission.
> + */
> + btrfs_get_bioc(bioc);
> + list_add_tail(&bioc->ordered_entry,
> + &bbio->ordered->bioc_list);
> + }
> +
> /*
> * Csum items for reloc roots have already been cloned at this
> * point, so they are handled as part of the no-checksum case.
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 6f6838226fe7..2e11a699ab77 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -42,6 +42,7 @@
> #include "file-item.h"
> #include "orphan.h"
> #include "tree-checker.h"
> +#include "raid-stripe-tree.h"
>
> #undef SCRAMBLE_DELAYED_REFS
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index bafca05940d7..6f71630248da 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -71,6 +71,7 @@
> #include "super.h"
> #include "orphan.h"
> #include "backref.h"
> +#include "raid-stripe-tree.h"
>
> struct btrfs_iget_args {
> u64 ino;
> @@ -3091,6 +3092,10 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
>
> trans->block_rsv = &inode->block_rsv;
>
> + ret = btrfs_insert_raid_extent(trans, ordered_extent);
> + if (ret)
> + goto out;
> +
> if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered_extent->flags))
> compress_type = ordered_extent->compress_type;
> if (test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags)) {
> @@ -3224,7 +3229,8 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
> int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered)
> {
> if (btrfs_is_zoned(btrfs_sb(ordered->inode->i_sb)) &&
> - !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags))
> + !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags) &&
> + list_empty(&ordered->bioc_list))
> btrfs_finish_ordered_zoned(ordered);
> return btrfs_finish_one_ordered(ordered);
> }
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 345c449d588c..55c7d5543265 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -191,6 +191,7 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> INIT_LIST_HEAD(&entry->log_list);
> INIT_LIST_HEAD(&entry->root_extent_list);
> INIT_LIST_HEAD(&entry->work_list);
> + INIT_LIST_HEAD(&entry->bioc_list);
> init_completion(&entry->completion);
>
> /*
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index 173bd5c5df26..1c51ac57e5df 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -151,6 +151,8 @@ struct btrfs_ordered_extent {
> struct completion completion;
> struct btrfs_work flush_work;
> struct list_head work_list;
> +
> + struct list_head bioc_list;
> };
>
> static inline void
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> new file mode 100644
> index 000000000000..2415698a8fef
> --- /dev/null
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -0,0 +1,266 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2023 Western Digital Corporation or its affiliates.
> + */
> +
> +#include <linux/btrfs_tree.h>
> +
> +#include "ctree.h"
> +#include "fs.h"
> +#include "accessors.h"
> +#include "transaction.h"
> +#include "disk-io.h"
> +#include "raid-stripe-tree.h"
> +#include "volumes.h"
> +#include "misc.h"
> +#include "print-tree.h"
> +
> +static u8 btrfs_bg_type_to_raid_encoding(u64 map_type)
> +{
> + switch (map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
> + case BTRFS_BLOCK_GROUP_DUP:
> + return BTRFS_STRIPE_DUP;
> + case BTRFS_BLOCK_GROUP_RAID0:
> + return BTRFS_STRIPE_RAID0;
> + case BTRFS_BLOCK_GROUP_RAID1:
> + return BTRFS_STRIPE_RAID1;
> + case BTRFS_BLOCK_GROUP_RAID1C3:
> + return BTRFS_STRIPE_RAID1C3;
> + case BTRFS_BLOCK_GROUP_RAID1C4:
> + return BTRFS_STRIPE_RAID1C4;
> + case BTRFS_BLOCK_GROUP_RAID5:
> + return BTRFS_STRIPE_RAID5;
> + case BTRFS_BLOCK_GROUP_RAID6:
> + return BTRFS_STRIPE_RAID6;
> + case BTRFS_BLOCK_GROUP_RAID10:
> + return BTRFS_STRIPE_RAID10;
> + default:
> + ASSERT(0);
> + }
> +}
> +
> +static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
> + int num_stripes,
> + struct btrfs_io_context *bioc)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_key stripe_key;
> + struct btrfs_root *stripe_root = btrfs_stripe_tree_root(fs_info);
> + u8 encoding = btrfs_bg_type_to_raid_encoding(bioc->map_type);
> + struct btrfs_stripe_extent *stripe_extent;
> + size_t item_size;
> + int ret;
> +
> + item_size = struct_size(stripe_extent, strides, num_stripes);
> +
> + stripe_extent = kzalloc(item_size, GFP_NOFS);
> + if (!stripe_extent) {
> + btrfs_abort_transaction(trans, -ENOMEM);
> + btrfs_end_transaction(trans);
> + return -ENOMEM;
> + }
> +
> + btrfs_set_stack_stripe_extent_encoding(stripe_extent, encoding);
> + for (int i = 0; i < num_stripes; i++) {
> + u64 devid = bioc->stripes[i].dev->devid;
> + u64 physical = bioc->stripes[i].physical;
> + u64 length = bioc->stripes[i].length;
> + struct btrfs_raid_stride *raid_stride =
> + &stripe_extent->strides[i];
> +
> + if (length == 0)
> + length = bioc->size;
> +
> + btrfs_set_stack_raid_stride_devid(raid_stride, devid);
> + btrfs_set_stack_raid_stride_physical(raid_stride, physical);
> + btrfs_set_stack_raid_stride_length(raid_stride, length);
> + }
> +
> + stripe_key.objectid = bioc->logical;
> + stripe_key.type = BTRFS_RAID_STRIPE_KEY;
> + stripe_key.offset = bioc->size;
> +
> + ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
> + item_size);
> + if (ret)
> + btrfs_abort_transaction(trans, ret);
> +
> + kfree(stripe_extent);
> +
> + return ret;
> +}
> +
> +static int btrfs_insert_mirrored_raid_extents(struct btrfs_trans_handle *trans,
> + struct btrfs_ordered_extent *ordered,
> + u64 map_type)
> +{
> + int num_stripes = btrfs_bg_type_to_factor(map_type);
> + struct btrfs_io_context *bioc;
> + int ret;
> +
> + list_for_each_entry(bioc, &ordered->bioc_list, ordered_entry) {
> + ret = btrfs_insert_one_raid_extent(trans, num_stripes, bioc);
> + if (ret)
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +static int btrfs_insert_striped_mirrored_raid_extents(
> + struct btrfs_trans_handle *trans,
> + struct btrfs_ordered_extent *ordered,
> + u64 map_type)
> +{
> + struct btrfs_io_context *bioc;
> + struct btrfs_io_context *rbioc;
> + const int nstripes = list_count_nodes(&ordered->bioc_list);
> + const int index = btrfs_bg_flags_to_raid_index(map_type);
> + const int substripes = btrfs_raid_array[index].sub_stripes;
> + const int max_stripes = trans->fs_info->fs_devices->rw_devices / 2;
This could use the table-based mirroring factor, right? Even though we
only have raid10 here and not the C3 and C4 variants, it's better to avoid
hard-coded constants.
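E.g. something along these lines, untested and just to illustrate the idea:

	const int max_stripes = trans->fs_info->fs_devices->rw_devices /
				btrfs_raid_array[index].sub_stripes;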
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v8 03/11] btrfs: add support for inserting raid stripe extents
2023-09-11 12:52 ` [PATCH v8 03/11] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
2023-09-13 16:50 ` David Sterba
@ 2023-09-13 16:57 ` David Sterba
2023-09-14 9:25 ` Qu Wenruo
2 siblings, 0 replies; 39+ messages in thread
From: David Sterba @ 2023-09-13 16:57 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Chris Mason, Josef Bacik, David Sterba, Christoph Hellwig,
Naohiro Aota, Qu Wenruo, Damien Le Moal, linux-btrfs,
linux-kernel
On Mon, Sep 11, 2023 at 05:52:04AM -0700, Johannes Thumshirn wrote:
> +static int btrfs_insert_striped_mirrored_raid_extents(
> + struct btrfs_trans_handle *trans,
> + struct btrfs_ordered_extent *ordered,
> + u64 map_type)
> +{
> + struct btrfs_io_context *bioc;
> + struct btrfs_io_context *rbioc;
> + const int nstripes = list_count_nodes(&ordered->bioc_list);
> + const int index = btrfs_bg_flags_to_raid_index(map_type);
> + const int substripes = btrfs_raid_array[index].sub_stripes;
> + const int max_stripes = trans->fs_info->fs_devices->rw_devices / 2;
> + int left = nstripes;
> + int stripe = 0, j = 0;
> + int i = 0;
Please move the initialization right before the block that uses the
variables.
> + int ret = 0;
> + u64 stripe_end;
> + u64 prev_end;
> +
> + if (nstripes == 1)
> + return btrfs_insert_mirrored_raid_extents(trans, ordered, map_type);
> +
> + rbioc = kzalloc(struct_size(rbioc, stripes, nstripes * substripes),
> + GFP_KERNEL);
> + if (!rbioc)
> + return -ENOMEM;
> +
> + rbioc->map_type = map_type;
> + rbioc->logical = list_first_entry(&ordered->bioc_list, typeof(*rbioc),
> + ordered_entry)->logical;
> +
> + stripe_end = rbioc->logical;
> + prev_end = stripe_end;
Like here, this is where 'i' could be initialized too.
> + list_for_each_entry(bioc, &ordered->bioc_list, ordered_entry) {
> +
> + rbioc->size += bioc->size;
> + for (j = 0; j < substripes; j++) {
And if you don't use 'j' outside of the for cycle you can put the
declaration inside the for (...).
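I.e. something like:

	for (int j = 0; j < substripes; j++) {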
> + stripe = i + j;
> + rbioc->stripes[stripe].dev = bioc->stripes[j].dev;
> + rbioc->stripes[stripe].physical = bioc->stripes[j].physical;
> + rbioc->stripes[stripe].length = bioc->size;
> + }
> +
> + stripe_end += rbioc->size;
> + if (i >= nstripes ||
> + (stripe_end - prev_end >= max_stripes * BTRFS_STRIPE_LEN)) {
> + ret = btrfs_insert_one_raid_extent(trans,
> + nstripes * substripes,
> + rbioc);
> + if (ret)
> + goto out;
> +
> + left -= nstripes;
> + i = 0;
> + rbioc->logical += rbioc->size;
> + rbioc->size = 0;
> + } else {
> + i += substripes;
> + prev_end = stripe_end;
> + }
> + }
> +
> + if (left) {
> + bioc = list_prev_entry(bioc, ordered_entry);
> + ret = btrfs_insert_one_raid_extent(trans, substripes, bioc);
> + }
> +
> +out:
> + kfree(rbioc);
> + return ret;
> +}
> +
> +static int btrfs_insert_striped_raid_extents(struct btrfs_trans_handle *trans,
> + struct btrfs_ordered_extent *ordered,
> + u64 map_type)
> +{
> + struct btrfs_io_context *bioc;
> + struct btrfs_io_context *rbioc;
> + const int nstripes = list_count_nodes(&ordered->bioc_list);
> + int i = 0;
> + int ret = 0;
> +
> + rbioc = kzalloc(struct_size(rbioc, stripes, nstripes), GFP_KERNEL);
You can't use GFP_KERNEL generally in any function that takes a
transaction handle parameter. Either GFP_NOFS or with the
memalloc_nofs_* protection.
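I.e. either (untested sketch):

	rbioc = kzalloc(struct_size(rbioc, stripes, nstripes), GFP_NOFS);

or with the scope API from <linux/sched/mm.h>:

	unsigned int nofs_flags;

	nofs_flags = memalloc_nofs_save();
	rbioc = kzalloc(struct_size(rbioc, stripes, nstripes), GFP_KERNEL);
	memalloc_nofs_restore(nofs_flags);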
> + if (!rbioc)
> + return -ENOMEM;
> + rbioc->map_type = map_type;
> + rbioc->logical = list_first_entry(&ordered->bioc_list, typeof(*rbioc),
> + ordered_entry)->logical;
> +
Maybe initializing 'i' here would be better so it's consistent with
other code.
> + list_for_each_entry(bioc, &ordered->bioc_list, ordered_entry) {
> + rbioc->size += bioc->size;
> + rbioc->stripes[i].dev = bioc->stripes[0].dev;
> + rbioc->stripes[i].physical = bioc->stripes[0].physical;
> + rbioc->stripes[i].length = bioc->size;
> +
> + if (i == nstripes - 1) {
> + ret = btrfs_insert_one_raid_extent(trans, nstripes, rbioc);
> + if (ret)
> + goto out;
> +
> + i = 0;
> + rbioc->logical += rbioc->size;
> + rbioc->size = 0;
> + } else {
> + i++;
> + }
> + }
> +
> + if (i && i < nstripes - 1)
> + ret = btrfs_insert_one_raid_extent(trans, i, rbioc);
> +
> +out:
> + kfree(rbioc);
> + return ret;
> +}
> +
> +int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
> + struct btrfs_ordered_extent *ordered_extent)
> +{
> + struct btrfs_io_context *bioc;
> + u64 map_type;
> + int ret;
> +
> + if (!trans->fs_info->stripe_root)
> + return 0;
> +
> + map_type = list_first_entry(&ordered_extent->bioc_list, typeof(*bioc),
> + ordered_entry)->map_type;
> +
> + switch (map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
> + case BTRFS_BLOCK_GROUP_DUP:
> + case BTRFS_BLOCK_GROUP_RAID1:
> + case BTRFS_BLOCK_GROUP_RAID1C3:
> + case BTRFS_BLOCK_GROUP_RAID1C4:
> + ret = btrfs_insert_mirrored_raid_extents(trans, ordered_extent,
> + map_type);
> + break;
> + case BTRFS_BLOCK_GROUP_RAID0:
> + ret = btrfs_insert_striped_raid_extents(trans, ordered_extent,
> + map_type);
> + break;
> + case BTRFS_BLOCK_GROUP_RAID10:
> + ret = btrfs_insert_striped_mirrored_raid_extents(trans, ordered_extent, map_type);
> + break;
> + default:
> + ret = -EINVAL;
> + break;
> + }
> +
> + while (!list_empty(&ordered_extent->bioc_list)) {
> + bioc = list_first_entry(&ordered_extent->bioc_list,
> + typeof(*bioc), ordered_entry);
> + list_del(&bioc->ordered_entry);
> + btrfs_put_bioc(bioc);
> + }
> +
> + return ret;
> +}
> diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
> new file mode 100644
> index 000000000000..f36e4c2d46b0
> --- /dev/null
> +++ b/fs/btrfs/raid-stripe-tree.h
> @@ -0,0 +1,34 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2023 Western Digital Corporation or its affiliates.
> + */
> +
> +#ifndef BTRFS_RAID_STRIPE_TREE_H
> +#define BTRFS_RAID_STRIPE_TREE_H
> +
> +#include "disk-io.h"
> +
> +struct btrfs_io_context;
> +struct btrfs_io_stripe;
Please add more forward declarations, btrfs_trans_handle,
btrfs_ordered_extent or fs_info.
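I.e.:

struct btrfs_trans_handle;
struct btrfs_ordered_extent;
struct btrfs_fs_info;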
> +
> +int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
> + struct btrfs_ordered_extent *ordered_extent);
> +
> +static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
> + u64 map_type)
> +{
> + u64 type = map_type & BTRFS_BLOCK_GROUP_TYPE_MASK;
> + u64 profile = map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
> +
> + if (!btrfs_stripe_tree_root(fs_info))
> + return false;
> +
> + if (type != BTRFS_BLOCK_GROUP_DATA)
> + return false;
> +
> + if (profile & BTRFS_BLOCK_GROUP_RAID1_MASK)
> + return true;
> +
> + return false;
> +}
> +#endif
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v8 06/11] btrfs: implement RST version of scrub
2023-09-11 12:52 ` [PATCH v8 06/11] btrfs: implement RST version of scrub Johannes Thumshirn
2023-09-13 9:51 ` Qu Wenruo
@ 2023-09-13 16:59 ` David Sterba
1 sibling, 0 replies; 39+ messages in thread
From: David Sterba @ 2023-09-13 16:59 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Chris Mason, Josef Bacik, David Sterba, Christoph Hellwig,
Naohiro Aota, Qu Wenruo, Damien Le Moal, linux-btrfs,
linux-kernel
On Mon, Sep 11, 2023 at 05:52:07AM -0700, Johannes Thumshirn wrote:
> A filesystem that uses the RAID stripe tree for logical to physical
> address translation can't use the regular scrub path, that reads all
> stripes and then checks if a sector is unused afterwards.
>
> When using the RAID stripe tree, this will result in lookup errors, as the
> stripe tree doesn't know the requested logical addresses.
>
> Instead, look up stripes that are backed by the extent bitmap.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/scrub.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 56 insertions(+)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index f16220ce5fba..5101e0a3f83e 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -23,6 +23,7 @@
> #include "accessors.h"
> #include "file-item.h"
> #include "scrub.h"
> +#include "raid-stripe-tree.h"
>
> /*
> * This is only the first step towards a full-features scrub. It reads all
> @@ -1634,6 +1635,56 @@ static void scrub_reset_stripe(struct scrub_stripe *stripe)
> }
> }
>
> +static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
> + struct scrub_stripe *stripe)
> +{
> + struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
> + struct btrfs_bio *bbio = NULL;
> + int mirror = stripe->mirror_num;
> + int i;
> +
> + atomic_inc(&stripe->pending_io);
> +
> + for_each_set_bit(i, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
> + struct page *page;
> + int pgoff;
This should be unsigned int.
> +
> + page = scrub_stripe_get_page(stripe, i);
> + pgoff = scrub_stripe_get_page_offset(stripe, i);
You can probably move the initializations right to the declarations; I
think we do that elsewhere too.
> + /* The current sector cannot be merged, submit the bio. */
> + if (bbio &&
> + ((i > 0 && !test_bit(i - 1, &stripe->extent_sector_bitmap)) ||
> + bbio->bio.bi_iter.bi_size >= BTRFS_STRIPE_LEN)) {
> + ASSERT(bbio->bio.bi_iter.bi_size);
> + atomic_inc(&stripe->pending_io);
> + btrfs_submit_bio(bbio, mirror);
> + bbio = NULL;
> + }
> +
> + if (!bbio) {
> + bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
> + fs_info, scrub_read_endio, stripe);
> + bbio->bio.bi_iter.bi_sector = (stripe->logical +
> + (i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT;
> + }
> +
> + __bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff);
> + }
> +
> + if (bbio) {
> + ASSERT(bbio->bio.bi_iter.bi_size);
> + atomic_inc(&stripe->pending_io);
> + btrfs_submit_bio(bbio, mirror);
> + }
> +
> + if (atomic_dec_and_test(&stripe->pending_io)) {
> + wake_up(&stripe->io_wait);
> + INIT_WORK(&stripe->work, scrub_stripe_read_repair_worker);
> + queue_work(stripe->bg->fs_info->scrub_workers, &stripe->work);
> + }
> +}
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v8 05/11] btrfs: lookup physical address from stripe extent
2023-09-11 12:52 ` [PATCH v8 05/11] btrfs: lookup physical address from stripe extent Johannes Thumshirn
@ 2023-09-14 9:18 ` Qu Wenruo
2023-09-14 9:45 ` Johannes Thumshirn
2023-09-14 14:16 ` Johannes Thumshirn
0 siblings, 2 replies; 39+ messages in thread
From: Qu Wenruo @ 2023-09-14 9:18 UTC (permalink / raw)
To: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel
On 2023/9/11 22:22, Johannes Thumshirn wrote:
> Lookup the physical address from the raid stripe tree when a read on an
> RAID volume formatted with the raid stripe tree was attempted.
>
> If the requested logical address was not found in the stripe tree, it may
> still be in the in-memory ordered stripe tree, so fallback to searching
> the ordered stripe tree in this case.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/raid-stripe-tree.c | 159 ++++++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/raid-stripe-tree.h | 11 +++
> fs/btrfs/volumes.c | 37 ++++++++---
> 3 files changed, 198 insertions(+), 9 deletions(-)
>
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index 5b12f40877b5..7ed02e4b79ec 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -324,3 +324,162 @@ int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
>
> return ret;
> }
> +
> +static bool btrfs_check_for_extent(struct btrfs_fs_info *fs_info, u64 logical,
> + u64 length, struct btrfs_path *path)
> +{
> + struct btrfs_root *extent_root = btrfs_extent_root(fs_info, logical);
> + struct btrfs_key key;
> + int ret;
> +
> + btrfs_release_path(path);
> +
> + key.objectid = logical;
> + key.type = BTRFS_EXTENT_ITEM_KEY;
> + key.offset = length;
> +
> + ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
> +
> + return ret;
> +}
> +
> +int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
> + u64 logical, u64 *length, u64 map_type,
> + u32 stripe_index,
> + struct btrfs_io_stripe *stripe)
> +{
> + struct btrfs_root *stripe_root = btrfs_stripe_tree_root(fs_info);
> + struct btrfs_stripe_extent *stripe_extent;
> + struct btrfs_key stripe_key;
> + struct btrfs_key found_key;
> + struct btrfs_path *path;
> + struct extent_buffer *leaf;
> + int num_stripes;
> + u8 encoding;
> + u64 offset;
> + u64 found_logical;
> + u64 found_length;
> + u64 end;
> + u64 found_end;
> + int slot;
> + int ret;
> + int i;
> +
> + stripe_key.objectid = logical;
> + stripe_key.type = BTRFS_RAID_STRIPE_KEY;
> + stripe_key.offset = 0;
> +
> + path = btrfs_alloc_path();
> + if (!path)
> + return -ENOMEM;
> +
> + ret = btrfs_search_slot(NULL, stripe_root, &stripe_key, path, 0, 0);
> + if (ret < 0)
> + goto free_path;
> + if (ret) {
> + if (path->slots[0] != 0)
> + path->slots[0]--;
IIRC we have btrfs_previous_item() to do this kind of search.
> + }
> +
> + end = logical + *length;
IMHO we can make it const and initialize it where it's defined.
> +
> + while (1) {
Here we can hit at most one RST item, thus I'd recommend removing the
while().
Although this would mean we need an if () to handle the (ret > 0) case,
it may still be a little easier to read than a loop.
You may want to refer to btrfs_lookup_csum() for the non-loop
implementation.
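Combined with the btrfs_previous_item() point above, a rough and untested
sketch of the non-loop pattern:

	ret = btrfs_search_slot(NULL, stripe_root, &stripe_key, path, 0, 0);
	if (ret < 0)
		goto free_path;
	if (ret > 0) {
		ret = btrfs_previous_item(stripe_root, path, 0,
					  BTRFS_RAID_STRIPE_KEY);
		if (ret)
			goto out;
	}

	btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]);
	if (!in_range(logical, found_key.objectid, found_key.offset)) {
		ret = -ENOENT;
		goto out;
	}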
> + leaf = path->nodes[0];
> + slot = path->slots[0];
> +
> + btrfs_item_key_to_cpu(leaf, &found_key, slot);
> + found_logical = found_key.objectid;
> + found_length = found_key.offset;
> + found_end = found_logical + found_length;
> +
> + if (found_logical > end) {
> + ret = -ENOENT;
> + goto out;
> + }
> +
> + if (in_range(logical, found_logical, found_length))
> + break;
> +
> + ret = btrfs_next_item(stripe_root, path);
> + if (ret)
> + goto out;
> + }
> +
> + offset = logical - found_logical;
> +
> + /*
> + * If we have a logically contiguous, but physically noncontinuous
> + * range, we need to split the bio. Record the length after which we
> + * must split the bio.
> + */
> + if (end > found_end)
> + *length -= end - found_end;
> +
> + num_stripes = btrfs_num_raid_stripes(btrfs_item_size(leaf, slot));
> + stripe_extent = btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
> + encoding = btrfs_stripe_extent_encoding(leaf, stripe_extent);
> +
> + if (encoding != btrfs_bg_type_to_raid_encoding(map_type)) {
> + ret = -ENOENT;
> + goto out;
This looks like a very weird situation: we have a bg with a different type.
Should we emit some warning, or is there a valid situation where this can
happen?
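If there is no valid case for it, maybe something like this (untested):

	if (encoding != btrfs_bg_type_to_raid_encoding(map_type)) {
		btrfs_warn(fs_info,
	"raid stripe encoding %u doesn't match block group profile %s",
			   encoding, btrfs_bg_type_to_raid_name(map_type));
		ret = -ENOENT;
		goto out;
	}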
> + }
> +
> + for (i = 0; i < num_stripes; i++) {
> + struct btrfs_raid_stride *stride = &stripe_extent->strides[i];
> + u64 devid = btrfs_raid_stride_devid(leaf, stride);
> + u64 len = btrfs_raid_stride_length(leaf, stride);
> + u64 physical = btrfs_raid_stride_physical(leaf, stride);
> +
> + if (offset >= len) {
> + offset -= len;
> +
> + if (offset >= BTRFS_STRIPE_LEN)
> + continue;
> + }
> +
> + if (devid != stripe->dev->devid)
> + continue;
> +
> + if ((map_type & BTRFS_BLOCK_GROUP_DUP) && stripe_index != i)
> + continue;
> +
> + stripe->physical = physical + offset;
> +
> + ret = 0;
> + goto free_path;
> + }
> +
> + /*
> + * If we're here, we haven't found the requested devid in the stripe.
> + */
> + ret = -ENOENT;
> +out:
> + if (ret > 0)
> + ret = -ENOENT;
> + if (ret && ret != -EIO) {
> + /*
> + * Check if the range we're looking for is actually backed by
> + * an extent. This can happen, e.g. when scrub is running on a
> + * block-group and the extent it is trying to scrub get's
> + * deleted in the meantime. Although scrub is setting the
> + * block-group to read-only, deletion of extents are still
> + * allowed. If the extent is gone, simply return ENOENT and be
> + * good.
> + */
As mentioned in the next patch (sorry for the reversed order), this
should be handled in a different way (by only searching commit root for
scrub usage).
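I.e. when called from scrub the lookup could, for example (untested), search
the commit root only:

	path->search_commit_root = 1;
	path->skip_locking = 1;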
> + if (btrfs_check_for_extent(fs_info, logical, *length, path)) {
> + ret = -ENOENT;
> + goto free_path;
> + }
> +
> + if (IS_ENABLED(CONFIG_BTRFS_DEBUG))
> + btrfs_print_tree(leaf, 1);
> + btrfs_err(fs_info,
> + "cannot find raid-stripe for logical [%llu, %llu] devid %llu, profile %s",
> + logical, logical + *length, stripe->dev->devid,
> + btrfs_bg_type_to_raid_name(map_type));
> + }
> +free_path:
> + btrfs_free_path(path);
> +
> + return ret;
> +}
> diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
> index 7560dc501a65..40aa553ae8aa 100644
> --- a/fs/btrfs/raid-stripe-tree.h
> +++ b/fs/btrfs/raid-stripe-tree.h
> @@ -13,6 +13,10 @@ struct btrfs_io_stripe;
>
> int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
> u64 length);
> +int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
> + u64 logical, u64 *length, u64 map_type,
> + u32 stripe_index,
> + struct btrfs_io_stripe *stripe);
> int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
> struct btrfs_ordered_extent *ordered_extent);
>
> @@ -33,4 +37,11 @@ static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
>
> return false;
> }
> +
> +static inline int btrfs_num_raid_stripes(u32 item_size)
> +{
> + return (item_size - offsetof(struct btrfs_stripe_extent, strides)) /
> + sizeof(struct btrfs_raid_stride);
> +}
> +
> #endif
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 0c0fd4eb4848..7c25f5c77788 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -35,6 +35,7 @@
> #include "relocation.h"
> #include "scrub.h"
> #include "super.h"
> +#include "raid-stripe-tree.h"
>
> #define BTRFS_BLOCK_GROUP_STRIPE_MASK (BTRFS_BLOCK_GROUP_RAID0 | \
> BTRFS_BLOCK_GROUP_RAID10 | \
> @@ -6206,12 +6207,22 @@ static u64 btrfs_max_io_len(struct map_lookup *map, enum btrfs_map_op op,
> return U64_MAX;
> }
>
> -static void set_io_stripe(struct btrfs_io_stripe *dst, const struct map_lookup *map,
> - u32 stripe_index, u64 stripe_offset, u32 stripe_nr)
> +static int set_io_stripe(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> + u64 logical, u64 *length, struct btrfs_io_stripe *dst,
> + struct map_lookup *map, u32 stripe_index,
> + u64 stripe_offset, u64 stripe_nr)
Do we need @length to be a pointer?
IIRC we can return the number of bytes we mapped, or <0 for errors.
Thus at least @length doesn't need to be a pointer.
Thanks,
Qu
> {
> dst->dev = map->stripes[stripe_index].dev;
> +
> + if (op == BTRFS_MAP_READ &&
> + btrfs_need_stripe_tree_update(fs_info, map->type))
> + return btrfs_get_raid_extent_offset(fs_info, logical, length,
> + map->type, stripe_index,
> + dst);
> +
> dst->physical = map->stripes[stripe_index].physical +
> stripe_offset + btrfs_stripe_nr_to_offset(stripe_nr);
> + return 0;
> }
>
> /*
> @@ -6428,11 +6439,11 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> */
> if (smap && num_alloc_stripes == 1 &&
> !((map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) && mirror_num > 1)) {
> - set_io_stripe(smap, map, stripe_index, stripe_offset, stripe_nr);
> + ret = set_io_stripe(fs_info, op, logical, length, smap, map,
> + stripe_index, stripe_offset, stripe_nr);
> if (mirror_num_ret)
> *mirror_num_ret = mirror_num;
> *bioc_ret = NULL;
> - ret = 0;
> goto out;
> }
>
> @@ -6463,21 +6474,29 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> bioc->full_stripe_logical = em->start +
> btrfs_stripe_nr_to_offset(stripe_nr * data_stripes);
> for (i = 0; i < num_stripes; i++)
> - set_io_stripe(&bioc->stripes[i], map,
> - (i + stripe_nr) % num_stripes,
> - stripe_offset, stripe_nr);
> + ret = set_io_stripe(fs_info, op, logical, length,
> + &bioc->stripes[i], map,
> + (i + stripe_nr) % num_stripes,
> + stripe_offset, stripe_nr);
> } else {
> /*
> * For all other non-RAID56 profiles, just copy the target
> * stripe into the bioc.
> */
> for (i = 0; i < num_stripes; i++) {
> - set_io_stripe(&bioc->stripes[i], map, stripe_index,
> - stripe_offset, stripe_nr);
> + ret = set_io_stripe(fs_info, op, logical, length,
> + &bioc->stripes[i], map, stripe_index,
> + stripe_offset, stripe_nr);
> stripe_index++;
> }
> }
>
> + if (ret) {
> + *bioc_ret = NULL;
> + btrfs_put_bioc(bioc);
> + goto out;
> + }
> +
> if (op != BTRFS_MAP_READ)
> max_errors = btrfs_chunk_max_errors(map);
>
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v8 03/11] btrfs: add support for inserting raid stripe extents
2023-09-11 12:52 ` [PATCH v8 03/11] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
2023-09-13 16:50 ` David Sterba
2023-09-13 16:57 ` David Sterba
@ 2023-09-14 9:25 ` Qu Wenruo
2023-09-14 9:51 ` Johannes Thumshirn
2 siblings, 1 reply; 39+ messages in thread
From: Qu Wenruo @ 2023-09-14 9:25 UTC (permalink / raw)
To: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel
On 2023/9/11 22:22, Johannes Thumshirn wrote:
> Add support for inserting stripe extents into the raid stripe tree on
> completion of every write that needs an extra logical-to-physical
> translation when using RAID.
>
> Inserting the stripe extents happens after the data I/O has completed,
> this is done to a) support zone-append and b) rule out the possibility of
> a RAID-write-hole.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/Makefile | 2 +-
> fs/btrfs/bio.c | 23 ++++
> fs/btrfs/extent-tree.c | 1 +
> fs/btrfs/inode.c | 8 +-
> fs/btrfs/ordered-data.c | 1 +
> fs/btrfs/ordered-data.h | 2 +
> fs/btrfs/raid-stripe-tree.c | 266 ++++++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/raid-stripe-tree.h | 34 ++++++
> fs/btrfs/volumes.c | 4 +-
> fs/btrfs/volumes.h | 15 ++-
> 10 files changed, 347 insertions(+), 9 deletions(-)
>
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index c57d80729d4f..525af975f61c 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -33,7 +33,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
> uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
> block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
> subpage.o tree-mod-log.o extent-io-tree.o fs.o messages.o bio.o \
> - lru_cache.o
> + lru_cache.o raid-stripe-tree.o
>
> btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
> btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 31ff36990404..ddbe6f8d4ea2 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -14,6 +14,7 @@
> #include "rcu-string.h"
> #include "zoned.h"
> #include "file-item.h"
> +#include "raid-stripe-tree.h"
>
> static struct bio_set btrfs_bioset;
> static struct bio_set btrfs_clone_bioset;
> @@ -415,6 +416,9 @@ static void btrfs_orig_write_end_io(struct bio *bio)
> else
> bio->bi_status = BLK_STS_OK;
>
> + if (bio_op(bio) == REQ_OP_ZONE_APPEND && !bio->bi_status)
> + stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> +
> btrfs_orig_bbio_end_io(bbio);
> btrfs_put_bioc(bioc);
> }
> @@ -426,6 +430,8 @@ static void btrfs_clone_write_end_io(struct bio *bio)
> if (bio->bi_status) {
> atomic_inc(&stripe->bioc->error);
> btrfs_log_dev_io_error(bio, stripe->dev);
> + } else if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> + stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> }
>
> /* Pass on control to the original bio this one was cloned from */
> @@ -487,6 +493,7 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
> bio->bi_private = &bioc->stripes[dev_nr];
> bio->bi_iter.bi_sector = bioc->stripes[dev_nr].physical >> SECTOR_SHIFT;
> bioc->stripes[dev_nr].bioc = bioc;
> + bioc->size = bio->bi_iter.bi_size;
> btrfs_submit_dev_bio(bioc->stripes[dev_nr].dev, bio);
> }
>
> @@ -496,6 +503,8 @@ static void __btrfs_submit_bio(struct bio *bio, struct btrfs_io_context *bioc,
> if (!bioc) {
> /* Single mirror read/write fast path. */
> btrfs_bio(bio)->mirror_num = mirror_num;
> + if (bio_op(bio) != REQ_OP_READ)
> + btrfs_bio(bio)->orig_physical = smap->physical;
> bio->bi_iter.bi_sector = smap->physical >> SECTOR_SHIFT;
> if (bio_op(bio) != REQ_OP_READ)
> btrfs_bio(bio)->orig_physical = smap->physical;
> @@ -688,6 +697,20 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
> bio->bi_opf |= REQ_OP_ZONE_APPEND;
> }
>
> + if (is_data_bbio(bbio) && bioc &&
> + btrfs_need_stripe_tree_update(bioc->fs_info,
> + bioc->map_type)) {
> + /*
> + * No locking for the list update, as we only add to
> + * the list in the I/O submission path, and list
> + * iteration only happens in the completion path,
> + * which can't happen until after the last submission.
> + */
> + btrfs_get_bioc(bioc);
> + list_add_tail(&bioc->ordered_entry,
> + &bbio->ordered->bioc_list);
> + }
> +
> /*
> * Csum items for reloc roots have already been cloned at this
> * point, so they are handled as part of the no-checksum case.
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 6f6838226fe7..2e11a699ab77 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -42,6 +42,7 @@
> #include "file-item.h"
> #include "orphan.h"
> #include "tree-checker.h"
> +#include "raid-stripe-tree.h"
>
> #undef SCRAMBLE_DELAYED_REFS
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index bafca05940d7..6f71630248da 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -71,6 +71,7 @@
> #include "super.h"
> #include "orphan.h"
> #include "backref.h"
> +#include "raid-stripe-tree.h"
>
> struct btrfs_iget_args {
> u64 ino;
> @@ -3091,6 +3092,10 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
>
> trans->block_rsv = &inode->block_rsv;
>
> + ret = btrfs_insert_raid_extent(trans, ordered_extent);
> + if (ret)
> + goto out;
> +
> if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered_extent->flags))
> compress_type = ordered_extent->compress_type;
> if (test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags)) {
> @@ -3224,7 +3229,8 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
> int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered)
> {
> if (btrfs_is_zoned(btrfs_sb(ordered->inode->i_sb)) &&
> - !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags))
> + !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags) &&
> + list_empty(&ordered->bioc_list))
> btrfs_finish_ordered_zoned(ordered);
> return btrfs_finish_one_ordered(ordered);
> }
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 345c449d588c..55c7d5543265 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -191,6 +191,7 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> INIT_LIST_HEAD(&entry->log_list);
> INIT_LIST_HEAD(&entry->root_extent_list);
> INIT_LIST_HEAD(&entry->work_list);
> + INIT_LIST_HEAD(&entry->bioc_list);
> init_completion(&entry->completion);
>
> /*
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index 173bd5c5df26..1c51ac57e5df 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -151,6 +151,8 @@ struct btrfs_ordered_extent {
> struct completion completion;
> struct btrfs_work flush_work;
> struct list_head work_list;
> +
> + struct list_head bioc_list;
> };
>
> static inline void
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> new file mode 100644
> index 000000000000..2415698a8fef
> --- /dev/null
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -0,0 +1,266 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2023 Western Digital Corporation or its affiliates.
> + */
> +
> +#include <linux/btrfs_tree.h>
> +
> +#include "ctree.h"
> +#include "fs.h"
> +#include "accessors.h"
> +#include "transaction.h"
> +#include "disk-io.h"
> +#include "raid-stripe-tree.h"
> +#include "volumes.h"
> +#include "misc.h"
> +#include "print-tree.h"
> +
> +static u8 btrfs_bg_type_to_raid_encoding(u64 map_type)
> +{
> + switch (map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
> + case BTRFS_BLOCK_GROUP_DUP:
> + return BTRFS_STRIPE_DUP;
> + case BTRFS_BLOCK_GROUP_RAID0:
> + return BTRFS_STRIPE_RAID0;
> + case BTRFS_BLOCK_GROUP_RAID1:
> + return BTRFS_STRIPE_RAID1;
> + case BTRFS_BLOCK_GROUP_RAID1C3:
> + return BTRFS_STRIPE_RAID1C3;
> + case BTRFS_BLOCK_GROUP_RAID1C4:
> + return BTRFS_STRIPE_RAID1C4;
> + case BTRFS_BLOCK_GROUP_RAID5:
> + return BTRFS_STRIPE_RAID5;
> + case BTRFS_BLOCK_GROUP_RAID6:
> + return BTRFS_STRIPE_RAID6;
> + case BTRFS_BLOCK_GROUP_RAID10:
> + return BTRFS_STRIPE_RAID10;
> + default:
> + ASSERT(0);
> + }
> +}
> +
> +static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
> + int num_stripes,
> + struct btrfs_io_context *bioc)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_key stripe_key;
> + struct btrfs_root *stripe_root = btrfs_stripe_tree_root(fs_info);
> + u8 encoding = btrfs_bg_type_to_raid_encoding(bioc->map_type);
> + struct btrfs_stripe_extent *stripe_extent;
> + size_t item_size;
> + int ret;
> +
> + item_size = struct_size(stripe_extent, strides, num_stripes);
I guess David has already pointed out that this can be done at
initialization and made const.
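I.e. something like:

	const size_t item_size = struct_size(stripe_extent, strides,
					     num_stripes);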
> +
> + stripe_extent = kzalloc(item_size, GFP_NOFS);
> + if (!stripe_extent) {
> + btrfs_abort_transaction(trans, -ENOMEM);
> + btrfs_end_transaction(trans);
> + return -ENOMEM;
> + }
> +
> + btrfs_set_stack_stripe_extent_encoding(stripe_extent, encoding);
> + for (int i = 0; i < num_stripes; i++) {
> + u64 devid = bioc->stripes[i].dev->devid;
> + u64 physical = bioc->stripes[i].physical;
> + u64 length = bioc->stripes[i].length;
> + struct btrfs_raid_stride *raid_stride =
> + &stripe_extent->strides[i];
> +
> + if (length == 0)
> + length = bioc->size;
> +
> + btrfs_set_stack_raid_stride_devid(raid_stride, devid);
> + btrfs_set_stack_raid_stride_physical(raid_stride, physical);
> + btrfs_set_stack_raid_stride_length(raid_stride, length);
> + }
> +
> + stripe_key.objectid = bioc->logical;
> + stripe_key.type = BTRFS_RAID_STRIPE_KEY;
> + stripe_key.offset = bioc->size;
> +
> + ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
> + item_size);
Have you tested in a near-real-world workload how contiguous the RST items
can be for RAID0/RAID10?
My concern here is that we may want to try our best to reduce the size of
the RST, due to the 64K BTRFS_STRIPE_LEN.
> + if (ret)
> + btrfs_abort_transaction(trans, ret);
> +
> + kfree(stripe_extent);
> +
> + return ret;
> +}
> +
> +static int btrfs_insert_mirrored_raid_extents(struct btrfs_trans_handle *trans,
> + struct btrfs_ordered_extent *ordered,
> + u64 map_type)
> +{
> + int num_stripes = btrfs_bg_type_to_factor(map_type);
> + struct btrfs_io_context *bioc;
> + int ret;
> +
> + list_for_each_entry(bioc, &ordered->bioc_list, ordered_entry) {
> + ret = btrfs_insert_one_raid_extent(trans, num_stripes, bioc);
> + if (ret)
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +static int btrfs_insert_striped_mirrored_raid_extents(
> + struct btrfs_trans_handle *trans,
> + struct btrfs_ordered_extent *ordered,
> + u64 map_type)
> +{
> + struct btrfs_io_context *bioc;
> + struct btrfs_io_context *rbioc;
> + const int nstripes = list_count_nodes(&ordered->bioc_list);
> + const int index = btrfs_bg_flags_to_raid_index(map_type);
> + const int substripes = btrfs_raid_array[index].sub_stripes;
> + const int max_stripes = trans->fs_info->fs_devices->rw_devices / 2;
> + int left = nstripes;
> + int stripe = 0, j = 0;
> + int i = 0;
> + int ret = 0;
> + u64 stripe_end;
> + u64 prev_end;
> +
> + if (nstripes == 1)
> + return btrfs_insert_mirrored_raid_extents(trans, ordered, map_type);
> +
> + rbioc = kzalloc(struct_size(rbioc, stripes, nstripes * substripes),
> + GFP_KERNEL);
> + if (!rbioc)
> + return -ENOMEM;
> +
> + rbioc->map_type = map_type;
> + rbioc->logical = list_first_entry(&ordered->bioc_list, typeof(*rbioc),
> + ordered_entry)->logical;
> +
> + stripe_end = rbioc->logical;
> + prev_end = stripe_end;
> + list_for_each_entry(bioc, &ordered->bioc_list, ordered_entry) {
> +
> + rbioc->size += bioc->size;
> + for (j = 0; j < substripes; j++) {
> + stripe = i + j;
> + rbioc->stripes[stripe].dev = bioc->stripes[j].dev;
> + rbioc->stripes[stripe].physical = bioc->stripes[j].physical;
> + rbioc->stripes[stripe].length = bioc->size;
> + }
> +
> + stripe_end += rbioc->size;
> + if (i >= nstripes ||
> + (stripe_end - prev_end >= max_stripes * BTRFS_STRIPE_LEN)) {
> + ret = btrfs_insert_one_raid_extent(trans,
> + nstripes * substripes,
> + rbioc);
> + if (ret)
> + goto out;
> +
> + left -= nstripes;
> + i = 0;
> + rbioc->logical += rbioc->size;
> + rbioc->size = 0;
> + } else {
> + i += substripes;
> + prev_end = stripe_end;
> + }
> + }
> +
> + if (left) {
> + bioc = list_prev_entry(bioc, ordered_entry);
> + ret = btrfs_insert_one_raid_extent(trans, substripes, bioc);
> + }
> +
> +out:
> + kfree(rbioc);
> + return ret;
> +}
> +
> +static int btrfs_insert_striped_raid_extents(struct btrfs_trans_handle *trans,
> + struct btrfs_ordered_extent *ordered,
> + u64 map_type)
> +{
> + struct btrfs_io_context *bioc;
> + struct btrfs_io_context *rbioc;
> + const int nstripes = list_count_nodes(&ordered->bioc_list);
> + int i = 0;
> + int ret = 0;
> +
> + rbioc = kzalloc(struct_size(rbioc, stripes, nstripes), GFP_KERNEL);
> + if (!rbioc)
> + return -ENOMEM;
> + rbioc->map_type = map_type;
> + rbioc->logical = list_first_entry(&ordered->bioc_list, typeof(*rbioc),
> + ordered_entry)->logical;
> +
> + list_for_each_entry(bioc, &ordered->bioc_list, ordered_entry) {
> + rbioc->size += bioc->size;
> + rbioc->stripes[i].dev = bioc->stripes[0].dev;
> + rbioc->stripes[i].physical = bioc->stripes[0].physical;
> + rbioc->stripes[i].length = bioc->size;
> +
> + if (i == nstripes - 1) {
> + ret = btrfs_insert_one_raid_extent(trans, nstripes, rbioc);
> + if (ret)
> + goto out;
> +
> + i = 0;
> + rbioc->logical += rbioc->size;
> + rbioc->size = 0;
> + } else {
> + i++;
> + }
> + }
> +
> + if (i && i < nstripes - 1)
> + ret = btrfs_insert_one_raid_extent(trans, i, rbioc);
> +
> +out:
> + kfree(rbioc);
> + return ret;
> +}
> +
> +int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
> + struct btrfs_ordered_extent *ordered_extent)
> +{
> + struct btrfs_io_context *bioc;
> + u64 map_type;
> + int ret;
> +
> + if (!trans->fs_info->stripe_root)
> + return 0;
> +
> + map_type = list_first_entry(&ordered_extent->bioc_list, typeof(*bioc),
> + ordered_entry)->map_type;
> +
> + switch (map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
> + case BTRFS_BLOCK_GROUP_DUP:
> + case BTRFS_BLOCK_GROUP_RAID1:
> + case BTRFS_BLOCK_GROUP_RAID1C3:
> + case BTRFS_BLOCK_GROUP_RAID1C4:
> + ret = btrfs_insert_mirrored_raid_extents(trans, ordered_extent,
> + map_type);
> + break;
> + case BTRFS_BLOCK_GROUP_RAID0:
> + ret = btrfs_insert_striped_raid_extents(trans, ordered_extent,
> + map_type);
> + break;
> + case BTRFS_BLOCK_GROUP_RAID10:
> + ret = btrfs_insert_striped_mirrored_raid_extents(trans, ordered_extent, map_type);
> + break;
> + default:
> + ret = -EINVAL;
Maybe we want to be a little more noisy?
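E.g. (untested):

	default:
		btrfs_err(trans->fs_info,
			  "trying to insert raid extent for unknown profile 0x%llx",
			  map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
		ret = -EINVAL;
		break;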
Thanks,
Qu
> + break;
> + }
> +
> + while (!list_empty(&ordered_extent->bioc_list)) {
> + bioc = list_first_entry(&ordered_extent->bioc_list,
> + typeof(*bioc), ordered_entry);
> + list_del(&bioc->ordered_entry);
> + btrfs_put_bioc(bioc);
> + }
> +
> + return ret;
> +}
> diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
> new file mode 100644
> index 000000000000..f36e4c2d46b0
> --- /dev/null
> +++ b/fs/btrfs/raid-stripe-tree.h
> @@ -0,0 +1,34 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2023 Western Digital Corporation or its affiliates.
> + */
> +
> +#ifndef BTRFS_RAID_STRIPE_TREE_H
> +#define BTRFS_RAID_STRIPE_TREE_H
> +
> +#include "disk-io.h"
> +
> +struct btrfs_io_context;
> +struct btrfs_io_stripe;
> +
> +int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
> + struct btrfs_ordered_extent *ordered_extent);
> +
> +static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
> + u64 map_type)
> +{
> + u64 type = map_type & BTRFS_BLOCK_GROUP_TYPE_MASK;
> + u64 profile = map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
> +
> + if (!btrfs_stripe_tree_root(fs_info))
> + return false;
> +
> + if (type != BTRFS_BLOCK_GROUP_DATA)
> + return false;
> +
> + if (profile & BTRFS_BLOCK_GROUP_RAID1_MASK)
> + return true;
> +
> + return false;
> +}
> +#endif
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 871a55d36e32..0c0fd4eb4848 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5881,6 +5881,7 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
> }
>
> static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
> + u64 logical,
> u16 total_stripes)
> {
> struct btrfs_io_context *bioc;
> @@ -5900,6 +5901,7 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
> bioc->fs_info = fs_info;
> bioc->replace_stripe_src = -1;
> bioc->full_stripe_logical = (u64)-1;
> + bioc->logical = logical;
>
> return bioc;
> }
> @@ -6434,7 +6436,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> goto out;
> }
>
> - bioc = alloc_btrfs_io_context(fs_info, num_alloc_stripes);
> + bioc = alloc_btrfs_io_context(fs_info, logical, num_alloc_stripes);
> if (!bioc) {
> ret = -ENOMEM;
> goto out;
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 576bfcb5b764..8604bfbbf510 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -390,12 +390,11 @@ struct btrfs_fs_devices {
>
> struct btrfs_io_stripe {
> struct btrfs_device *dev;
> - union {
> - /* Block mapping */
> - u64 physical;
> - /* For the endio handler */
> - struct btrfs_io_context *bioc;
> - };
> + /* Block mapping */
> + u64 physical;
> + u64 length;
> + /* For the endio handler */
> + struct btrfs_io_context *bioc;
> };
>
> struct btrfs_discard_stripe {
> @@ -428,6 +427,10 @@ struct btrfs_io_context {
> atomic_t error;
> u16 max_errors;
>
> + u64 logical;
> + u64 size;
> + struct list_head ordered_entry;
> +
> /*
> * The total number of stripes, including the extra duplicated
> * stripe for replace.
>
* Re: [PATCH v8 02/11] btrfs: read raid-stripe-tree from disk
2023-09-11 12:52 ` [PATCH v8 02/11] btrfs: read raid-stripe-tree from disk Johannes Thumshirn
@ 2023-09-14 9:27 ` Qu Wenruo
2023-09-14 9:33 ` Johannes Thumshirn
0 siblings, 1 reply; 39+ messages in thread
From: Qu Wenruo @ 2023-09-14 9:27 UTC (permalink / raw)
To: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel, Anand Jain
On 2023/9/11 22:22, Johannes Thumshirn wrote:
> If we find a raid-stripe-tree on mount, read it from disk.
>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> Reviewed-by: Anand Jain <anand.jain@oracle.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/block-rsv.c | 6 ++++++
> fs/btrfs/disk-io.c | 18 ++++++++++++++++++
> fs/btrfs/disk-io.h | 5 +++++
> fs/btrfs/fs.h | 1 +
> include/uapi/linux/btrfs.h | 1 +
> 5 files changed, 31 insertions(+)
>
> diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
> index 77684c5e0c8b..4e55e5f30f7f 100644
> --- a/fs/btrfs/block-rsv.c
> +++ b/fs/btrfs/block-rsv.c
> @@ -354,6 +354,11 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
> min_items++;
> }
>
> + if (btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE)) {
> + num_bytes += btrfs_root_used(&fs_info->stripe_root->root_item);
> + min_items++;
> + }
> +
> /*
> * But we also want to reserve enough space so we can do the fallback
> * global reserve for an unlink, which is an additional
> @@ -405,6 +410,7 @@ void btrfs_init_root_block_rsv(struct btrfs_root *root)
> case BTRFS_EXTENT_TREE_OBJECTID:
> case BTRFS_FREE_SPACE_TREE_OBJECTID:
> case BTRFS_BLOCK_GROUP_TREE_OBJECTID:
> + case BTRFS_RAID_STRIPE_TREE_OBJECTID:
> root->block_rsv = &fs_info->delayed_refs_rsv;
> break;
> case BTRFS_ROOT_TREE_OBJECTID:
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 4c5d71065ea8..1ecebcfc1c17 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1179,6 +1179,8 @@ static struct btrfs_root *btrfs_get_global_root(struct btrfs_fs_info *fs_info,
> return btrfs_grab_root(fs_info->block_group_root);
> case BTRFS_FREE_SPACE_TREE_OBJECTID:
> return btrfs_grab_root(btrfs_global_root(fs_info, &key));
> + case BTRFS_RAID_STRIPE_TREE_OBJECTID:
> + return btrfs_grab_root(fs_info->stripe_root);
> default:
> return NULL;
> }
> @@ -1259,6 +1261,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
> btrfs_put_root(fs_info->fs_root);
> btrfs_put_root(fs_info->data_reloc_root);
> btrfs_put_root(fs_info->block_group_root);
> + btrfs_put_root(fs_info->stripe_root);
> btrfs_check_leaked_roots(fs_info);
> btrfs_extent_buffer_leak_debug_check(fs_info);
> kfree(fs_info->super_copy);
> @@ -1804,6 +1807,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, bool free_chunk_root)
> free_root_extent_buffers(info->fs_root);
> free_root_extent_buffers(info->data_reloc_root);
> free_root_extent_buffers(info->block_group_root);
> + free_root_extent_buffers(info->stripe_root);
> if (free_chunk_root)
> free_root_extent_buffers(info->chunk_root);
> }
> @@ -2280,6 +2284,20 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
> fs_info->uuid_root = root;
> }
>
> + if (btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE)) {
> + location.objectid = BTRFS_RAID_STRIPE_TREE_OBJECTID;
> + root = btrfs_read_tree_root(tree_root, &location);
> + if (IS_ERR(root)) {
> + if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) {
> + ret = PTR_ERR(root);
> + goto out;
> + }
> + } else {
> + set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> + fs_info->stripe_root = root;
> + }
> + }
> +
> return 0;
> out:
> btrfs_warn(fs_info, "failed to read root (objectid=%llu): %d",
> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
> index 02b645744a82..8b7f01a01c44 100644
> --- a/fs/btrfs/disk-io.h
> +++ b/fs/btrfs/disk-io.h
> @@ -103,6 +103,11 @@ static inline struct btrfs_root *btrfs_grab_root(struct btrfs_root *root)
> return NULL;
> }
>
> +static inline struct btrfs_root *btrfs_stripe_tree_root(struct btrfs_fs_info *fs_info)
> +{
> + return fs_info->stripe_root;
> +}
> +
Do we really need this? IIRC we never have a wrapper for fs_info->fs_root.
Thanks,
Qu
> void btrfs_put_root(struct btrfs_root *root);
> void btrfs_mark_buffer_dirty(struct extent_buffer *buf);
> int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index d84a390336fc..5c7778e8b5ed 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -367,6 +367,7 @@ struct btrfs_fs_info {
> struct btrfs_root *uuid_root;
> struct btrfs_root *data_reloc_root;
> struct btrfs_root *block_group_root;
> + struct btrfs_root *stripe_root;
>
> /* The log root tree is a directory of all the other log roots */
> struct btrfs_root *log_root_tree;
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index dbb8b96da50d..b9a1d9af8ae8 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -333,6 +333,7 @@ struct btrfs_ioctl_fs_info_args {
> #define BTRFS_FEATURE_INCOMPAT_RAID1C34 (1ULL << 11)
> #define BTRFS_FEATURE_INCOMPAT_ZONED (1ULL << 12)
> #define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2 (1ULL << 13)
> +#define BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE (1ULL << 14)
>
> struct btrfs_ioctl_feature_flags {
> __u64 compat_flags;
>
* Re: [PATCH v8 02/11] btrfs: read raid-stripe-tree from disk
2023-09-14 9:27 ` Qu Wenruo
@ 2023-09-14 9:33 ` Johannes Thumshirn
0 siblings, 0 replies; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-14 9:33 UTC (permalink / raw)
To: Qu Wenruo, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org,
Anand Jain
On 14.09.23 11:27, Qu Wenruo wrote:
>> +static inline struct btrfs_root *btrfs_stripe_tree_root(struct btrfs_fs_info *fs_info)
>> +{
>> + return fs_info->stripe_root;
>> +}
>> +
>
> Do we really need this? IIRC we never have a wrapper for fs_info->fs_root.
This was requested by Josef a while ago, to make the conversion to
per-block-group stripe trees easier. But hch also wanted me to remove it
(and I thought I already had), so lemme get rid of it if Josef doesn't
speak up.
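If it goes away, the callers can simply test the pointer directly, e.g.
(trivial sketch, mirroring what btrfs_insert_raid_extent() already does):

        if (!fs_info->stripe_root)
                return false;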
* Re: [PATCH v8 05/11] btrfs: lookup physical address from stripe extent
2023-09-14 9:18 ` Qu Wenruo
@ 2023-09-14 9:45 ` Johannes Thumshirn
2023-09-14 14:16 ` Johannes Thumshirn
1 sibling, 0 replies; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-14 9:45 UTC (permalink / raw)
To: Qu Wenruo, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 14.09.23 11:18, Qu Wenruo wrote:
>> + if (ret) {
>> + if (path->slots[0] != 0)
>> + path->slots[0]--;
>
> IIRC we have btrfs_previous_item() to do the forward search.
>
>> + }
>> +
>> + end = logical + *length;
>
> IMHO we can make it const and initialize it at the definition part.
Right.
>> +
>> + while (1) {
>
> Here we can only hit at most one RST item, thus I'd recommend removing
> the while().
>
> Although this would mean we will need an if () to handle the (ret > 0) case,
> it may still be a little easier to read than a loop.
>
> You may want to refer to btrfs_lookup_csum() for the non-loop
> implementation.
Sure I'll look into it.
>> +
>> + if (encoding != btrfs_bg_type_to_raid_encoding(map_type)) {
>> + ret = -ENOENT;
>> + goto out;
>
> This looks like a very weird situation: we have a bg with a different type.
> Should we print a warning, or is there some valid situation for this?
>
Yep and probably return -EUCLEAN and set the FS to r/o.
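Roughly something like this (just a sketch, exact message wording TBD, and
assuming fs_info is available at that point in the lookup function):

        if (encoding != btrfs_bg_type_to_raid_encoding(map_type)) {
                ret = -EUCLEAN;
                btrfs_handle_fs_error(fs_info, ret,
                        "on-disk stripe encoding %u doesn't match block group profile 0x%llx",
                        encoding,
                        map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
                goto out;
        }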
>> +out:
>> + if (ret > 0)
>> + ret = -ENOENT;
>> + if (ret && ret != -EIO) {
>> + /*
>> + * Check if the range we're looking for is actually backed by
>> + * an extent. This can happen, e.g. when scrub is running on a
>> + * block-group and the extent it is trying to scrub gets
>> + * deleted in the meantime. Although scrub is setting the
>> + * block-group to read-only, deletion of extents is still
>> + * allowed. If the extent is gone, simply return ENOENT and be
>> + * good.
>> + */
>
> As mentioned in the next patch (sorry for the reversed order), this
> should be handled in a different way (by only searching commit root for
> scrub usage).
Yep I already have a prototype for that, but it needs more testing.
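For the scrub case that essentially boils down to the usual commit-root
search pattern on the path (sketch only, whether the prototype does exactly
this is still open):

        path->search_commit_root = 1;
        path->skip_locking = 1;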
>>
>> -static void set_io_stripe(struct btrfs_io_stripe *dst, const struct map_lookup *map,
>> - u32 stripe_index, u64 stripe_offset, u32 stripe_nr)
>> +static int set_io_stripe(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>> + u64 logical, u64 *length, struct btrfs_io_stripe *dst,
>> + struct map_lookup *map, u32 stripe_index,
>> + u64 stripe_offset, u64 stripe_nr)
> Do we need @length to be a pointer?
> IIRC we can return the number of bytes we mapped, or <0 for errors.
> Thus at least @length doesn't need to be a pointer.
Good point, I'll update.
* Re: [PATCH v8 03/11] btrfs: add support for inserting raid stripe extents
2023-09-14 9:25 ` Qu Wenruo
@ 2023-09-14 9:51 ` Johannes Thumshirn
2023-09-14 10:06 ` Qu Wenruo
0 siblings, 1 reply; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-14 9:51 UTC (permalink / raw)
To: Qu Wenruo, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 14.09.23 11:25, Qu Wenruo wrote:
>> +static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
>> + int num_stripes,
>> + struct btrfs_io_context *bioc)
>> +{
>> + struct btrfs_fs_info *fs_info = trans->fs_info;
>> + struct btrfs_key stripe_key;
>> + struct btrfs_root *stripe_root = btrfs_stripe_tree_root(fs_info);
>> + u8 encoding = btrfs_bg_type_to_raid_encoding(bioc->map_type);
>> + struct btrfs_stripe_extent *stripe_extent;
>> + size_t item_size;
>> + int ret;
>> +
>> + item_size = struct_size(stripe_extent, strides, num_stripes);
>
> I guess David has already pointed out this can be done at initialization
> and made const.
Will do
>
>> +
>> + stripe_extent = kzalloc(item_size, GFP_NOFS);
>> + if (!stripe_extent) {
>> + btrfs_abort_transaction(trans, -ENOMEM);
>> + btrfs_end_transaction(trans);
>> + return -ENOMEM;
>> + }
>> +
>> + btrfs_set_stack_stripe_extent_encoding(stripe_extent, encoding);
>> + for (int i = 0; i < num_stripes; i++) {
>> + u64 devid = bioc->stripes[i].dev->devid;
>> + u64 physical = bioc->stripes[i].physical;
>> + u64 length = bioc->stripes[i].length;
>> + struct btrfs_raid_stride *raid_stride =
>> + &stripe_extent->strides[i];
>> +
>> + if (length == 0)
>> + length = bioc->size;
>> +
>> + btrfs_set_stack_raid_stride_devid(raid_stride, devid);
>> + btrfs_set_stack_raid_stride_physical(raid_stride, physical);
>> + btrfs_set_stack_raid_stride_length(raid_stride, length);
>> + }
>> +
>> + stripe_key.objectid = bioc->logical;
>> + stripe_key.type = BTRFS_RAID_STRIPE_KEY;
>> + stripe_key.offset = bioc->size;
>> +
>> + ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
>> + item_size);
>
> Have you tested in a near-real-world setup how continuous the RST items
> could be for RAID0/RAID10?
>
> My concern here is, we may want to try our best to reduce the size of
> RST, due to the 64K BTRFS_STRIPE_LEN.
>
There are two things I can do for it. The first is trying to merge contiguous
RST items, and the second is making BTRFS_STRIPE_LEN a mkfs-time constant
instead of a compile-time constant.
But both can be done in a second step.
>> + switch (map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
>> + case BTRFS_BLOCK_GROUP_DUP:
>> + case BTRFS_BLOCK_GROUP_RAID1:
>> + case BTRFS_BLOCK_GROUP_RAID1C3:
>> + case BTRFS_BLOCK_GROUP_RAID1C4:
>> + ret = btrfs_insert_mirrored_raid_extents(trans, ordered_extent,
>> + map_type);
>> + break;
>> + case BTRFS_BLOCK_GROUP_RAID0:
>> + ret = btrfs_insert_striped_raid_extents(trans, ordered_extent,
>> + map_type);
>> + break;
>> + case BTRFS_BLOCK_GROUP_RAID10:
>> + ret = btrfs_insert_striped_mirrored_raid_extents(trans, ordered_extent, map_type);
>> + break;
>> + default:
>> + ret = -EINVAL;
>
> Maybe we want to be a little more noisy?
OK.
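Something along these lines, as a sketch (final message wording TBD):

        default:
                btrfs_err(trans->fs_info,
                          "trying to insert RAID stripe extent for unknown block group profile 0x%llx",
                          map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
                ret = -EINVAL;
                break;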
* Re: [PATCH v8 03/11] btrfs: add support for inserting raid stripe extents
2023-09-14 9:51 ` Johannes Thumshirn
@ 2023-09-14 10:06 ` Qu Wenruo
2023-09-14 15:35 ` Johannes Thumshirn
0 siblings, 1 reply; 39+ messages in thread
From: Qu Wenruo @ 2023-09-14 10:06 UTC (permalink / raw)
To: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 2023/9/14 19:21, Johannes Thumshirn wrote:
> On 14.09.23 11:25, Qu Wenruo wrote:
>>> +static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
>>> + int num_stripes,
>>> + struct btrfs_io_context *bioc)
>>> +{
>>> + struct btrfs_fs_info *fs_info = trans->fs_info;
>>> + struct btrfs_key stripe_key;
>>> + struct btrfs_root *stripe_root = btrfs_stripe_tree_root(fs_info);
>>> + u8 encoding = btrfs_bg_type_to_raid_encoding(bioc->map_type);
>>> + struct btrfs_stripe_extent *stripe_extent;
>>> + size_t item_size;
>>> + int ret;
>>> +
>>> + item_size = struct_size(stripe_extent, strides, num_stripes);
>>
>> I guess David has already pointed out this can be done at initialization
>> and made const.
>
> Will do
>
>>
>>> +
>>> + stripe_extent = kzalloc(item_size, GFP_NOFS);
>>> + if (!stripe_extent) {
>>> + btrfs_abort_transaction(trans, -ENOMEM);
>>> + btrfs_end_transaction(trans);
>>> + return -ENOMEM;
>>> + }
>>> +
>>> + btrfs_set_stack_stripe_extent_encoding(stripe_extent, encoding);
>>> + for (int i = 0; i < num_stripes; i++) {
>>> + u64 devid = bioc->stripes[i].dev->devid;
>>> + u64 physical = bioc->stripes[i].physical;
>>> + u64 length = bioc->stripes[i].length;
>>> + struct btrfs_raid_stride *raid_stride =
>>> + &stripe_extent->strides[i];
>>> +
>>> + if (length == 0)
>>> + length = bioc->size;
>>> +
>>> + btrfs_set_stack_raid_stride_devid(raid_stride, devid);
>>> + btrfs_set_stack_raid_stride_physical(raid_stride, physical);
>>> + btrfs_set_stack_raid_stride_length(raid_stride, length);
>>> + }
>>> +
>>> + stripe_key.objectid = bioc->logical;
>>> + stripe_key.type = BTRFS_RAID_STRIPE_KEY;
>>> + stripe_key.offset = bioc->size;
>>> +
>>> + ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
>>> + item_size);
>>
>> Have you tested in a near-real-world setup how continuous the RST items
>> could be for RAID0/RAID10?
>>
>> My concern here is, we may want to try our best to reduce the size of
>> RST, due to the 64K BTRFS_STRIPE_LEN.
>>
>
> There are two things I can do for it. The first is trying to merge contiguous
> RST items,
This is much easier, as the RST lookup code is already taking the length
into consideration, thus only the add path needs some work.
Although I'm not sure how effective it would be in the real world.
If the merge rate is only 5%, it barely makes a difference.
Maybe you don't need to implement a full merge in this version, but just
do some trace events to see the merge rate?
> and the second is making BTRFS_STRIPE_LEN a mkfs-time constant
> instead of a compile-time constant.
Please be very careful about this, we have quite a few bitmaps relying on
this (IIRC RAID56 and scrub).
Currently an unsigned long can only hold 64 bits, so with one bit per 4K
sector the maximum stripe length would be 256K, but I'm pretty sure there
would be other hidden traps somewhere else.
Otherwise the main workflow of RST looks good to me.
Thanks,
Qu
>
> But both can be done in a second step.
>
>>> + switch (map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
>>> + case BTRFS_BLOCK_GROUP_DUP:
>>> + case BTRFS_BLOCK_GROUP_RAID1:
>>> + case BTRFS_BLOCK_GROUP_RAID1C3:
>>> + case BTRFS_BLOCK_GROUP_RAID1C4:
>>> + ret = btrfs_insert_mirrored_raid_extents(trans, ordered_extent,
>>> + map_type);
>>> + break;
>>> + case BTRFS_BLOCK_GROUP_RAID0:
>>> + ret = btrfs_insert_striped_raid_extents(trans, ordered_extent,
>>> + map_type);
>>> + break;
>>> + case BTRFS_BLOCK_GROUP_RAID10:
>>> + ret = btrfs_insert_striped_mirrored_raid_extents(trans, ordered_extent, map_type);
>>> + break;
>>> + default:
>>> + ret = -EINVAL;
>>
>> Maybe we want to be a little more noisy?
>
> OK.
>
* Re: [PATCH v8 05/11] btrfs: lookup physical address from stripe extent
2023-09-14 9:18 ` Qu Wenruo
2023-09-14 9:45 ` Johannes Thumshirn
@ 2023-09-14 14:16 ` Johannes Thumshirn
1 sibling, 0 replies; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-14 14:16 UTC (permalink / raw)
To: Qu Wenruo, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 14.09.23 11:18, Qu Wenruo wrote:
>> +
>> + while (1) {
>
> Here we can only hit at most one RST item, thus I'd recommend removing
> the while().
>
> Although this would mean we will need an if () to handle the (ret > 0) case,
> it may still be a little easier to read than a loop.
>
> You may want to refer to btrfs_lookup_csum() for the non-loop
> implementation.
Actually debug prints have shown that I do indeed sometimes hit the case
where I need to call btrfs_next_item(). So the loop has to stay.
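So the shape stays roughly like this (sketch only; error handling and the
actual range checks are elided, variable names follow the series):

        while (1) {
                btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]);

                /* ... bail out or copy the stride if found_key covers the range ... */

                ret = btrfs_next_item(stripe_root, path);
                if (ret)
                        break;
        }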
>> -static void set_io_stripe(struct btrfs_io_stripe *dst, const struct map_lookup *map,
>> - u32 stripe_index, u64 stripe_offset, u32 stripe_nr)
>> +static int set_io_stripe(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>> + u64 logical, u64 *length, struct btrfs_io_stripe *dst,
>> + struct map_lookup *map, u32 stripe_index,
>> + u64 stripe_offset, u64 stripe_nr)
> Do we need @length to be a pointer?
> IIRC we can return the number of bytes we mapped, or <0 for errors.
> Thus at least @length doesn't need to be a pointer.
I thought about this a bit more. btrfs_map_block() also gets length
passed in by reference, so it makes sense to do the same in
set_io_stripe() and btrfs_get_raid_extent_offset() IMHO.
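I.e. keep the prototype along these lines (sketch; the exact parameter list
in the series may differ):

        int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
                                         u64 logical, u64 *length,
                                         u64 map_type, u32 stripe_index,
                                         struct btrfs_io_stripe *stripe);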
* Re: [PATCH v8 03/11] btrfs: add support for inserting raid stripe extents
2023-09-14 10:06 ` Qu Wenruo
@ 2023-09-14 15:35 ` Johannes Thumshirn
0 siblings, 0 replies; 39+ messages in thread
From: Johannes Thumshirn @ 2023-09-14 15:35 UTC (permalink / raw)
To: Qu Wenruo, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 14.09.23 12:07, Qu Wenruo wrote:
>>>> + ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
>>>> + item_size);
>>>
>>> Have you tested in a near-real-world setup how continuous the RST items
>>> could be for RAID0/RAID10?
>>>
>>> My concern here is, we may want to try our best to reduce the size of
>>> RST, due to the 64K BTRFS_STRIPE_LEN.
>>>
>>
>> There are two things I can do for it. The first is trying to merge contiguous
>> RST items,
>
> This is much easier, as the RST lookup code is already taking the length
> into consideration, thus only the add path needs some work.
>
> Although I'm not sure how effective it would be in the real world.
> If the merge rate is only 5%, it barely makes a difference.
>
I think this will be very much workload-dependent. Also, just having
logically contiguous entries doesn't help much; all the physical strides
have to be contiguous as well.
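A merge candidate check would roughly have to look like this (hypothetical
helper, name and details made up just to illustrate the point):

        /* sketch: can two consecutive biocs be folded into one RST item? */
        static bool rst_can_merge(const struct btrfs_io_context *prev,
                                  const struct btrfs_io_context *next)
        {
                if (prev->num_stripes != next->num_stripes)
                        return false;

                /* logically contiguous? */
                if (prev->logical + prev->size != next->logical)
                        return false;

                /* every physical stride contiguous on the same device? */
                for (int i = 0; i < prev->num_stripes; i++) {
                        if (prev->stripes[i].dev != next->stripes[i].dev)
                                return false;
                        if (prev->stripes[i].physical + prev->stripes[i].length !=
                            next->stripes[i].physical)
                                return false;
                }

                return true;
        }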
> Maybe you don't need to implement a full merge in this version, but just
> do some trace events to see the merge rate?
>
>> and the second is making BTRFS_STRIPE_LEN a mkfs-time constant
>> instead of a compile-time constant.
>
> Please be very careful about this, we have quite a few bitmaps relying on
> this (IIRC RAID56 and scrub).
>
> Currently an unsigned long can only hold 64 bits, so with one bit per 4K
> sector the maximum stripe length would be 256K, but I'm pretty sure there
> would be other hidden traps somewhere else.
>
Yeah, both of these will be longer-term research projects as well. RAID56
has priority, then erasure coding.
Thread overview: 39+ messages (end of thread; newest: 2023-09-14 15:35 UTC)
2023-09-11 12:52 [PATCH v8 00/11] btrfs: introduce RAID stripe tree Johannes Thumshirn
2023-09-11 12:52 ` [PATCH v8 01/11] btrfs: add raid stripe tree definitions Johannes Thumshirn
2023-09-11 21:00 ` Damien Le Moal
2023-09-12 6:09 ` Johannes Thumshirn
2023-09-12 20:32 ` David Sterba
2023-09-13 6:02 ` Johannes Thumshirn
2023-09-13 14:49 ` David Sterba
2023-09-13 14:57 ` Johannes Thumshirn
2023-09-13 16:06 ` David Sterba
2023-09-11 12:52 ` [PATCH v8 02/11] btrfs: read raid-stripe-tree from disk Johannes Thumshirn
2023-09-14 9:27 ` Qu Wenruo
2023-09-14 9:33 ` Johannes Thumshirn
2023-09-11 12:52 ` [PATCH v8 03/11] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
2023-09-13 16:50 ` David Sterba
2023-09-13 16:57 ` David Sterba
2023-09-14 9:25 ` Qu Wenruo
2023-09-14 9:51 ` Johannes Thumshirn
2023-09-14 10:06 ` Qu Wenruo
2023-09-14 15:35 ` Johannes Thumshirn
2023-09-11 12:52 ` [PATCH v8 04/11] btrfs: delete stripe extent on extent deletion Johannes Thumshirn
2023-09-11 12:52 ` [PATCH v8 05/11] btrfs: lookup physical address from stripe extent Johannes Thumshirn
2023-09-14 9:18 ` Qu Wenruo
2023-09-14 9:45 ` Johannes Thumshirn
2023-09-14 14:16 ` Johannes Thumshirn
2023-09-11 12:52 ` [PATCH v8 06/11] btrfs: implement RST version of scrub Johannes Thumshirn
2023-09-13 9:51 ` Qu Wenruo
2023-09-13 16:59 ` David Sterba
2023-09-11 12:52 ` [PATCH v8 07/11] btrfs: zoned: allow zoned RAID Johannes Thumshirn
2023-09-12 20:49 ` David Sterba
2023-09-13 5:41 ` Johannes Thumshirn
2023-09-13 14:52 ` David Sterba
2023-09-13 14:59 ` Johannes Thumshirn
2023-09-11 12:52 ` [PATCH v8 08/11] btrfs: add raid stripe tree pretty printer Johannes Thumshirn
2023-09-12 20:42 ` David Sterba
2023-09-13 5:34 ` Johannes Thumshirn
2023-09-11 12:52 ` [PATCH v8 09/11] btrfs: announce presence of raid-stripe-tree in sysfs Johannes Thumshirn
2023-09-11 12:52 ` [PATCH v8 10/11] btrfs: add trace events for RST Johannes Thumshirn
2023-09-12 20:46 ` David Sterba
2023-09-11 12:52 ` [PATCH v8 11/11] btrfs: add raid-stripe-tree to features enabled with debug Johannes Thumshirn