* [PATCH v9 00/11] btrfs: introduce RAID stripe tree
@ 2023-09-14 16:06 Johannes Thumshirn
2023-09-14 16:06 ` [PATCH v9 01/11] btrfs: add raid stripe tree definitions Johannes Thumshirn
` (11 more replies)
0 siblings, 12 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2023-09-14 16:06 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel, Johannes Thumshirn
Updates of the raid-stripe-tree are done at ordered extent write time to save
on bandwidth, while for reading we do the stripe-tree lookup at bio mapping
time, i.e. when the logical to physical translation happens for regular btrfs
RAID as well.
The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
its contents are the respective physical device id and position.
For an example 1M write (split into 128K segments due to zone-append):
rapido2:/home/johannes/src/fstests# xfs_io -fdc "pwrite -b 1M 0 1M" -c fsync /mnt/test/test
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0065 sec (151.538 MiB/sec and 151.5381 ops/sec)
The tree will look as follows (both 128k buffered writes to a ZNS drive):
RAID0 case:
bash-5.2# btrfs inspect-internal dump-tree -t raid_stripe /dev/nvme0n1
btrfs-progs v6.3
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 805535744 items 1 free space 16218 generation 8 owner RAID_STRIPE_TREE
leaf 805535744 flags 0x1(WRITTEN) backref revision 1
checksum stored 2d2d2262
checksum calced 2d2d2262
fs uuid ab05cfc6-9859-404e-970d-3999b1cb5438
chunk uuid c9470ba2-49ac-4d46-8856-438a18e6bd23
item 0 key (1073741824 RAID_STRIPE_KEY 131072) itemoff 16243 itemsize 56
encoding: RAID0
stripe 0 devid 1 offset 805306368 length 131072
stripe 1 devid 2 offset 536870912 length 131072
total bytes 42949672960
bytes used 294912
uuid ab05cfc6-9859-404e-970d-3999b1cb5438
RAID1 case:
bash-5.2# btrfs inspect-internal dump-tree -t raid_stripe /dev/nvme0n1
btrfs-progs v6.3
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 805535744 items 1 free space 16218 generation 8 owner RAID_STRIPE_TREE
leaf 805535744 flags 0x1(WRITTEN) backref revision 1
checksum stored 56199539
checksum calced 56199539
fs uuid 9e693a37-fbd1-4891-aed2-e7fe64605045
chunk uuid 691874fc-1b9c-469b-bd7f-05e0e6ba88c4
item 0 key (939524096 RAID_STRIPE_KEY 131072) itemoff 16243 itemsize 56
encoding: RAID1
stripe 0 devid 1 offset 939524096 length 65536
stripe 1 devid 2 offset 536870912 length 65536
total bytes 42949672960
bytes used 294912
uuid 9e693a37-fbd1-4891-aed2-e7fe64605045
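The itemsize of 56 in both dumps is consistent with the on-disk layout defined in patch 01/11: an 8-byte header (encoding plus 7 reserved bytes) followed by one 24-byte stride per device. A minimal user-space sketch of that arithmetic; the `demo_` names are hypothetical mirrors of the packed structures and use host-endian fixed-width types instead of `__le64`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space mirror of struct btrfs_raid_stride. */
struct demo_raid_stride {
	uint64_t devid;    /* btrfs device id holding this stride */
	uint64_t physical; /* physical byte offset on that device */
	uint64_t length;   /* length of the stride on that device */
} __attribute__((__packed__));

/* Hypothetical user-space mirror of struct btrfs_stripe_extent. */
struct demo_stripe_extent {
	uint8_t encoding;
	uint8_t reserved[7];
	struct demo_raid_stride strides[]; /* one entry per device */
} __attribute__((__packed__));

static_assert(sizeof(struct demo_raid_stride) == 24, "3 x 64-bit fields");
static_assert(sizeof(struct demo_stripe_extent) == 8, "header only");

/* Size in bytes of a stripe extent item carrying num_strides strides. */
static size_t demo_stripe_extent_item_size(size_t num_strides)
{
	return sizeof(struct demo_stripe_extent) +
	       num_strides * sizeof(struct demo_raid_stride);
}
```

For the two-device items shown above, `demo_stripe_extent_item_size(2)` gives 8 + 2 * 24 = 56 bytes.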
A design document can be found here:
https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true
The user-space part of this series can be found here:
https://lore.kernel.org/linux-btrfs/20230215143109.2721722-1-johannes.thumshirn@wdc.com
Changes to v8:
- Changed tracepoints according to David's comments
- Mark on-disk structures as packed
- Got rid of __DECLARE_FLEX_ARRAY
- Rebase onto misc-next
- Split out helpers for new btrfs_load_block_group_zone_info RAID cases
- Constify declarations where possible
- Initialise variables before use
- Lower scope of variables
- Remove btrfs_stripe_root() helper
- Pick different BTRFS_RAID_STRIPE_KEY constant
- Reorder on-disk encoding types to match the raid_index
- And possibly more, please git range-diff the versions
- Link to v8: https://lore.kernel.org/r/20230911-raid-stripe-tree-v8-0-647676fa852c@wdc.com
Changes to v7:
- Huge rewrite
v7 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1677750131.git.johannes.thumshirn@wdc.com/
Changes to v6:
- Fix degraded RAID1 mounts
- Fix RAID0/10 mounts
v6 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1676470614.git.johannes.thumshirn@wdc.com
Changes to v5:
- Incorporated review comments from Josef and Christoph
- Rebased onto misc-next
v5 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1675853489.git.johannes.thumshirn@wdc.com
Changes to v4:
- Added patch to check for RST feature in sysfs
- Added RST lookups for scrubbing
- Fixed the error handling bug Josef pointed out
- Only check if we need to write out an RST once per delayed_ref head
- Added support for zoned data DUP with RST
Changes to v3:
- Rebased onto 20221120124734.18634-1-hch@lst.de
- Incorporated Josef's review
- Merged related patches
v3 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1666007330.git.johannes.thumshirn@wdc.com
Changes to v2:
- Bug fixes
- Rebased onto 20220901074216.1849941-1-hch@lst.de
- Added tracepoints
- Added leak checker
- Added RAID0 and RAID10
v2 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1656513330.git.johannes.thumshirn@wdc.com
Changes to v1:
- Write the stripe-tree at delayed-ref time (Qu)
- Add a different write path for preallocation
v1 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1652711187.git.johannes.thumshirn@wdc.com/
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
Johannes Thumshirn (11):
btrfs: add raid stripe tree definitions
btrfs: read raid-stripe-tree from disk
btrfs: add support for inserting raid stripe extents
btrfs: delete stripe extent on extent deletion
btrfs: lookup physical address from stripe extent
btrfs: implement RST version of scrub
btrfs: zoned: allow zoned RAID
btrfs: add raid stripe tree pretty printer
btrfs: announce presence of raid-stripe-tree in sysfs
btrfs: add trace events for RST
btrfs: add raid-stripe-tree to features enabled with debug
fs/btrfs/Makefile | 2 +-
fs/btrfs/accessors.h | 10 +
fs/btrfs/bio.c | 25 +++
fs/btrfs/block-rsv.c | 6 +
fs/btrfs/disk-io.c | 18 ++
fs/btrfs/extent-tree.c | 7 +
fs/btrfs/fs.h | 4 +-
fs/btrfs/inode.c | 8 +-
fs/btrfs/locking.c | 1 +
fs/btrfs/ordered-data.c | 1 +
fs/btrfs/ordered-data.h | 2 +
fs/btrfs/print-tree.c | 26 +++
fs/btrfs/raid-stripe-tree.c | 449 ++++++++++++++++++++++++++++++++++++++++
fs/btrfs/raid-stripe-tree.h | 52 +++++
fs/btrfs/scrub.c | 53 +++++
fs/btrfs/sysfs.c | 3 +
fs/btrfs/volumes.c | 43 +++-
fs/btrfs/volumes.h | 16 +-
fs/btrfs/zoned.c | 144 ++++++++++++-
include/trace/events/btrfs.h | 75 +++++++
include/uapi/linux/btrfs.h | 1 +
include/uapi/linux/btrfs_tree.h | 31 +++
22 files changed, 954 insertions(+), 23 deletions(-)
---
base-commit: 1d73023d96965a5c8fb76a39aec88d840ebe5b21
change-id: 20230613-raid-stripe-tree-e330c9a45cc3
Best regards,
--
Johannes Thumshirn <johannes.thumshirn@wdc.com>
* [PATCH v9 01/11] btrfs: add raid stripe tree definitions
2023-09-14 16:06 [PATCH v9 00/11] btrfs: introduce RAID stripe tree Johannes Thumshirn
@ 2023-09-14 16:06 ` Johannes Thumshirn
2023-09-15 0:22 ` Qu Wenruo
2023-09-14 16:06 ` [PATCH v9 02/11] btrfs: read raid-stripe-tree from disk Johannes Thumshirn
` (10 subsequent siblings)
11 siblings, 1 reply; 29+ messages in thread
From: Johannes Thumshirn @ 2023-09-14 16:06 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel, Johannes Thumshirn
Add definitions for the raid stripe tree. This tree will hold information
about the on-disk layout of the stripes in a RAID set.
Each stripe extent has a 1:1 relationship with an on-disk extent item and
performs the logical to per-drive physical address translation for the
extent item in question.
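As a rough illustration of that translation (this is not the lookup code from this series), the sketch below resolves a logical address inside one stripe extent to a physical address on one device. It assumes a mirrored profile in which every stride covers the extent from its start; `demo_stride`, `demo_map_logical` and the parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-endian stand-in for struct btrfs_raid_stride. */
struct demo_stride {
	uint64_t devid;
	uint64_t physical;
	uint64_t length;
};

/*
 * Translate 'logical' to a physical address on the device described by
 * 'stride'. 'extent_start' and 'extent_len' stand for the stripe extent
 * key's objectid and offset. Returns 0 on success, or -1 if 'logical' is
 * not covered by the extent or by the stride.
 */
static int demo_map_logical(const struct demo_stride *stride,
			    uint64_t extent_start, uint64_t extent_len,
			    uint64_t logical, uint64_t *physical)
{
	uint64_t offset;

	assert(stride && physical);

	if (logical < extent_start || logical - extent_start >= extent_len)
		return -1;

	offset = logical - extent_start;
	if (offset >= stride->length)
		return -1;

	*physical = stride->physical + offset;
	return 0;
}
```

Striped profiles would additionally need to pick the stride that holds the requested offset; that selection is left out here on purpose.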
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/accessors.h | 10 ++++++++++
fs/btrfs/locking.c | 1 +
include/uapi/linux/btrfs_tree.h | 31 +++++++++++++++++++++++++++++++
3 files changed, 42 insertions(+)
diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index f958eccff477..977ff160a024 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -306,6 +306,16 @@ BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
BTRFS_SETGET_STACK_FUNCS(stack_timespec_nsec, struct btrfs_timespec, nsec, 32);
+BTRFS_SETGET_FUNCS(stripe_extent_encoding, struct btrfs_stripe_extent, encoding, 8);
+BTRFS_SETGET_FUNCS(raid_stride_devid, struct btrfs_raid_stride, devid, 64);
+BTRFS_SETGET_FUNCS(raid_stride_physical, struct btrfs_raid_stride, physical, 64);
+BTRFS_SETGET_FUNCS(raid_stride_length, struct btrfs_raid_stride, length, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_stripe_extent_encoding,
+ struct btrfs_stripe_extent, encoding, 8);
+BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_devid, struct btrfs_raid_stride, devid, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_physical, struct btrfs_raid_stride, physical, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_length, struct btrfs_raid_stride, length, 64);
+
/* struct btrfs_dev_extent */
BTRFS_SETGET_FUNCS(dev_extent_chunk_tree, struct btrfs_dev_extent, chunk_tree, 64);
BTRFS_SETGET_FUNCS(dev_extent_chunk_objectid, struct btrfs_dev_extent,
diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index 6ac4fd8cc8dc..74d8e2003f58 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -74,6 +74,7 @@ static struct btrfs_lockdep_keyset {
{ .id = BTRFS_UUID_TREE_OBJECTID, DEFINE_NAME("uuid") },
{ .id = BTRFS_FREE_SPACE_TREE_OBJECTID, DEFINE_NAME("free-space") },
{ .id = BTRFS_BLOCK_GROUP_TREE_OBJECTID, DEFINE_NAME("block-group") },
+ { .id = BTRFS_RAID_STRIPE_TREE_OBJECTID, DEFINE_NAME("raid-stripe") },
{ .id = 0, DEFINE_NAME("tree") },
};
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index fc3c32186d7e..6d9c43416b6e 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -73,6 +73,9 @@
/* Holds the block group items for extent tree v2. */
#define BTRFS_BLOCK_GROUP_TREE_OBJECTID 11ULL
+/* Tracks RAID stripes in block groups. */
+#define BTRFS_RAID_STRIPE_TREE_OBJECTID 12ULL
+
/* device stats in the device tree */
#define BTRFS_DEV_STATS_OBJECTID 0ULL
@@ -261,6 +264,8 @@
#define BTRFS_DEV_ITEM_KEY 216
#define BTRFS_CHUNK_ITEM_KEY 228
+#define BTRFS_RAID_STRIPE_KEY 230
+
/*
* Records the overall state of the qgroups.
* There's only one instance of this key present,
@@ -719,6 +724,32 @@ struct btrfs_free_space_header {
__le64 num_bitmaps;
} __attribute__ ((__packed__));
+struct btrfs_raid_stride {
+ /* The btrfs device-id this raid extent lives on */
+ __le64 devid;
+ /* The physical location on disk */
+ __le64 physical;
+ /* The length of stride on this disk */
+ __le64 length;
+} __attribute__ ((__packed__));
+
+/* The stripe_extent::encoding, 1:1 mapping of enum btrfs_raid_types */
+#define BTRFS_STRIPE_RAID0 1
+#define BTRFS_STRIPE_RAID1 2
+#define BTRFS_STRIPE_DUP 3
+#define BTRFS_STRIPE_RAID10 4
+#define BTRFS_STRIPE_RAID5 5
+#define BTRFS_STRIPE_RAID6 6
+#define BTRFS_STRIPE_RAID1C3 7
+#define BTRFS_STRIPE_RAID1C4 8
+
+struct btrfs_stripe_extent {
+ __u8 encoding;
+ __u8 reserved[7];
+ /* An array of raid strides this stripe is composed of */
+ struct btrfs_raid_stride strides[];
+} __attribute__ ((__packed__));
+
#define BTRFS_HEADER_FLAG_WRITTEN (1ULL << 0)
#define BTRFS_HEADER_FLAG_RELOC (1ULL << 1)
--
2.41.0
* [PATCH v9 02/11] btrfs: read raid-stripe-tree from disk
2023-09-14 16:06 [PATCH v9 00/11] btrfs: introduce RAID stripe tree Johannes Thumshirn
2023-09-14 16:06 ` [PATCH v9 01/11] btrfs: add raid stripe tree definitions Johannes Thumshirn
@ 2023-09-14 16:06 ` Johannes Thumshirn
2023-09-14 16:06 ` [PATCH v9 03/11] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
` (9 subsequent siblings)
11 siblings, 0 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2023-09-14 16:06 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel, Johannes Thumshirn
If we find a raid-stripe-tree on mount, read it from disk.
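Whether the tree is read hinges on a single incompat bit. A minimal user-space sketch, assuming the bit position added by this patch (`1ULL << 14`); the `DEMO_` and `demo_` names are hypothetical stand-ins for `BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE` and `btrfs_fs_incompat()`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_INCOMPAT_ZONED            (1ULL << 12)
#define DEMO_INCOMPAT_RAID_STRIPE_TREE (1ULL << 14)

static_assert(DEMO_INCOMPAT_RAID_STRIPE_TREE == 0x4000ULL, "bit 14");

/* Return true if the given incompat flags announce a raid-stripe-tree. */
static bool demo_has_raid_stripe_tree(uint64_t incompat_flags)
{
	return (incompat_flags & DEMO_INCOMPAT_RAID_STRIPE_TREE) != 0;
}
```

A filesystem without the bit set never enters the new branch in `btrfs_read_roots()`, so `fs_info->stripe_root` stays NULL.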
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/block-rsv.c | 6 ++++++
fs/btrfs/disk-io.c | 18 ++++++++++++++++++
fs/btrfs/fs.h | 1 +
include/uapi/linux/btrfs.h | 1 +
4 files changed, 26 insertions(+)
diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
index 6a8f9629bbbd..ceb5f586a2d5 100644
--- a/fs/btrfs/block-rsv.c
+++ b/fs/btrfs/block-rsv.c
@@ -356,6 +356,11 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
min_items++;
}
+ if (btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE)) {
+ num_bytes += btrfs_root_used(&fs_info->stripe_root->root_item);
+ min_items++;
+ }
+
/*
* But we also want to reserve enough space so we can do the fallback
* global reserve for an unlink, which is an additional
@@ -407,6 +412,7 @@ void btrfs_init_root_block_rsv(struct btrfs_root *root)
case BTRFS_EXTENT_TREE_OBJECTID:
case BTRFS_FREE_SPACE_TREE_OBJECTID:
case BTRFS_BLOCK_GROUP_TREE_OBJECTID:
+ case BTRFS_RAID_STRIPE_TREE_OBJECTID:
root->block_rsv = &fs_info->delayed_refs_rsv;
break;
case BTRFS_ROOT_TREE_OBJECTID:
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 163f37ad1b27..dc577b3c53f6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1179,6 +1179,8 @@ static struct btrfs_root *btrfs_get_global_root(struct btrfs_fs_info *fs_info,
return btrfs_grab_root(fs_info->block_group_root);
case BTRFS_FREE_SPACE_TREE_OBJECTID:
return btrfs_grab_root(btrfs_global_root(fs_info, &key));
+ case BTRFS_RAID_STRIPE_TREE_OBJECTID:
+ return btrfs_grab_root(fs_info->stripe_root);
default:
return NULL;
}
@@ -1259,6 +1261,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
btrfs_put_root(fs_info->fs_root);
btrfs_put_root(fs_info->data_reloc_root);
btrfs_put_root(fs_info->block_group_root);
+ btrfs_put_root(fs_info->stripe_root);
btrfs_check_leaked_roots(fs_info);
btrfs_extent_buffer_leak_debug_check(fs_info);
kfree(fs_info->super_copy);
@@ -1804,6 +1807,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, bool free_chunk_root)
free_root_extent_buffers(info->fs_root);
free_root_extent_buffers(info->data_reloc_root);
free_root_extent_buffers(info->block_group_root);
+ free_root_extent_buffers(info->stripe_root);
if (free_chunk_root)
free_root_extent_buffers(info->chunk_root);
}
@@ -2280,6 +2284,20 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
fs_info->uuid_root = root;
}
+ if (btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE)) {
+ location.objectid = BTRFS_RAID_STRIPE_TREE_OBJECTID;
+ root = btrfs_read_tree_root(tree_root, &location);
+ if (IS_ERR(root)) {
+ if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) {
+ ret = PTR_ERR(root);
+ goto out;
+ }
+ } else {
+ set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+ fs_info->stripe_root = root;
+ }
+ }
+
return 0;
out:
btrfs_warn(fs_info, "failed to read root (objectid=%llu): %d",
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index d84a390336fc..5c7778e8b5ed 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -367,6 +367,7 @@ struct btrfs_fs_info {
struct btrfs_root *uuid_root;
struct btrfs_root *data_reloc_root;
struct btrfs_root *block_group_root;
+ struct btrfs_root *stripe_root;
/* The log root tree is a directory of all the other log roots */
struct btrfs_root *log_root_tree;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dbb8b96da50d..b9a1d9af8ae8 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -333,6 +333,7 @@ struct btrfs_ioctl_fs_info_args {
#define BTRFS_FEATURE_INCOMPAT_RAID1C34 (1ULL << 11)
#define BTRFS_FEATURE_INCOMPAT_ZONED (1ULL << 12)
#define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2 (1ULL << 13)
+#define BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE (1ULL << 14)
struct btrfs_ioctl_feature_flags {
__u64 compat_flags;
--
2.41.0
* [PATCH v9 03/11] btrfs: add support for inserting raid stripe extents
2023-09-14 16:06 [PATCH v9 00/11] btrfs: introduce RAID stripe tree Johannes Thumshirn
2023-09-14 16:06 ` [PATCH v9 01/11] btrfs: add raid stripe tree definitions Johannes Thumshirn
2023-09-14 16:06 ` [PATCH v9 02/11] btrfs: read raid-stripe-tree from disk Johannes Thumshirn
@ 2023-09-14 16:06 ` Johannes Thumshirn
2023-09-14 18:07 ` David Sterba
` (2 more replies)
2023-09-14 16:06 ` [PATCH v9 04/11] btrfs: delete stripe extent on extent deletion Johannes Thumshirn
` (8 subsequent siblings)
11 siblings, 3 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2023-09-14 16:06 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel, Johannes Thumshirn
Add support for inserting stripe extents into the raid stripe tree on
completion of every write that needs an extra logical-to-physical
translation when using RAID.
Inserting the stripe extents happens after the data I/O has completed. This
is done to a) support zone-append and b) rule out the possibility of a
RAID-write-hole.
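With zone-append the device picks the write location, so the physical address is only known once the bio completes; the end_io hunks in this patch convert the completed sector into a byte address. A tiny sketch of that conversion, assuming 512-byte sectors (the kernel's `SECTOR_SHIFT` is 9); the `DEMO_` and `demo_` names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SECTOR_SHIFT 9 /* 512-byte sectors, mirroring SECTOR_SHIFT */

static_assert((1 << DEMO_SECTOR_SHIFT) == 512, "sector size");

/* Byte address that a completed zone-append bio reports via its sector. */
static uint64_t demo_sector_to_physical(uint64_t sector)
{
	return sector << DEMO_SECTOR_SHIFT;
}
```

For instance, sector 1572864 corresponds to byte offset 805306368, the offset shown for devid 1 in the RAID0 dump of the cover letter.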
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/Makefile | 2 +-
fs/btrfs/bio.c | 23 +++++
fs/btrfs/extent-tree.c | 1 +
fs/btrfs/inode.c | 8 +-
fs/btrfs/ordered-data.c | 1 +
fs/btrfs/ordered-data.h | 2 +
fs/btrfs/raid-stripe-tree.c | 245 ++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/raid-stripe-tree.h | 34 ++++++
fs/btrfs/volumes.c | 4 +-
fs/btrfs/volumes.h | 15 +--
10 files changed, 326 insertions(+), 9 deletions(-)
diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index c57d80729d4f..525af975f61c 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -33,7 +33,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
subpage.o tree-mod-log.o extent-io-tree.o fs.o messages.o bio.o \
- lru_cache.o
+ lru_cache.o raid-stripe-tree.o
btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 31ff36990404..ddbe6f8d4ea2 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -14,6 +14,7 @@
#include "rcu-string.h"
#include "zoned.h"
#include "file-item.h"
+#include "raid-stripe-tree.h"
static struct bio_set btrfs_bioset;
static struct bio_set btrfs_clone_bioset;
@@ -415,6 +416,9 @@ static void btrfs_orig_write_end_io(struct bio *bio)
else
bio->bi_status = BLK_STS_OK;
+ if (bio_op(bio) == REQ_OP_ZONE_APPEND && !bio->bi_status)
+ stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
+
btrfs_orig_bbio_end_io(bbio);
btrfs_put_bioc(bioc);
}
@@ -426,6 +430,8 @@ static void btrfs_clone_write_end_io(struct bio *bio)
if (bio->bi_status) {
atomic_inc(&stripe->bioc->error);
btrfs_log_dev_io_error(bio, stripe->dev);
+ } else if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+ stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
}
/* Pass on control to the original bio this one was cloned from */
@@ -487,6 +493,7 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
bio->bi_private = &bioc->stripes[dev_nr];
bio->bi_iter.bi_sector = bioc->stripes[dev_nr].physical >> SECTOR_SHIFT;
bioc->stripes[dev_nr].bioc = bioc;
+ bioc->size = bio->bi_iter.bi_size;
btrfs_submit_dev_bio(bioc->stripes[dev_nr].dev, bio);
}
@@ -496,6 +503,8 @@ static void __btrfs_submit_bio(struct bio *bio, struct btrfs_io_context *bioc,
if (!bioc) {
/* Single mirror read/write fast path. */
btrfs_bio(bio)->mirror_num = mirror_num;
+ if (bio_op(bio) != REQ_OP_READ)
+ btrfs_bio(bio)->orig_physical = smap->physical;
bio->bi_iter.bi_sector = smap->physical >> SECTOR_SHIFT;
if (bio_op(bio) != REQ_OP_READ)
btrfs_bio(bio)->orig_physical = smap->physical;
@@ -688,6 +697,20 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
bio->bi_opf |= REQ_OP_ZONE_APPEND;
}
+ if (is_data_bbio(bbio) && bioc &&
+ btrfs_need_stripe_tree_update(bioc->fs_info,
+ bioc->map_type)) {
+ /*
+ * No locking for the list update, as we only add to
+ * the list in the I/O submission path, and list
+ * iteration only happens in the completion path,
+ * which can't happen until after the last submission.
+ */
+ btrfs_get_bioc(bioc);
+ list_add_tail(&bioc->ordered_entry,
+ &bbio->ordered->bioc_list);
+ }
+
/*
* Csum items for reloc roots have already been cloned at this
* point, so they are handled as part of the no-checksum case.
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index cb12bfb047e7..959d7449ea0d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -42,6 +42,7 @@
#include "file-item.h"
#include "orphan.h"
#include "tree-checker.h"
+#include "raid-stripe-tree.h"
#undef SCRAMBLE_DELAYED_REFS
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e02a5ba5b533..b5e0ed3a36f7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -71,6 +71,7 @@
#include "super.h"
#include "orphan.h"
#include "backref.h"
+#include "raid-stripe-tree.h"
struct btrfs_iget_args {
u64 ino;
@@ -3091,6 +3092,10 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
trans->block_rsv = &inode->block_rsv;
+ ret = btrfs_insert_raid_extent(trans, ordered_extent);
+ if (ret)
+ goto out;
+
if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered_extent->flags))
compress_type = ordered_extent->compress_type;
if (test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags)) {
@@ -3224,7 +3229,8 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered)
{
if (btrfs_is_zoned(btrfs_sb(ordered->inode->i_sb)) &&
- !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags))
+ !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags) &&
+ list_empty(&ordered->bioc_list))
btrfs_finish_ordered_zoned(ordered);
return btrfs_finish_one_ordered(ordered);
}
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 345c449d588c..55c7d5543265 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -191,6 +191,7 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
INIT_LIST_HEAD(&entry->log_list);
INIT_LIST_HEAD(&entry->root_extent_list);
INIT_LIST_HEAD(&entry->work_list);
+ INIT_LIST_HEAD(&entry->bioc_list);
init_completion(&entry->completion);
/*
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 173bd5c5df26..1c51ac57e5df 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -151,6 +151,8 @@ struct btrfs_ordered_extent {
struct completion completion;
struct btrfs_work flush_work;
struct list_head work_list;
+
+ struct list_head bioc_list;
};
static inline void
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
new file mode 100644
index 000000000000..7cdcc45a8796
--- /dev/null
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -0,0 +1,245 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2023 Western Digital Corporation or its affiliates.
+ */
+
+#include <linux/btrfs_tree.h>
+
+#include "ctree.h"
+#include "fs.h"
+#include "accessors.h"
+#include "transaction.h"
+#include "disk-io.h"
+#include "raid-stripe-tree.h"
+#include "volumes.h"
+#include "misc.h"
+#include "print-tree.h"
+
+static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
+ int num_stripes,
+ struct btrfs_io_context *bioc)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct btrfs_key stripe_key;
+ struct btrfs_root *stripe_root = fs_info->stripe_root;
+ u8 encoding = btrfs_bg_flags_to_raid_index(bioc->map_type);
+ struct btrfs_stripe_extent *stripe_extent;
+ const size_t item_size = struct_size(stripe_extent, strides, num_stripes);
+ int ret;
+
+ stripe_extent = kzalloc(item_size, GFP_NOFS);
+ if (!stripe_extent) {
+ btrfs_abort_transaction(trans, -ENOMEM);
+ btrfs_end_transaction(trans);
+ return -ENOMEM;
+ }
+
+ btrfs_set_stack_stripe_extent_encoding(stripe_extent, encoding);
+ for (int i = 0; i < num_stripes; i++) {
+ u64 devid = bioc->stripes[i].dev->devid;
+ u64 physical = bioc->stripes[i].physical;
+ u64 length = bioc->stripes[i].length;
+ struct btrfs_raid_stride *raid_stride =
+ &stripe_extent->strides[i];
+
+ if (length == 0)
+ length = bioc->size;
+
+ btrfs_set_stack_raid_stride_devid(raid_stride, devid);
+ btrfs_set_stack_raid_stride_physical(raid_stride, physical);
+ btrfs_set_stack_raid_stride_length(raid_stride, length);
+ }
+
+ stripe_key.objectid = bioc->logical;
+ stripe_key.type = BTRFS_RAID_STRIPE_KEY;
+ stripe_key.offset = bioc->size;
+
+ ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
+ item_size);
+ if (ret)
+ btrfs_abort_transaction(trans, ret);
+
+ kfree(stripe_extent);
+
+ return ret;
+}
+
+static int btrfs_insert_mirrored_raid_extents(struct btrfs_trans_handle *trans,
+ struct btrfs_ordered_extent *ordered,
+ u64 map_type)
+{
+ int num_stripes = btrfs_bg_type_to_factor(map_type);
+ struct btrfs_io_context *bioc;
+ int ret;
+
+ list_for_each_entry(bioc, &ordered->bioc_list, ordered_entry) {
+ ret = btrfs_insert_one_raid_extent(trans, num_stripes, bioc);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+static int btrfs_insert_striped_mirrored_raid_extents(
+ struct btrfs_trans_handle *trans,
+ struct btrfs_ordered_extent *ordered,
+ u64 map_type)
+{
+ struct btrfs_io_context *bioc;
+ struct btrfs_io_context *rbioc;
+ const int nstripes = list_count_nodes(&ordered->bioc_list);
+ const int index = btrfs_bg_flags_to_raid_index(map_type);
+ const int substripes = btrfs_raid_array[index].sub_stripes;
+ const int max_stripes =
+ trans->fs_info->fs_devices->rw_devices / substripes;
+ int left = nstripes;
+ int i;
+ int ret = 0;
+ u64 stripe_end;
+ u64 prev_end;
+
+ if (nstripes == 1)
+ return btrfs_insert_mirrored_raid_extents(trans, ordered, map_type);
+
+ rbioc = kzalloc(struct_size(rbioc, stripes, nstripes * substripes),
+ GFP_NOFS);
+ if (!rbioc)
+ return -ENOMEM;
+
+ rbioc->map_type = map_type;
+ rbioc->logical = list_first_entry(&ordered->bioc_list, typeof(*rbioc),
+ ordered_entry)->logical;
+
+ stripe_end = rbioc->logical;
+ prev_end = stripe_end;
+ i = 0;
+ list_for_each_entry(bioc, &ordered->bioc_list, ordered_entry) {
+
+ rbioc->size += bioc->size;
+ for (int j = 0; j < substripes; j++) {
+ int stripe = i + j;
+ rbioc->stripes[stripe].dev = bioc->stripes[j].dev;
+ rbioc->stripes[stripe].physical = bioc->stripes[j].physical;
+ rbioc->stripes[stripe].length = bioc->size;
+ }
+
+ stripe_end += rbioc->size;
+ if (i >= nstripes ||
+ (stripe_end - prev_end >= max_stripes * BTRFS_STRIPE_LEN)) {
+ ret = btrfs_insert_one_raid_extent(trans,
+ nstripes * substripes,
+ rbioc);
+ if (ret)
+ goto out;
+
+ left -= nstripes;
+ i = 0;
+ rbioc->logical += rbioc->size;
+ rbioc->size = 0;
+ } else {
+ i += substripes;
+ prev_end = stripe_end;
+ }
+ }
+
+ if (left) {
+ bioc = list_prev_entry(bioc, ordered_entry);
+ ret = btrfs_insert_one_raid_extent(trans, substripes, bioc);
+ }
+
+out:
+ kfree(rbioc);
+ return ret;
+}
+
+static int btrfs_insert_striped_raid_extents(struct btrfs_trans_handle *trans,
+ struct btrfs_ordered_extent *ordered,
+ u64 map_type)
+{
+ struct btrfs_io_context *bioc;
+ struct btrfs_io_context *rbioc;
+ const int nstripes = list_count_nodes(&ordered->bioc_list);
+ int i;
+ int ret = 0;
+
+ rbioc = kzalloc(struct_size(rbioc, stripes, nstripes), GFP_NOFS);
+ if (!rbioc)
+ return -ENOMEM;
+ rbioc->map_type = map_type;
+ rbioc->logical = list_first_entry(&ordered->bioc_list, typeof(*rbioc),
+ ordered_entry)->logical;
+
+ i = 0;
+ list_for_each_entry(bioc, &ordered->bioc_list, ordered_entry) {
+ rbioc->size += bioc->size;
+ rbioc->stripes[i].dev = bioc->stripes[0].dev;
+ rbioc->stripes[i].physical = bioc->stripes[0].physical;
+ rbioc->stripes[i].length = bioc->size;
+
+ if (i == nstripes - 1) {
+ ret = btrfs_insert_one_raid_extent(trans, nstripes, rbioc);
+ if (ret)
+ goto out;
+
+ i = 0;
+ rbioc->logical += rbioc->size;
+ rbioc->size = 0;
+ } else {
+ i++;
+ }
+ }
+
+ if (i && i < nstripes - 1)
+ ret = btrfs_insert_one_raid_extent(trans, i, rbioc);
+
+out:
+ kfree(rbioc);
+ return ret;
+}
+
+int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
+ struct btrfs_ordered_extent *ordered_extent)
+{
+ struct btrfs_io_context *bioc;
+ u64 map_type;
+ int ret;
+
+ if (!trans->fs_info->stripe_root)
+ return 0;
+
+ map_type = list_first_entry(&ordered_extent->bioc_list, typeof(*bioc),
+ ordered_entry)->map_type;
+
+ switch (map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
+ case BTRFS_BLOCK_GROUP_DUP:
+ case BTRFS_BLOCK_GROUP_RAID1:
+ case BTRFS_BLOCK_GROUP_RAID1C3:
+ case BTRFS_BLOCK_GROUP_RAID1C4:
+ ret = btrfs_insert_mirrored_raid_extents(trans, ordered_extent,
+ map_type);
+ break;
+ case BTRFS_BLOCK_GROUP_RAID0:
+ ret = btrfs_insert_striped_raid_extents(trans, ordered_extent,
+ map_type);
+ break;
+ case BTRFS_BLOCK_GROUP_RAID10:
+ ret = btrfs_insert_striped_mirrored_raid_extents(trans, ordered_extent, map_type);
+ break;
+ default:
+ btrfs_err(trans->fs_info, "unknown block-group profile %lld",
+ map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
+ ASSERT(0);
+ ret = -EINVAL;
+ break;
+ }
+
+ while (!list_empty(&ordered_extent->bioc_list)) {
+ bioc = list_first_entry(&ordered_extent->bioc_list,
+ typeof(*bioc), ordered_entry);
+ list_del(&bioc->ordered_entry);
+ btrfs_put_bioc(bioc);
+ }
+
+ return ret;
+}
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
new file mode 100644
index 000000000000..884f0e99d5e8
--- /dev/null
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2023 Western Digital Corporation or its affiliates.
+ */
+
+#ifndef BTRFS_RAID_STRIPE_TREE_H
+#define BTRFS_RAID_STRIPE_TREE_H
+
+struct btrfs_io_context;
+struct btrfs_io_stripe;
+struct btrfs_ordered_extent;
+struct btrfs_trans_handle;
+
+int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
+ struct btrfs_ordered_extent *ordered_extent);
+
+static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
+ u64 map_type)
+{
+ u64 type = map_type & BTRFS_BLOCK_GROUP_TYPE_MASK;
+ u64 profile = map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
+
+ if (!fs_info->stripe_root)
+ return false;
+
+ if (type != BTRFS_BLOCK_GROUP_DATA)
+ return false;
+
+ if (profile & BTRFS_BLOCK_GROUP_RAID1_MASK)
+ return true;
+
+ return false;
+}
+#endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a1eae8b5b412..c2bac87912c7 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5984,6 +5984,7 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
}
static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
+ u64 logical,
u16 total_stripes)
{
struct btrfs_io_context *bioc;
@@ -6003,6 +6004,7 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
bioc->fs_info = fs_info;
bioc->replace_stripe_src = -1;
bioc->full_stripe_logical = (u64)-1;
+ bioc->logical = logical;
return bioc;
}
@@ -6537,7 +6539,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
goto out;
}
- bioc = alloc_btrfs_io_context(fs_info, num_alloc_stripes);
+ bioc = alloc_btrfs_io_context(fs_info, logical, num_alloc_stripes);
if (!bioc) {
ret = -ENOMEM;
goto out;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 26397adc8706..2043aff6e966 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -390,12 +390,11 @@ struct btrfs_fs_devices {
struct btrfs_io_stripe {
struct btrfs_device *dev;
- union {
- /* Block mapping */
- u64 physical;
- /* For the endio handler */
- struct btrfs_io_context *bioc;
- };
+ /* Block mapping */
+ u64 physical;
+ u64 length;
+ /* For the endio handler */
+ struct btrfs_io_context *bioc;
};
struct btrfs_discard_stripe {
@@ -428,6 +427,10 @@ struct btrfs_io_context {
atomic_t error;
u16 max_errors;
+ u64 logical;
+ u64 size;
+ struct list_head ordered_entry;
+
/*
* The total number of stripes, including the extra duplicated
* stripe for replace.
--
2.41.0
* [PATCH v9 04/11] btrfs: delete stripe extent on extent deletion
From: Johannes Thumshirn @ 2023-09-14 16:06 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel, Johannes Thumshirn
As each stripe extent is tied to an extent item, delete the stripe extent
once the corresponding extent item is deleted.
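The coupling described above can be sketched in user space. This is a
simplified model, not the kernel implementation: the sorted array stands in
for the stripe tree, and the names struct stripe_extent and
delete_stripe_extents() are hypothetical. It drops every stripe extent that
lies fully inside the freed logical range.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for one stripe tree item key. */
struct stripe_extent {
	uint64_t start;   /* logical start, mirrors key.objectid */
	uint64_t length;  /* logical length, mirrors key.offset */
};

/*
 * Remove every stripe extent fully contained in [start, start + length)
 * from an array sorted by start. Returns the new number of entries.
 */
static size_t delete_stripe_extents(struct stripe_extent *items, size_t nr,
				    uint64_t start, uint64_t length)
{
	const uint64_t end = start + length;
	size_t kept = 0;

	for (size_t i = 0; i < nr; i++) {
		const uint64_t found_start = items[i].start;
		const uint64_t found_end = found_start + items[i].length;

		/* Keep entries that the freed range does not fully cover. */
		if (found_start < start || found_end > end)
			items[kept++] = items[i];
	}

	return kept;
}
```

The patch itself walks the B-tree with btrfs_search_slot() and
btrfs_del_item() rather than compacting an array; the sketch only shows
which entries are expected to disappear when an extent is freed.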
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/extent-tree.c | 6 +++++
fs/btrfs/raid-stripe-tree.c | 60 +++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/raid-stripe-tree.h | 2 ++
3 files changed, 68 insertions(+)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 959d7449ea0d..27859c7773ce 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2860,6 +2860,12 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
btrfs_abort_transaction(trans, ret);
return ret;
}
+
+ ret = btrfs_delete_raid_extent(trans, bytenr, num_bytes);
+ if (ret) {
+ btrfs_abort_transaction(trans, ret);
+ return ret;
+ }
}
ret = add_to_free_space_tree(trans, bytenr, num_bytes);
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 7cdcc45a8796..517bc08803f1 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -15,6 +15,66 @@
#include "misc.h"
#include "print-tree.h"
+int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
+ u64 length)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct btrfs_root *stripe_root = fs_info->stripe_root;
+ struct btrfs_path *path;
+ struct btrfs_key key;
+ struct extent_buffer *leaf;
+ u64 found_start;
+ u64 found_end;
+ u64 end = start + length;
+ int slot;
+ int ret;
+
+ if (!stripe_root)
+ return 0;
+
+ path = btrfs_alloc_path();
+ if (!path)
+ return -ENOMEM;
+
+ while (1) {
+
+ key.objectid = start;
+ key.type = BTRFS_RAID_STRIPE_KEY;
+ key.offset = length;
+
+ ret = btrfs_search_slot(trans, stripe_root, &key, path, -1, 1);
+ if (ret < 0)
+ break;
+ if (ret > 0) {
+ ret = 0;
+ if (path->slots[0] == 0)
+ break;
+ path->slots[0]--;
+ }
+
+ leaf = path->nodes[0];
+ slot = path->slots[0];
+ btrfs_item_key_to_cpu(leaf, &key, slot);
+ found_start = key.objectid;
+ found_end = found_start + key.offset;
+
+ /* That stripe ends before we start, we're done */
+ if (found_end <= start)
+ break;
+
+ ASSERT(found_start >= start && found_end <= end);
+ ret = btrfs_del_item(trans, stripe_root, path);
+ if (ret)
+ break;
+
+ btrfs_release_path(path);
+ }
+
+ btrfs_free_path(path);
+ return ret;
+
+}
+
static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
int num_stripes,
struct btrfs_io_context *bioc)
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index 884f0e99d5e8..b3a127c997c8 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -11,6 +11,8 @@ struct btrfs_io_stripe;
struct btrfs_ordered_extent;
struct btrfs_trans_handle;
+int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
+ u64 length);
int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
struct btrfs_ordered_extent *ordered_extent);
--
2.41.0
* [PATCH v9 05/11] btrfs: lookup physical address from stripe extent
From: Johannes Thumshirn @ 2023-09-14 16:07 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel, Johannes Thumshirn
Look up the physical address in the raid stripe tree when a read is
attempted on a RAID volume formatted with the raid stripe tree.
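As a rough illustration of that lookup, here is a user-space sketch under
simplifying assumptions: the stripe tree is a plain array, every stride
holds a full copy of the extent (a RAID1-like layout), and the names
lookup_physical(), struct stripe_extent and struct stride are hypothetical.
It finds the stripe extent covering the logical address, trims the length
at the extent boundary, and returns the physical address on the requested
device.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_STRIDES 4

/* Hypothetical per-device entry of a stripe extent. */
struct stride {
	uint64_t devid;
	uint64_t physical;
};

/* Hypothetical stripe extent covering [logical, logical + length). */
struct stripe_extent {
	uint64_t logical;
	uint64_t length;
	size_t nr_strides;
	struct stride strides[MAX_STRIDES];
};

/*
 * Translate a logical address to a physical one on device devid.
 * On success *physical is set and *length is shortened so the read does
 * not cross the end of the stripe extent. Returns 0 or -ENOENT.
 */
static int lookup_physical(const struct stripe_extent *items, size_t nr,
			   uint64_t logical, uint64_t *length,
			   uint64_t devid, uint64_t *physical)
{
	const uint64_t end = logical + *length;

	for (size_t i = 0; i < nr; i++) {
		const struct stripe_extent *se = &items[i];
		const uint64_t found_end = se->logical + se->length;

		if (logical < se->logical || logical >= found_end)
			continue;

		/* The caller has to split the I/O at the extent boundary. */
		if (end > found_end)
			*length -= end - found_end;

		for (size_t j = 0; j < se->nr_strides; j++) {
			if (se->strides[j].devid != devid)
				continue;
			*physical = se->strides[j].physical +
				    (logical - se->logical);
			return 0;
		}
		return -ENOENT;
	}

	return -ENOENT;
}
```

With the stripe extent from the cover letter (logical 1073741824, length
131072, devid 2 at 536870912), a read starting 4096 bytes into the extent
would map to physical 536875008 on devid 2.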
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/raid-stripe-tree.c | 130 ++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/raid-stripe-tree.h | 11 ++++
fs/btrfs/volumes.c | 37 ++++++++++---
3 files changed, 169 insertions(+), 9 deletions(-)
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 517bc08803f1..697a6e1fd255 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -303,3 +303,133 @@ int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
return ret;
}
+
+int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
+ u64 logical, u64 *length, u64 map_type,
+ u32 stripe_index,
+ struct btrfs_io_stripe *stripe)
+{
+ struct btrfs_root *stripe_root = fs_info->stripe_root;
+ struct btrfs_stripe_extent *stripe_extent;
+ struct btrfs_key stripe_key;
+ struct btrfs_key found_key;
+ struct btrfs_path *path;
+ struct extent_buffer *leaf;
+ const u64 end = logical + *length;
+ int num_stripes;
+ u8 encoding;
+ u64 offset;
+ u64 found_logical;
+ u64 found_length;
+ u64 found_end;
+ int slot;
+ int ret;
+ int i;
+
+ stripe_key.objectid = logical;
+ stripe_key.type = BTRFS_RAID_STRIPE_KEY;
+ stripe_key.offset = 0;
+
+ path = btrfs_alloc_path();
+ if (!path)
+ return -ENOMEM;
+
+ ret = btrfs_search_slot(NULL, stripe_root, &stripe_key, path, 0, 0);
+ if (ret < 0)
+ goto free_path;
+ if (ret) {
+ if (path->slots[0] != 0)
+ path->slots[0]--;
+ }
+
+
+ while (1) {
+ leaf = path->nodes[0];
+ slot = path->slots[0];
+
+ btrfs_item_key_to_cpu(leaf, &found_key, slot);
+ found_logical = found_key.objectid;
+ found_length = found_key.offset;
+ found_end = found_logical + found_length;
+
+ if (found_logical > end) {
+ ret = -ENOENT;
+ goto out;
+ }
+
+ if (in_range(logical, found_logical, found_length))
+ break;
+
+ ret = btrfs_next_item(stripe_root, path);
+ if (ret)
+ goto out;
+ }
+
+ offset = logical - found_logical;
+
+ /*
+ * If we have a logically contiguous, but physically non-contiguous
+ * range, we need to split the bio. Record the length after which we
+ * must split the bio.
+ */
+ if (end > found_end)
+ *length -= end - found_end;
+
+ num_stripes = btrfs_num_raid_stripes(btrfs_item_size(leaf, slot));
+ stripe_extent = btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
+ encoding = btrfs_stripe_extent_encoding(leaf, stripe_extent);
+
+ if (encoding != btrfs_bg_flags_to_raid_index(map_type)) {
+ ret = -EUCLEAN;
+ btrfs_handle_fs_error(fs_info, ret,
+ "on-disk stripe encoding %d doesn't match RAID index %d",
+ encoding,
+ btrfs_bg_flags_to_raid_index(map_type));
+ goto out;
+ }
+
+ for (i = 0; i < num_stripes; i++) {
+ struct btrfs_raid_stride *stride = &stripe_extent->strides[i];
+ u64 devid = btrfs_raid_stride_devid(leaf, stride);
+ u64 len = btrfs_raid_stride_length(leaf, stride);
+ u64 physical = btrfs_raid_stride_physical(leaf, stride);
+
+ if (offset >= len) {
+ offset -= len;
+
+ if (offset >= BTRFS_STRIPE_LEN)
+ continue;
+ }
+
+ if (devid != stripe->dev->devid)
+ continue;
+
+ if ((map_type & BTRFS_BLOCK_GROUP_DUP) && stripe_index != i)
+ continue;
+
+ stripe->physical = physical + offset;
+
+ ret = 0;
+ goto free_path;
+ }
+
+ /*
+ * If we're here, we haven't found the requested devid in the stripe.
+ */
+ ret = -ENOENT;
+out:
+ if (ret > 0)
+ ret = -ENOENT;
+ if (ret && ret != -EIO) {
+ if (IS_ENABLED(CONFIG_BTRFS_DEBUG))
+ btrfs_print_tree(leaf, 1);
+ btrfs_err(fs_info,
+ "cannot find raid-stripe for logical [%llu, %llu] devid %llu, profile %s",
+ logical, logical + *length, stripe->dev->devid,
+ btrfs_bg_type_to_raid_name(map_type));
+ }
+free_path:
+ btrfs_free_path(path);
+
+ return ret;
+}
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index b3a127c997c8..5d9629a815c1 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -13,6 +13,10 @@ struct btrfs_trans_handle;
int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
u64 length);
+int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
+ u64 logical, u64 *length, u64 map_type,
+ u32 stripe_index,
+ struct btrfs_io_stripe *stripe);
int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
struct btrfs_ordered_extent *ordered_extent);
@@ -33,4 +37,11 @@ static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
return false;
}
+
+static inline int btrfs_num_raid_stripes(u32 item_size)
+{
+ return (item_size - offsetof(struct btrfs_stripe_extent, strides)) /
+ sizeof(struct btrfs_raid_stride);
+}
+
#endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c2bac87912c7..2326dbcf85f6 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -35,6 +35,7 @@
#include "relocation.h"
#include "scrub.h"
#include "super.h"
+#include "raid-stripe-tree.h"
#define BTRFS_BLOCK_GROUP_STRIPE_MASK (BTRFS_BLOCK_GROUP_RAID0 | \
BTRFS_BLOCK_GROUP_RAID10 | \
@@ -6309,12 +6310,22 @@ static u64 btrfs_max_io_len(struct map_lookup *map, enum btrfs_map_op op,
return U64_MAX;
}
-static void set_io_stripe(struct btrfs_io_stripe *dst, const struct map_lookup *map,
- u32 stripe_index, u64 stripe_offset, u32 stripe_nr)
+static int set_io_stripe(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
+ u64 logical, u64 *length, struct btrfs_io_stripe *dst,
+ struct map_lookup *map, u32 stripe_index,
+ u64 stripe_offset, u64 stripe_nr)
{
dst->dev = map->stripes[stripe_index].dev;
+
+ if (op == BTRFS_MAP_READ &&
+ btrfs_need_stripe_tree_update(fs_info, map->type))
+ return btrfs_get_raid_extent_offset(fs_info, logical, length,
+ map->type, stripe_index,
+ dst);
+
dst->physical = map->stripes[stripe_index].physical +
stripe_offset + btrfs_stripe_nr_to_offset(stripe_nr);
+ return 0;
}
/*
@@ -6531,11 +6542,11 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
*/
if (smap && num_alloc_stripes == 1 &&
!((map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) && mirror_num > 1)) {
- set_io_stripe(smap, map, stripe_index, stripe_offset, stripe_nr);
+ ret = set_io_stripe(fs_info, op, logical, length, smap, map,
+ stripe_index, stripe_offset, stripe_nr);
if (mirror_num_ret)
*mirror_num_ret = mirror_num;
*bioc_ret = NULL;
- ret = 0;
goto out;
}
@@ -6566,21 +6577,29 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
bioc->full_stripe_logical = em->start +
btrfs_stripe_nr_to_offset(stripe_nr * data_stripes);
for (i = 0; i < num_stripes; i++)
- set_io_stripe(&bioc->stripes[i], map,
- (i + stripe_nr) % num_stripes,
- stripe_offset, stripe_nr);
+ ret = set_io_stripe(fs_info, op, logical, length,
+ &bioc->stripes[i], map,
+ (i + stripe_nr) % num_stripes,
+ stripe_offset, stripe_nr);
} else {
/*
* For all other non-RAID56 profiles, just copy the target
* stripe into the bioc.
*/
for (i = 0; i < num_stripes; i++) {
- set_io_stripe(&bioc->stripes[i], map, stripe_index,
- stripe_offset, stripe_nr);
+ ret = set_io_stripe(fs_info, op, logical, length,
+ &bioc->stripes[i], map, stripe_index,
+ stripe_offset, stripe_nr);
stripe_index++;
}
}
+ if (ret) {
+ *bioc_ret = NULL;
+ btrfs_put_bioc(bioc);
+ goto out;
+ }
+
if (op != BTRFS_MAP_READ)
max_errors = btrfs_chunk_max_errors(map);
--
2.41.0
* [PATCH v9 06/11] btrfs: implement RST version of scrub
From: Johannes Thumshirn @ 2023-09-14 16:07 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel, Johannes Thumshirn
A filesystem that uses the RAID stripe tree for logical to physical
address translation can't use the regular scrub path, which reads all
stripes and only afterwards checks whether a sector is unused.
When using the RAID stripe tree, this will result in lookup errors, as the
stripe tree doesn't know the requested logical addresses.
Instead, look up stripes that are backed by the extent bitmap.
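The bitmap-driven read can be pictured with a small user-space sketch. It
is an approximation under stated assumptions: the extent sector bitmap fits
in one 64-bit word, and build_reads() and struct read_range are
hypothetical names. Each run of consecutive set bits becomes one read,
split once it reaches max_sectors, so unused sectors are never looked up.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one read submitted by scrub. */
struct read_range {
	unsigned int first_sector;
	unsigned int nr_sectors;
};

/*
 * Turn the set bits of bitmap into read ranges. A new range starts after
 * every gap and whenever the current range has reached max_sectors.
 * Returns the number of ranges written to out (at most out_cap).
 */
static size_t build_reads(uint64_t bitmap, unsigned int nr_sectors,
			  unsigned int max_sectors,
			  struct read_range *out, size_t out_cap)
{
	size_t nr = 0;
	bool open = false;

	for (unsigned int i = 0; i < nr_sectors && i < 64; i++) {
		if (!(bitmap & (UINT64_C(1) << i))) {
			/* Unused sector: close the current range. */
			open = false;
			continue;
		}

		if (open && out[nr - 1].nr_sectors < max_sectors) {
			out[nr - 1].nr_sectors++;
			continue;
		}

		if (nr == out_cap)
			break;

		out[nr].first_sector = i;
		out[nr].nr_sectors = 1;
		nr++;
		open = true;
	}

	return nr;
}
```

For a bitmap of 0xE7 (sectors 0-2 and 5-7 in use) and no size cap smaller
than the run length, this yields two reads and skips sectors 3 and 4.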
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/bio.c | 2 ++
fs/btrfs/raid-stripe-tree.c | 8 ++++++-
fs/btrfs/scrub.c | 53 +++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/volumes.h | 1 +
4 files changed, 63 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index ddbe6f8d4ea2..bdb6e3effdbb 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -663,6 +663,8 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
blk_status_t ret;
int error;
+ smap.is_scrub = !bbio->inode;
+
btrfs_bio_counter_inc_blocked(fs_info);
error = btrfs_map_block(fs_info, btrfs_op(bio), logical, &map_length,
&bioc, &smap, &mirror_num, 1);
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 697a6e1fd255..63bf62c33436 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -334,6 +334,11 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
if (!path)
return -ENOMEM;
+ if (stripe->is_scrub) {
+ path->skip_locking = 1;
+ path->search_commit_root = 1;
+ }
+
ret = btrfs_search_slot(NULL, stripe_root, &stripe_key, path, 0, 0);
if (ret < 0)
goto free_path;
@@ -420,7 +425,8 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
out:
if (ret > 0)
ret = -ENOENT;
- if (ret && ret != -EIO) {
+ if (ret && ret != -EIO && !stripe->is_scrub) {
+
if (IS_ENABLED(CONFIG_BTRFS_DEBUG))
btrfs_print_tree(leaf, 1);
btrfs_err(fs_info,
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index f16220ce5fba..42948b66d4be 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -23,6 +23,7 @@
#include "accessors.h"
#include "file-item.h"
#include "scrub.h"
+#include "raid-stripe-tree.h"
/*
* This is only the first step towards a full-features scrub. It reads all
@@ -1634,6 +1635,53 @@ static void scrub_reset_stripe(struct scrub_stripe *stripe)
}
}
+static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
+ struct scrub_stripe *stripe)
+{
+ struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
+ struct btrfs_bio *bbio = NULL;
+ int mirror = stripe->mirror_num;
+ int i;
+
+ atomic_inc(&stripe->pending_io);
+
+ for_each_set_bit(i, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
+ struct page *page = scrub_stripe_get_page(stripe, i);
+ unsigned int pgoff = scrub_stripe_get_page_offset(stripe, i);
+
+ /* The current sector cannot be merged, submit the bio. */
+ if (bbio &&
+ ((i > 0 && !test_bit(i - 1, &stripe->extent_sector_bitmap)) ||
+ bbio->bio.bi_iter.bi_size >= BTRFS_STRIPE_LEN)) {
+ ASSERT(bbio->bio.bi_iter.bi_size);
+ atomic_inc(&stripe->pending_io);
+ btrfs_submit_bio(bbio, mirror);
+ bbio = NULL;
+ }
+
+ if (!bbio) {
+ bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
+ fs_info, scrub_read_endio, stripe);
+ bbio->bio.bi_iter.bi_sector = (stripe->logical +
+ (i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT;
+ }
+
+ __bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff);
+ }
+
+ if (bbio) {
+ ASSERT(bbio->bio.bi_iter.bi_size);
+ atomic_inc(&stripe->pending_io);
+ btrfs_submit_bio(bbio, mirror);
+ }
+
+ if (atomic_dec_and_test(&stripe->pending_io)) {
+ wake_up(&stripe->io_wait);
+ INIT_WORK(&stripe->work, scrub_stripe_read_repair_worker);
+ queue_work(stripe->bg->fs_info->scrub_workers, &stripe->work);
+ }
+}
+
static void scrub_submit_initial_read(struct scrub_ctx *sctx,
struct scrub_stripe *stripe)
{
@@ -1645,6 +1693,11 @@ static void scrub_submit_initial_read(struct scrub_ctx *sctx,
ASSERT(stripe->mirror_num > 0);
ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state));
+ if (btrfs_need_stripe_tree_update(fs_info, stripe->bg->flags)) {
+ scrub_submit_extent_sector_read(sctx, stripe);
+ return;
+ }
+
bbio = btrfs_bio_alloc(SCRUB_STRIPE_PAGES, REQ_OP_READ, fs_info,
scrub_read_endio, stripe);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 2043aff6e966..067859de8f4c 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -393,6 +393,7 @@ struct btrfs_io_stripe {
/* Block mapping */
u64 physical;
u64 length;
+ bool is_scrub;
/* For the endio handler */
struct btrfs_io_context *bioc;
};
--
2.41.0
* [PATCH v9 07/11] btrfs: zoned: allow zoned RAID
From: Johannes Thumshirn @ 2023-09-14 16:07 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel, Johannes Thumshirn
When we have a raid-stripe-tree, we can do RAID0/1/10 on zoned devices for
data block-groups. For meta-data block-groups, we don't actually need
anything special, as all meta-data I/O is protected by the
btrfs_zoned_meta_io_lock() already.
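A rough user-space model of how the per-device zone state might be folded
into one block group for each profile: mirrored profiles take the smallest
non-zero capacity and the first zone's write pointer, RAID0 sums every
zone, and RAID10 sums one zone per mirror set. The names summarize_zones(),
struct zone_state and struct bg_state are hypothetical, and missing or
conventional zones, zone activation and degraded mounts are left out.

```c
#include <stddef.h>
#include <stdint.h>

enum raid_profile { PROFILE_RAID0, PROFILE_RAID1, PROFILE_RAID10 };

/* Hypothetical per-device zone state. */
struct zone_state {
	uint64_t capacity;
	uint64_t alloc_offset;
};

/* Hypothetical block group summary derived from its zones. */
struct bg_state {
	uint64_t zone_capacity;
	uint64_t alloc_offset;
};

static struct bg_state summarize_zones(enum raid_profile profile,
				       const struct zone_state *zones,
				       size_t nr, size_t sub_stripes)
{
	struct bg_state bg = { 0, 0 };

	/* Only RAID10 uses sub_stripes; avoid a modulo by zero. */
	if (sub_stripes == 0)
		sub_stripes = 1;

	for (size_t i = 0; i < nr; i++) {
		switch (profile) {
		case PROFILE_RAID1:
			/* Every mirror holds the same data. */
			if (zones[i].capacity &&
			    (!bg.zone_capacity ||
			     zones[i].capacity < bg.zone_capacity))
				bg.zone_capacity = zones[i].capacity;
			if (i == 0)
				bg.alloc_offset = zones[i].alloc_offset;
			break;
		case PROFILE_RAID0:
			/* Every zone contributes distinct data. */
			bg.zone_capacity += zones[i].capacity;
			bg.alloc_offset += zones[i].alloc_offset;
			break;
		case PROFILE_RAID10:
			/* Count one zone per mirror set. */
			if (i % sub_stripes == 0) {
				bg.zone_capacity += zones[i].capacity;
				bg.alloc_offset += zones[i].alloc_offset;
			}
			break;
		}
	}

	return bg;
}
```

Note that the RAID1 helper in this patch compares only the first two zones'
capacities; the sketch generalizes that to all mirrors for brevity.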
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/raid-stripe-tree.h | 7 ++-
fs/btrfs/volumes.c | 2 +
fs/btrfs/zoned.c | 144 ++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 148 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index 5d9629a815c1..f31292ab9030 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -6,6 +6,11 @@
#ifndef BTRFS_RAID_STRIPE_TREE_H
#define BTRFS_RAID_STRIPE_TREE_H
+#define BTRFS_RST_SUPP_BLOCK_GROUP_MASK (BTRFS_BLOCK_GROUP_DUP |\
+ BTRFS_BLOCK_GROUP_RAID1_MASK |\
+ BTRFS_BLOCK_GROUP_RAID0 |\
+ BTRFS_BLOCK_GROUP_RAID10)
+
struct btrfs_io_context;
struct btrfs_io_stripe;
struct btrfs_ordered_extent;
@@ -32,7 +37,7 @@ static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
if (type != BTRFS_BLOCK_GROUP_DATA)
return false;
- if (profile & BTRFS_BLOCK_GROUP_RAID1_MASK)
+ if (profile & BTRFS_RST_SUPP_BLOCK_GROUP_MASK)
return true;
return false;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2326dbcf85f6..dc311e38eb11 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6541,6 +6541,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
* I/O context structure.
*/
if (smap && num_alloc_stripes == 1 &&
+ !(btrfs_need_stripe_tree_update(fs_info, map->type) &&
+ op != BTRFS_MAP_READ) &&
!((map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) && mirror_num > 1)) {
ret = set_io_stripe(fs_info, op, logical, length, smap, map,
stripe_index, stripe_offset, stripe_nr);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index d05510cb2cb2..ce2846c944d2 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1397,9 +1397,11 @@ static int btrfs_load_block_group_dup(struct btrfs_block_group *bg,
struct zone_info *zone_info,
unsigned long *active)
{
- if (map->type & BTRFS_BLOCK_GROUP_DATA) {
- btrfs_err(bg->fs_info,
- "zoned: profile DUP not yet supported on data bg");
+ struct btrfs_fs_info *fs_info = bg->fs_info;
+
+ if (map->type & BTRFS_BLOCK_GROUP_DATA &&
+ !fs_info->stripe_root) {
+ btrfs_err(fs_info, "zoned: data DUP profile needs stripe_root");
return -EINVAL;
}
@@ -1433,6 +1435,133 @@ static int btrfs_load_block_group_dup(struct btrfs_block_group *bg,
return 0;
}
+static int btrfs_load_block_group_raid1(struct btrfs_block_group *bg,
+ struct map_lookup *map,
+ struct zone_info *zone_info,
+ unsigned long *active)
+{
+ struct btrfs_fs_info *fs_info = bg->fs_info;
+ int i;
+
+ if (map->type & BTRFS_BLOCK_GROUP_DATA &&
+ !fs_info->stripe_root) {
+ btrfs_err(fs_info, "zoned: data %s needs stripe_root",
+ btrfs_bg_type_to_raid_name(map->type));
+ return -EINVAL;
+
+ }
+
+ for (i = 0; i < map->num_stripes; i++) {
+ if (zone_info[i].alloc_offset == WP_MISSING_DEV ||
+ zone_info[i].alloc_offset == WP_CONVENTIONAL)
+ continue;
+
+ if ((zone_info[0].alloc_offset != zone_info[i].alloc_offset) &&
+ !btrfs_test_opt(fs_info, DEGRADED)) {
+ btrfs_err(fs_info,
+ "zoned: write pointer offset mismatch of zones in %s profile",
+ btrfs_bg_type_to_raid_name(map->type));
+ return -EIO;
+ }
+ if (test_bit(0, active) != test_bit(i, active)) {
+ if (!btrfs_test_opt(fs_info, DEGRADED) &&
+ !btrfs_zone_activate(bg)) {
+ return -EIO;
+ }
+ } else {
+ if (test_bit(0, active))
+ set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
+ &bg->runtime_flags);
+ }
+ /*
+ * In case a device is missing we have a cap of 0, so don't
+ * use it.
+ */
+ bg->zone_capacity = min_not_zero(zone_info[0].capacity,
+ zone_info[1].capacity);
+ }
+
+ if (zone_info[0].alloc_offset != WP_MISSING_DEV)
+ bg->alloc_offset = zone_info[0].alloc_offset;
+ else
+ bg->alloc_offset = zone_info[i - 1].alloc_offset;
+
+ return 0;
+}
+
+static int btrfs_load_block_group_raid0(struct btrfs_block_group *bg,
+ struct map_lookup *map,
+ struct zone_info *zone_info,
+ unsigned long *active)
+{
+ struct btrfs_fs_info *fs_info = bg->fs_info;
+
+ if (map->type & BTRFS_BLOCK_GROUP_DATA &&
+ !fs_info->stripe_root) {
+ btrfs_err(fs_info, "zoned: data %s needs stripe_root",
+ btrfs_bg_type_to_raid_name(map->type));
+ return -EINVAL;
+
+ }
+
+ for (int i = 0; i < map->num_stripes; i++) {
+ if (zone_info[i].alloc_offset == WP_MISSING_DEV ||
+ zone_info[i].alloc_offset == WP_CONVENTIONAL)
+ continue;
+
+ if (test_bit(0, active) != test_bit(i, active)) {
+ if (!btrfs_zone_activate(bg))
+ return -EIO;
+ } else {
+ if (test_bit(0, active))
+ set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
+ &bg->runtime_flags);
+ }
+ bg->zone_capacity += zone_info[i].capacity;
+ bg->alloc_offset += zone_info[i].alloc_offset;
+ }
+
+ return 0;
+}
+
+static int btrfs_load_block_group_raid10(struct btrfs_block_group *bg,
+ struct map_lookup *map,
+ struct zone_info *zone_info,
+ unsigned long *active)
+{
+ struct btrfs_fs_info *fs_info = bg->fs_info;
+
+ if (map->type & BTRFS_BLOCK_GROUP_DATA &&
+ !fs_info->stripe_root) {
+ btrfs_err(fs_info, "zoned: data %s needs stripe_root",
+ btrfs_bg_type_to_raid_name(map->type));
+ return -EINVAL;
+
+ }
+
+ for (int i = 0; i < map->num_stripes; i++) {
+ if (zone_info[i].alloc_offset == WP_MISSING_DEV ||
+ zone_info[i].alloc_offset == WP_CONVENTIONAL)
+ continue;
+
+ if (test_bit(0, active) != test_bit(i, active)) {
+ if (!btrfs_zone_activate(bg))
+ return -EIO;
+ } else {
+ if (test_bit(0, active))
+ set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
+ &bg->runtime_flags);
+ }
+
+ if ((i % map->sub_stripes) == 0) {
+ bg->zone_capacity += zone_info[i].capacity;
+ bg->alloc_offset += zone_info[i].alloc_offset;
+ }
+ }
+
+ return 0;
+}
+
int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
{
struct btrfs_fs_info *fs_info = cache->fs_info;
@@ -1525,11 +1654,18 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
ret = btrfs_load_block_group_dup(cache, map, zone_info, active);
break;
case BTRFS_BLOCK_GROUP_RAID1:
+ case BTRFS_BLOCK_GROUP_RAID1C3:
+ case BTRFS_BLOCK_GROUP_RAID1C4:
+ ret = btrfs_load_block_group_raid1(cache, map, zone_info, active);
+ break;
case BTRFS_BLOCK_GROUP_RAID0:
+ ret = btrfs_load_block_group_raid0(cache, map, zone_info, active);
+ break;
case BTRFS_BLOCK_GROUP_RAID10:
+ ret = btrfs_load_block_group_raid10(cache, map, zone_info, active);
+ break;
case BTRFS_BLOCK_GROUP_RAID5:
case BTRFS_BLOCK_GROUP_RAID6:
- /* non-single profiles are not supported yet */
default:
btrfs_err(fs_info, "zoned: profile %s not yet supported",
btrfs_bg_type_to_raid_name(map->type));
--
2.41.0
* [PATCH v9 08/11] btrfs: add raid stripe tree pretty printer
From: Johannes Thumshirn @ 2023-09-14 16:07 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel, Johannes Thumshirn
Decode raid-stripe-tree entries on btrfs_print_tree().
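The printer derives the number of strides from the item size. A minimal
user-space sketch of that arithmetic follows; the struct layout (a one-byte
encoding, seven reserved bytes, then 24-byte strides) is an assumption
chosen to match the itemsize 56 shown for two stripes in the cover letter,
and the unprefixed names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed on-disk stride: device id, physical offset, length. */
struct raid_stride {
	uint64_t devid;
	uint64_t physical;
	uint64_t length;
};

/* Assumed on-disk stripe extent: 8-byte header followed by strides. */
struct stripe_extent {
	uint8_t encoding;
	uint8_t reserved[7];
	struct raid_stride strides[];
};

/* Number of strides that fit in an item of item_size bytes. */
static int num_raid_stripes(uint32_t item_size)
{
	return (int)((item_size - offsetof(struct stripe_extent, strides)) /
		     sizeof(struct raid_stride));
}
```

Under these assumptions an item of 56 bytes decodes to (56 - 8) / 24 = 2
strides, which matches the two "stripe" lines in the RAID0 dump.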
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/print-tree.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
index 0c93439e929f..f3f487fc6400 100644
--- a/fs/btrfs/print-tree.c
+++ b/fs/btrfs/print-tree.c
@@ -9,6 +9,8 @@
#include "print-tree.h"
#include "accessors.h"
#include "tree-checker.h"
+#include "volumes.h"
+#include "raid-stripe-tree.h"
struct root_name_map {
u64 id;
@@ -28,6 +30,7 @@ static const struct root_name_map root_map[] = {
{ BTRFS_FREE_SPACE_TREE_OBJECTID, "FREE_SPACE_TREE" },
{ BTRFS_BLOCK_GROUP_TREE_OBJECTID, "BLOCK_GROUP_TREE" },
{ BTRFS_DATA_RELOC_TREE_OBJECTID, "DATA_RELOC_TREE" },
+ { BTRFS_RAID_STRIPE_TREE_OBJECTID, "RAID_STRIPE_TREE" },
};
const char *btrfs_root_name(const struct btrfs_key *key, char *buf)
@@ -189,6 +192,24 @@ static void print_uuid_item(const struct extent_buffer *l, unsigned long offset,
}
}
+static void print_raid_stripe_key(const struct extent_buffer *eb, u32 item_size,
+ struct btrfs_stripe_extent *stripe)
+{
+ const int num_stripes = btrfs_num_raid_stripes(item_size);
+ const u8 encoding = btrfs_stripe_extent_encoding(eb, stripe);
+ int i;
+
+ pr_info("\t\t\tencoding: %s\n",
+ (encoding && encoding < BTRFS_NR_RAID_TYPES) ?
+ btrfs_raid_array[encoding].raid_name : "unknown");
+
+ for (i = 0; i < num_stripes; i++)
+ pr_info("\t\t\tstride %d devid %llu physical %llu length %llu\n",
+ i, btrfs_raid_stride_devid(eb, &stripe->strides[i]),
+ btrfs_raid_stride_physical(eb, &stripe->strides[i]),
+ btrfs_raid_stride_length(eb, &stripe->strides[i]));
+}
+
/*
* Helper to output refs and locking status of extent buffer. Useful to debug
* race condition related problems.
@@ -349,6 +370,11 @@ void btrfs_print_leaf(const struct extent_buffer *l)
print_uuid_item(l, btrfs_item_ptr_offset(l, i),
btrfs_item_size(l, i));
break;
+ case BTRFS_RAID_STRIPE_KEY:
+ print_raid_stripe_key(l, btrfs_item_size(l, i),
+ btrfs_item_ptr(l, i,
+ struct btrfs_stripe_extent));
+ break;
}
}
}
--
2.41.0
* [PATCH v9 09/11] btrfs: announce presence of raid-stripe-tree in sysfs
From: Johannes Thumshirn @ 2023-09-14 16:07 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel, Johannes Thumshirn
If a filesystem with a raid-stripe-tree is mounted, show the RST feature
in sysfs.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/sysfs.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index b1d1ac25237b..1bab3d7d251e 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -297,6 +297,8 @@ BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
#ifdef CONFIG_BTRFS_DEBUG
/* Remove once support for extent tree v2 is feature complete */
BTRFS_FEAT_ATTR_INCOMPAT(extent_tree_v2, EXTENT_TREE_V2);
+/* Remove once support for raid stripe tree is feature complete */
+BTRFS_FEAT_ATTR_INCOMPAT(raid_stripe_tree, RAID_STRIPE_TREE);
#endif
#ifdef CONFIG_FS_VERITY
BTRFS_FEAT_ATTR_COMPAT_RO(verity, VERITY);
@@ -327,6 +329,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
#endif
#ifdef CONFIG_BTRFS_DEBUG
BTRFS_FEAT_ATTR_PTR(extent_tree_v2),
+ BTRFS_FEAT_ATTR_PTR(raid_stripe_tree),
#endif
#ifdef CONFIG_FS_VERITY
BTRFS_FEAT_ATTR_PTR(verity),
--
2.41.0
* [PATCH v9 10/11] btrfs: add trace events for RST
From: Johannes Thumshirn @ 2023-09-14 16:07 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel, Johannes Thumshirn
Add trace events for raid-stripe-tree operations.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/raid-stripe-tree.c | 8 +++++
include/trace/events/btrfs.h | 75 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 83 insertions(+)
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 63bf62c33436..ee4155377bf9 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -62,6 +62,9 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
if (found_end <= start)
break;
+ trace_btrfs_raid_extent_delete(fs_info, start, end,
+ found_start, found_end);
+
ASSERT(found_start >= start && found_end <= end);
ret = btrfs_del_item(trans, stripe_root, path);
if (ret)
@@ -94,6 +97,8 @@ static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
return -ENOMEM;
}
+ trace_btrfs_insert_one_raid_extent(fs_info, bioc->logical, bioc->size,
+ num_stripes);
btrfs_set_stack_stripe_extent_encoding(stripe_extent, encoding);
for (int i = 0; i < num_stripes; i++) {
u64 devid = bioc->stripes[i].dev->devid;
@@ -414,6 +419,9 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
stripe->physical = physical + offset;
+ trace_btrfs_get_raid_extent_offset(fs_info, logical, *length,
+ stripe->physical, devid);
+
ret = 0;
goto free_path;
}
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index b2db2c2f1c57..fcf246f84547 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2497,6 +2497,81 @@ DEFINE_EVENT(btrfs_raid56_bio, raid56_write,
TP_ARGS(rbio, bio, trace_info)
);
+TRACE_EVENT(btrfs_insert_one_raid_extent,
+
+ TP_PROTO(const struct btrfs_fs_info *fs_info, u64 logical, u64 length,
+ int num_stripes),
+
+ TP_ARGS(fs_info, logical, length, num_stripes),
+
+ TP_STRUCT__entry_btrfs(
+ __field( u64, logical )
+ __field( u64, length )
+ __field( int, num_stripes )
+ ),
+
+ TP_fast_assign_btrfs(fs_info,
+ __entry->logical = logical;
+ __entry->length = length;
+ __entry->num_stripes = num_stripes;
+ ),
+
+ TP_printk_btrfs("logical=%llu length=%llu num_stripes=%d",
+ __entry->logical, __entry->length,
+ __entry->num_stripes)
+);
+
+TRACE_EVENT(btrfs_raid_extent_delete,
+
+ TP_PROTO(const struct btrfs_fs_info *fs_info, u64 start, u64 end,
+ u64 found_start, u64 found_end),
+
+ TP_ARGS(fs_info, start, end, found_start, found_end),
+
+ TP_STRUCT__entry_btrfs(
+ __field( u64, start )
+ __field( u64, end )
+ __field( u64, found_start )
+ __field( u64, found_end )
+ ),
+
+ TP_fast_assign_btrfs(fs_info,
+ __entry->start = start;
+ __entry->end = end;
+ __entry->found_start = found_start;
+ __entry->found_end = found_end;
+ ),
+
+ TP_printk_btrfs("start=%llu end=%llu found_start=%llu found_end=%llu",
+ __entry->start, __entry->end, __entry->found_start,
+ __entry->found_end)
+);
+
+TRACE_EVENT(btrfs_get_raid_extent_offset,
+
+ TP_PROTO(const struct btrfs_fs_info *fs_info, u64 logical, u64 length,
+ u64 physical, u64 devid),
+
+ TP_ARGS(fs_info, logical, length, physical, devid),
+
+ TP_STRUCT__entry_btrfs(
+ __field( u64, logical )
+ __field( u64, length )
+ __field( u64, physical )
+ __field( u64, devid )
+ ),
+
+ TP_fast_assign_btrfs(fs_info,
+ __entry->logical = logical;
+ __entry->length = length;
+ __entry->physical = physical;
+ __entry->devid = devid;
+ ),
+
+ TP_printk_btrfs("logical=%llu length=%llu physical=%llu devid=%llu",
+ __entry->logical, __entry->length, __entry->physical,
+ __entry->devid)
+);
#endif /* _TRACE_BTRFS_H */
/* This part must be outside protection */
--
2.41.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH v9 11/11] btrfs: add raid-stripe-tree to features enabled with debug
2023-09-14 16:06 [PATCH v9 00/11] btrfs: introduce RAID stripe tree Johannes Thumshirn
` (9 preceding siblings ...)
2023-09-14 16:07 ` [PATCH v9 10/11] btrfs: add trace events for RST Johannes Thumshirn
@ 2023-09-14 16:07 ` Johannes Thumshirn
2023-09-14 18:25 ` [PATCH v9 00/11] btrfs: introduce RAID stripe tree David Sterba
11 siblings, 0 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2023-09-14 16:07 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel, Johannes Thumshirn
Until the RAID stripe tree code is well enough tested and feature
complete, "hide" it behind CONFIG_BTRFS_DEBUG so only people who
want to use it are actually using it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/fs.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 5c7778e8b5ed..0f5894e2bdeb 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -223,7 +223,8 @@ enum {
*/
#define BTRFS_FEATURE_INCOMPAT_SUPP \
(BTRFS_FEATURE_INCOMPAT_SUPP_STABLE | \
- BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2)
+ BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2 |\
+ BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE)
#else
--
2.41.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH v9 05/11] btrfs: lookup physical address from stripe extent
2023-09-14 16:07 ` [PATCH v9 05/11] btrfs: lookup physical address from stripe extent Johannes Thumshirn
@ 2023-09-14 17:57 ` David Sterba
0 siblings, 0 replies; 29+ messages in thread
From: David Sterba @ 2023-09-14 17:57 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Chris Mason, Josef Bacik, David Sterba, Christoph Hellwig,
Naohiro Aota, Qu Wenruo, Damien Le Moal, linux-btrfs,
linux-kernel
On Thu, Sep 14, 2023 at 09:07:00AM -0700, Johannes Thumshirn wrote:
> Look up the physical address from the raid stripe tree when a read on a
> RAID volume formatted with the raid stripe tree is attempted.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/raid-stripe-tree.c | 130 ++++++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/raid-stripe-tree.h | 11 ++++
> fs/btrfs/volumes.c | 37 ++++++++++---
> 3 files changed, 169 insertions(+), 9 deletions(-)
>
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index 517bc08803f1..697a6e1fd255 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -303,3 +303,133 @@ int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
>
> return ret;
> }
> +
> +int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
> + u64 logical, u64 *length, u64 map_type,
> + u32 stripe_index,
> + struct btrfs_io_stripe *stripe)
> +{
> + struct btrfs_root *stripe_root = fs_info->stripe_root;
> + struct btrfs_stripe_extent *stripe_extent;
> + struct btrfs_key stripe_key;
> + struct btrfs_key found_key;
> + struct btrfs_path *path;
> + struct extent_buffer *leaf;
> + const u64 end = logical + *length;
> + int num_stripes;
> + u8 encoding;
> + u64 offset;
> + u64 found_logical;
> + u64 found_length;
> + u64 found_end;
> + int slot;
> + int ret;
> + int i;
> +
> + stripe_key.objectid = logical;
> + stripe_key.type = BTRFS_RAID_STRIPE_KEY;
> + stripe_key.offset = 0;
> +
> + path = btrfs_alloc_path();
> + if (!path)
> + return -ENOMEM;
> +
> + ret = btrfs_search_slot(NULL, stripe_root, &stripe_key, path, 0, 0);
> + if (ret < 0)
> + goto free_path;
> + if (ret) {
> + if (path->slots[0] != 0)
> + path->slots[0]--;
> + }
> +
> +
> + while (1) {
> + leaf = path->nodes[0];
> + slot = path->slots[0];
> +
> + btrfs_item_key_to_cpu(leaf, &found_key, slot);
> + found_logical = found_key.objectid;
> + found_length = found_key.offset;
> + found_end = found_logical + found_length;
> +
> + if (found_logical > end) {
> + ret = -ENOENT;
> + goto out;
> + }
> +
> + if (in_range(logical, found_logical, found_length))
> + break;
> +
> + ret = btrfs_next_item(stripe_root, path);
> + if (ret)
> + goto out;
> + }
> +
> + offset = logical - found_logical;
> +
> + /*
> + * If we have a logically contiguous, but physically noncontinuous
> + * range, we need to split the bio. Record the length after which we
> + * must split the bio.
> + */
> + if (end > found_end)
> + *length -= end - found_end;
> +
> + num_stripes = btrfs_num_raid_stripes(btrfs_item_size(leaf, slot));
> + stripe_extent = btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
> + encoding = btrfs_stripe_extent_encoding(leaf, stripe_extent);
> +
> + if (encoding != btrfs_bg_flags_to_raid_index(map_type)) {
> + ret = -EUCLEAN;
> + btrfs_handle_fs_error(fs_info, ret,
> + "on-disk stripe encoding %d doesn't match RAID index %d",
> + encoding,
> + btrfs_bg_flags_to_raid_index(map_type));
> + goto out;
> + }
> +
> + for (i = 0; i < num_stripes; i++) {
> + struct btrfs_raid_stride *stride = &stripe_extent->strides[i];
> + u64 devid = btrfs_raid_stride_devid(leaf, stride);
> + u64 len = btrfs_raid_stride_length(leaf, stride);
> + u64 physical = btrfs_raid_stride_physical(leaf, stride);
> +
> + if (offset >= len) {
> + offset -= len;
> +
> + if (offset >= BTRFS_STRIPE_LEN)
> + continue;
> + }
> +
> + if (devid != stripe->dev->devid)
> + continue;
> +
> + if ((map_type & BTRFS_BLOCK_GROUP_DUP) && stripe_index != i)
> + continue;
> +
> + stripe->physical = physical + offset;
> +
> + ret = 0;
> + goto free_path;
> + }
> +
> + /*
> + * If we're here, we haven't found the requested devid in the stripe.
> + */
> + ret = -ENOENT;
> +out:
> + if (ret > 0)
> + ret = -ENOENT;
> + if (ret && ret != -EIO) {
> + if (IS_ENABLED(CONFIG_BTRFS_DEBUG))
> + btrfs_print_tree(leaf, 1);
> + btrfs_err(fs_info,
> + "cannot find raid-stripe for logical [%llu, %llu] devid %llu, profile %s",
> + logical, logical + *length, stripe->dev->devid,
> + btrfs_bg_type_to_raid_name(map_type));
> + }
> +free_path:
> + btrfs_free_path(path);
> +
> + return ret;
> +}
> diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
> index b3a127c997c8..5d9629a815c1 100644
> --- a/fs/btrfs/raid-stripe-tree.h
> +++ b/fs/btrfs/raid-stripe-tree.h
> @@ -13,6 +13,10 @@ struct btrfs_trans_handle;
>
> int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
> u64 length);
> +int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
> + u64 logical, u64 *length, u64 map_type,
> + u32 stripe_index,
> + struct btrfs_io_stripe *stripe);
> int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
> struct btrfs_ordered_extent *ordered_extent);
>
> @@ -33,4 +37,11 @@ static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
>
> return false;
> }
> +
> +static inline int btrfs_num_raid_stripes(u32 item_size)
> +{
> + return (item_size - offsetof(struct btrfs_stripe_extent, strides)) /
> + sizeof(struct btrfs_raid_stride);
> +}
> +
> #endif
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index c2bac87912c7..2326dbcf85f6 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -35,6 +35,7 @@
> #include "relocation.h"
> #include "scrub.h"
> #include "super.h"
> +#include "raid-stripe-tree.h"
>
> #define BTRFS_BLOCK_GROUP_STRIPE_MASK (BTRFS_BLOCK_GROUP_RAID0 | \
> BTRFS_BLOCK_GROUP_RAID10 | \
> @@ -6309,12 +6310,22 @@ static u64 btrfs_max_io_len(struct map_lookup *map, enum btrfs_map_op op,
> return U64_MAX;
> }
>
> -static void set_io_stripe(struct btrfs_io_stripe *dst, const struct map_lookup *map,
> - u32 stripe_index, u64 stripe_offset, u32 stripe_nr)
> +static int set_io_stripe(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> + u64 logical, u64 *length, struct btrfs_io_stripe *dst,
> + struct map_lookup *map, u32 stripe_index,
> + u64 stripe_offset, u64 stripe_nr)
> {
> dst->dev = map->stripes[stripe_index].dev;
> +
> + if (op == BTRFS_MAP_READ &&
> + btrfs_need_stripe_tree_update(fs_info, map->type))
> + return btrfs_get_raid_extent_offset(fs_info, logical, length,
> + map->type, stripe_index,
> + dst);
> +
> dst->physical = map->stripes[stripe_index].physical +
> stripe_offset + btrfs_stripe_nr_to_offset(stripe_nr);
> + return 0;
> }
>
> /*
> @@ -6531,11 +6542,11 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> */
> if (smap && num_alloc_stripes == 1 &&
> !((map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) && mirror_num > 1)) {
> - set_io_stripe(smap, map, stripe_index, stripe_offset, stripe_nr);
> + ret = set_io_stripe(fs_info, op, logical, length, smap, map,
> + stripe_index, stripe_offset, stripe_nr);
> if (mirror_num_ret)
> *mirror_num_ret = mirror_num;
> *bioc_ret = NULL;
> - ret = 0;
> goto out;
> }
>
> @@ -6566,21 +6577,29 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> bioc->full_stripe_logical = em->start +
> btrfs_stripe_nr_to_offset(stripe_nr * data_stripes);
> for (i = 0; i < num_stripes; i++)
> - set_io_stripe(&bioc->stripes[i], map,
> - (i + stripe_nr) % num_stripes,
> - stripe_offset, stripe_nr);
> + ret = set_io_stripe(fs_info, op, logical, length,
> + &bioc->stripes[i], map,
> + (i + stripe_nr) % num_stripes,
> + stripe_offset, stripe_nr);
You've added an error return value, but it's not checked and the whole
for-loop continues. Is that intentional?
> } else {
> /*
> * For all other non-RAID56 profiles, just copy the target
> * stripe into the bioc.
> */
> for (i = 0; i < num_stripes; i++) {
> - set_io_stripe(&bioc->stripes[i], map, stripe_index,
> - stripe_offset, stripe_nr);
> + ret = set_io_stripe(fs_info, op, logical, length,
> + &bioc->stripes[i], map, stripe_index,
> + stripe_offset, stripe_nr);
Same here.
> stripe_index++;
> }
> }
>
> + if (ret) {
> + *bioc_ret = NULL;
> + btrfs_put_bioc(bioc);
> + goto out;
> + }
This handles ret != 0, but only after the whole loop is done.
> +
> if (op != BTRFS_MAP_READ)
> max_errors = btrfs_chunk_max_errors(map);
>
>
> --
> 2.41.0
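One way to address the unchecked return value raised in the comments above
is to leave the loop at the first failure, so that a later successful call
cannot overwrite an earlier error. The following is a minimal userspace
sketch of that pattern, not the actual fix: fake_set_io_stripe() and
map_all_stripes() are hypothetical stand-ins for set_io_stripe() and the
loops in btrfs_map_block().

```c
#include <errno.h>

/* Hypothetical stand-in for set_io_stripe(): fails for one stripe index. */
static int fake_set_io_stripe(int index, int failing_index)
{
	return index == failing_index ? -ENOENT : 0;
}

/*
 * Map all stripes, stopping at the first error. *mapped counts the calls
 * that succeeded before the loop ended.
 */
static int map_all_stripes(int num_stripes, int failing_index, int *mapped)
{
	int ret = 0;

	*mapped = 0;
	for (int i = 0; i < num_stripes; i++) {
		ret = fake_set_io_stripe(i, failing_index);
		if (ret)
			break;	/* keep the first error, skip the rest */
		(*mapped)++;
	}

	return ret;
}
```

With the break in place, the error check that follows the loop sees the
first failure rather than the result of the last iteration.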
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v9 07/11] btrfs: zoned: allow zoned RAID
2023-09-14 16:07 ` [PATCH v9 07/11] btrfs: zoned: allow zoned RAID Johannes Thumshirn
@ 2023-09-14 17:59 ` David Sterba
0 siblings, 0 replies; 29+ messages in thread
From: David Sterba @ 2023-09-14 17:59 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Chris Mason, Josef Bacik, David Sterba, Christoph Hellwig,
Naohiro Aota, Qu Wenruo, Damien Le Moal, linux-btrfs,
linux-kernel
On Thu, Sep 14, 2023 at 09:07:02AM -0700, Johannes Thumshirn wrote:
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -1397,9 +1397,11 @@ static int btrfs_load_block_group_dup(struct btrfs_block_group *bg,
> struct zone_info *zone_info,
> unsigned long *active)
> {
> - if (map->type & BTRFS_BLOCK_GROUP_DATA) {
> - btrfs_err(bg->fs_info,
> - "zoned: profile DUP not yet supported on data bg");
> + struct btrfs_fs_info *fs_info = bg->fs_info;
> +
> + if (map->type & BTRFS_BLOCK_GROUP_DATA &&
> + !fs_info->stripe_root) {
> + btrfs_err(fs_info, "zoned: data DUP profile needs stripe_root");
Using stripe_root as an identifier is OK so we don't have overly long ones,
but for user messages please use raid-stripe-tree. Fixed.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v9 03/11] btrfs: add support for inserting raid stripe extents
2023-09-14 16:06 ` [PATCH v9 03/11] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
@ 2023-09-14 18:07 ` David Sterba
2023-09-15 10:03 ` Geert Uytterhoeven
2023-09-14 18:10 ` David Sterba
2023-09-15 0:55 ` Qu Wenruo
2 siblings, 1 reply; 29+ messages in thread
From: David Sterba @ 2023-09-14 18:07 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Chris Mason, Josef Bacik, David Sterba, Christoph Hellwig,
Naohiro Aota, Qu Wenruo, Damien Le Moal, linux-btrfs,
linux-kernel
On Thu, Sep 14, 2023 at 09:06:58AM -0700, Johannes Thumshirn wrote:
> Add support for inserting stripe extents into the raid stripe tree on
> completion of every write that needs an extra logical-to-physical
> translation when using RAID.
>
> Inserting the stripe extents happens after the data I/O has completed;
> this is done to a) support zone-append and b) rule out the possibility of
> a RAID-write-hole.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/Makefile | 2 +-
> fs/btrfs/bio.c | 23 +++++
> fs/btrfs/extent-tree.c | 1 +
> fs/btrfs/inode.c | 8 +-
> fs/btrfs/ordered-data.c | 1 +
> fs/btrfs/ordered-data.h | 2 +
> fs/btrfs/raid-stripe-tree.c | 245 ++++++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/raid-stripe-tree.h | 34 ++++++
> fs/btrfs/volumes.c | 4 +-
> fs/btrfs/volumes.h | 15 +--
> 10 files changed, 326 insertions(+), 9 deletions(-)
>
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index c57d80729d4f..525af975f61c 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -33,7 +33,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
> uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
> block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
> subpage.o tree-mod-log.o extent-io-tree.o fs.o messages.o bio.o \
> - lru_cache.o
> + lru_cache.o raid-stripe-tree.o
>
> btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
> btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 31ff36990404..ddbe6f8d4ea2 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -14,6 +14,7 @@
> #include "rcu-string.h"
> #include "zoned.h"
> #include "file-item.h"
> +#include "raid-stripe-tree.h"
>
> static struct bio_set btrfs_bioset;
> static struct bio_set btrfs_clone_bioset;
> @@ -415,6 +416,9 @@ static void btrfs_orig_write_end_io(struct bio *bio)
> else
> bio->bi_status = BLK_STS_OK;
>
> + if (bio_op(bio) == REQ_OP_ZONE_APPEND && !bio->bi_status)
> + stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> +
> btrfs_orig_bbio_end_io(bbio);
> btrfs_put_bioc(bioc);
> }
> @@ -426,6 +430,8 @@ static void btrfs_clone_write_end_io(struct bio *bio)
> if (bio->bi_status) {
> atomic_inc(&stripe->bioc->error);
> btrfs_log_dev_io_error(bio, stripe->dev);
> + } else if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> + stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> }
>
> /* Pass on control to the original bio this one was cloned from */
> @@ -487,6 +493,7 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
> bio->bi_private = &bioc->stripes[dev_nr];
> bio->bi_iter.bi_sector = bioc->stripes[dev_nr].physical >> SECTOR_SHIFT;
> bioc->stripes[dev_nr].bioc = bioc;
> + bioc->size = bio->bi_iter.bi_size;
> btrfs_submit_dev_bio(bioc->stripes[dev_nr].dev, bio);
> }
>
> @@ -496,6 +503,8 @@ static void __btrfs_submit_bio(struct bio *bio, struct btrfs_io_context *bioc,
> if (!bioc) {
> /* Single mirror read/write fast path. */
> btrfs_bio(bio)->mirror_num = mirror_num;
> + if (bio_op(bio) != REQ_OP_READ)
> + btrfs_bio(bio)->orig_physical = smap->physical;
> bio->bi_iter.bi_sector = smap->physical >> SECTOR_SHIFT;
> if (bio_op(bio) != REQ_OP_READ)
> btrfs_bio(bio)->orig_physical = smap->physical;
> @@ -688,6 +697,20 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
> bio->bi_opf |= REQ_OP_ZONE_APPEND;
> }
>
> + if (is_data_bbio(bbio) && bioc &&
> + btrfs_need_stripe_tree_update(bioc->fs_info,
> + bioc->map_type)) {
> + /*
> + * No locking for the list update, as we only add to
> + * the list in the I/O submission path, and list
> + * iteration only happens in the completion path,
> + * which can't happen until after the last submission.
> + */
> + btrfs_get_bioc(bioc);
> + list_add_tail(&bioc->ordered_entry,
> + &bbio->ordered->bioc_list);
> + }
> +
> /*
> * Csum items for reloc roots have already been cloned at this
> * point, so they are handled as part of the no-checksum case.
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index cb12bfb047e7..959d7449ea0d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -42,6 +42,7 @@
> #include "file-item.h"
> #include "orphan.h"
> #include "tree-checker.h"
> +#include "raid-stripe-tree.h"
>
> #undef SCRAMBLE_DELAYED_REFS
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e02a5ba5b533..b5e0ed3a36f7 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -71,6 +71,7 @@
> #include "super.h"
> #include "orphan.h"
> #include "backref.h"
> +#include "raid-stripe-tree.h"
>
> struct btrfs_iget_args {
> u64 ino;
> @@ -3091,6 +3092,10 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
>
> trans->block_rsv = &inode->block_rsv;
>
> + ret = btrfs_insert_raid_extent(trans, ordered_extent);
> + if (ret)
> + goto out;
> +
> if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered_extent->flags))
> compress_type = ordered_extent->compress_type;
> if (test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags)) {
> @@ -3224,7 +3229,8 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
> int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered)
> {
> if (btrfs_is_zoned(btrfs_sb(ordered->inode->i_sb)) &&
> - !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags))
> + !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags) &&
> + list_empty(&ordered->bioc_list))
> btrfs_finish_ordered_zoned(ordered);
> return btrfs_finish_one_ordered(ordered);
> }
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 345c449d588c..55c7d5543265 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -191,6 +191,7 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> INIT_LIST_HEAD(&entry->log_list);
> INIT_LIST_HEAD(&entry->root_extent_list);
> INIT_LIST_HEAD(&entry->work_list);
> + INIT_LIST_HEAD(&entry->bioc_list);
> init_completion(&entry->completion);
>
> /*
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index 173bd5c5df26..1c51ac57e5df 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -151,6 +151,8 @@ struct btrfs_ordered_extent {
> struct completion completion;
> struct btrfs_work flush_work;
> struct list_head work_list;
> +
> + struct list_head bioc_list;
> };
>
> static inline void
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> new file mode 100644
> index 000000000000..7cdcc45a8796
> --- /dev/null
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -0,0 +1,245 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2023 Western Digital Corporation or its affiliates.
> + */
> +
> +#include <linux/btrfs_tree.h>
> +
> +#include "ctree.h"
> +#include "fs.h"
> +#include "accessors.h"
> +#include "transaction.h"
> +#include "disk-io.h"
> +#include "raid-stripe-tree.h"
> +#include "volumes.h"
> +#include "misc.h"
> +#include "print-tree.h"
> +
> +static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
> + int num_stripes,
> + struct btrfs_io_context *bioc)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_key stripe_key;
> + struct btrfs_root *stripe_root = fs_info->stripe_root;
> + u8 encoding = btrfs_bg_flags_to_raid_index(bioc->map_type);
> + struct btrfs_stripe_extent *stripe_extent;
> + const size_t item_size = struct_size(stripe_extent, strides, num_stripes);
> + int ret;
> +
> + stripe_extent = kzalloc(item_size, GFP_NOFS);
> + if (!stripe_extent) {
> + btrfs_abort_transaction(trans, -ENOMEM);
> + btrfs_end_transaction(trans);
> + return -ENOMEM;
> + }
> +
> + btrfs_set_stack_stripe_extent_encoding(stripe_extent, encoding);
> + for (int i = 0; i < num_stripes; i++) {
> + u64 devid = bioc->stripes[i].dev->devid;
> + u64 physical = bioc->stripes[i].physical;
> + u64 length = bioc->stripes[i].length;
> + struct btrfs_raid_stride *raid_stride =
> + &stripe_extent->strides[i];
> +
> + if (length == 0)
> + length = bioc->size;
> +
> + btrfs_set_stack_raid_stride_devid(raid_stride, devid);
> + btrfs_set_stack_raid_stride_physical(raid_stride, physical);
> + btrfs_set_stack_raid_stride_length(raid_stride, length);
> + }
> +
> + stripe_key.objectid = bioc->logical;
> + stripe_key.type = BTRFS_RAID_STRIPE_KEY;
> + stripe_key.offset = bioc->size;
> +
> + ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
> + item_size);
> + if (ret)
> + btrfs_abort_transaction(trans, ret);
> +
> + kfree(stripe_extent);
> +
> + return ret;
> +}
> +
> +static int btrfs_insert_mirrored_raid_extents(struct btrfs_trans_handle *trans,
> + struct btrfs_ordered_extent *ordered,
> + u64 map_type)
> +{
> + int num_stripes = btrfs_bg_type_to_factor(map_type);
> + struct btrfs_io_context *bioc;
> + int ret;
> +
> + list_for_each_entry(bioc, &ordered->bioc_list, ordered_entry) {
> + ret = btrfs_insert_one_raid_extent(trans, num_stripes, bioc);
> + if (ret)
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +static int btrfs_insert_striped_mirrored_raid_extents(
> + struct btrfs_trans_handle *trans,
> + struct btrfs_ordered_extent *ordered,
> + u64 map_type)
> +{
> + struct btrfs_io_context *bioc;
> + struct btrfs_io_context *rbioc;
> + const int nstripes = list_count_nodes(&ordered->bioc_list);
> + const int index = btrfs_bg_flags_to_raid_index(map_type);
> + const int substripes = btrfs_raid_array[index].sub_stripes;
> + const int max_stripes =
> + trans->fs_info->fs_devices->rw_devices / substripes;
This will probably warn due to u64/u32 division.
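For context on the comment above: the kernel provides div_u64() in
<linux/math64.h> for dividing a 64-bit value by a 32-bit one, which avoids
an open-coded 64-bit '/' on 32-bit architectures. The sketch below is a
userspace illustration only. div_u64_sketch() is a stand-in for the kernel
helper, max_stripes_for() is a hypothetical name, and the assumption that
rw_devices is a 64-bit counter is taken from the review comment.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace stand-in for the kernel's div_u64(): 64-bit dividend,
 * 32-bit divisor, 64-bit quotient.
 */
static inline uint64_t div_u64_sketch(uint64_t dividend, uint32_t divisor)
{
	return dividend / divisor;
}

/*
 * Hypothetical helper mirroring the max_stripes computation: the number
 * of striped sets that fit on rw_devices devices with the given number
 * of sub-stripes. sub_stripes must be positive.
 */
static int max_stripes_for(uint64_t rw_devices, int sub_stripes)
{
	assert(sub_stripes > 0);

	return (int)div_u64_sketch(rw_devices, (uint32_t)sub_stripes);
}
```

For example, four writable devices with two sub-stripes give two stripes.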
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v9 03/11] btrfs: add support for inserting raid stripe extents
2023-09-14 16:06 ` [PATCH v9 03/11] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
2023-09-14 18:07 ` David Sterba
@ 2023-09-14 18:10 ` David Sterba
2023-09-15 0:55 ` Qu Wenruo
2 siblings, 0 replies; 29+ messages in thread
From: David Sterba @ 2023-09-14 18:10 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Chris Mason, Josef Bacik, David Sterba, Christoph Hellwig,
Naohiro Aota, Qu Wenruo, Damien Le Moal, linux-btrfs,
linux-kernel
On Thu, Sep 14, 2023 at 09:06:58AM -0700, Johannes Thumshirn wrote:
> + map_type = list_first_entry(&ordered_extent->bioc_list, typeof(*bioc),
> + ordered_entry)->map_type;
> +
> + switch (map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
> + case BTRFS_BLOCK_GROUP_DUP:
> + case BTRFS_BLOCK_GROUP_RAID1:
> + case BTRFS_BLOCK_GROUP_RAID1C3:
> + case BTRFS_BLOCK_GROUP_RAID1C4:
> + ret = btrfs_insert_mirrored_raid_extents(trans, ordered_extent,
> + map_type);
> + break;
> + case BTRFS_BLOCK_GROUP_RAID0:
> + ret = btrfs_insert_striped_raid_extents(trans, ordered_extent,
> + map_type);
> + break;
> + case BTRFS_BLOCK_GROUP_RAID10:
> + ret = btrfs_insert_striped_mirrored_raid_extents(trans, ordered_extent, map_type);
> + break;
> + default:
> + btrfs_err(trans->fs_info, "unknown block-group profile %lld",
> + map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
> + ASSERT(0);
Please don't use ASSERT(0); the error is handled and there is no need to
crash here.
> + ret = -EINVAL;
> + break;
> + }
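A minimal userspace sketch of the suggestion above, where the default case
only reports the problem and returns an error for the caller to handle. The
enum values and insert_for_profile() are hypothetical and do not correspond
to the real btrfs block-group flag bits.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical profile values for illustration only. */
enum profile_sketch {
	PROFILE_MIRRORED,
	PROFILE_STRIPED,
	PROFILE_STRIPED_MIRRORED,
};

static int insert_for_profile(int profile)
{
	switch (profile) {
	case PROFILE_MIRRORED:
	case PROFILE_STRIPED:
	case PROFILE_STRIPED_MIRRORED:
		/* The real code would insert the stripe extents here. */
		return 0;
	default:
		/* Report and return the error; do not crash. */
		fprintf(stderr, "unknown block-group profile %d\n", profile);
		return -EINVAL;
	}
}
```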
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v9 00/11] btrfs: introduce RAID stripe tree
2023-09-14 16:06 [PATCH v9 00/11] btrfs: introduce RAID stripe tree Johannes Thumshirn
` (10 preceding siblings ...)
2023-09-14 16:07 ` [PATCH v9 11/11] btrfs: add raid-stripe-tree to features enabled with debug Johannes Thumshirn
@ 2023-09-14 18:25 ` David Sterba
2023-09-20 16:23 ` David Sterba
11 siblings, 1 reply; 29+ messages in thread
From: David Sterba @ 2023-09-14 18:25 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Chris Mason, Josef Bacik, David Sterba, Christoph Hellwig,
Naohiro Aota, Qu Wenruo, Damien Le Moal, linux-btrfs,
linux-kernel
On Thu, Sep 14, 2023 at 09:06:55AM -0700, Johannes Thumshirn wrote:
> Updates of the raid-stripe-tree are done at ordered extent write time to save
> on bandwidth while for reading we do the stripe-tree lookup on bio mapping
> time, i.e. when the logical to physical translation happens for regular btrfs
> RAID as well.
>
> The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
> its contents are the respective physical device id and position.
>
> For an example 1M write (split into 126K segments due to zone-append)
> rapido2:/home/johannes/src/fstests# xfs_io -fdc "pwrite -b 1M 0 1M" -c fsync /mnt/test/test
> wrote 1048576/1048576 bytes at offset 0
> 1 MiB, 1 ops; 0.0065 sec (151.538 MiB/sec and 151.5381 ops/sec)
>
> The tree will look as follows (both 128k buffered writes to a ZNS drive):
>
> RAID0 case:
> bash-5.2# btrfs inspect-internal dump-tree -t raid_stripe /dev/nvme0n1
> btrfs-progs v6.3
> raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
> leaf 805535744 items 1 free space 16218 generation 8 owner RAID_STRIPE_TREE
> leaf 805535744 flags 0x1(WRITTEN) backref revision 1
> checksum stored 2d2d2262
> checksum calced 2d2d2262
> fs uuid ab05cfc6-9859-404e-970d-3999b1cb5438
> chunk uuid c9470ba2-49ac-4d46-8856-438a18e6bd23
> item 0 key (1073741824 RAID_STRIPE_KEY 131072) itemoff 16243 itemsize 56
> encoding: RAID0
> stripe 0 devid 1 offset 805306368 length 131072
> stripe 1 devid 2 offset 536870912 length 131072
> total bytes 42949672960
> bytes used 294912
> uuid ab05cfc6-9859-404e-970d-3999b1cb5438
>
> RAID1 case:
> bash-5.2# btrfs inspect-internal dump-tree -t raid_stripe /dev/nvme0n1
> btrfs-progs v6.3
> raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
> leaf 805535744 items 1 free space 16218 generation 8 owner RAID_STRIPE_TREE
> leaf 805535744 flags 0x1(WRITTEN) backref revision 1
> checksum stored 56199539
> checksum calced 56199539
> fs uuid 9e693a37-fbd1-4891-aed2-e7fe64605045
> chunk uuid 691874fc-1b9c-469b-bd7f-05e0e6ba88c4
> item 0 key (939524096 RAID_STRIPE_KEY 131072) itemoff 16243 itemsize 56
> encoding: RAID1
> stripe 0 devid 1 offset 939524096 length 65536
> stripe 1 devid 2 offset 536870912 length 65536
> total bytes 42949672960
> bytes used 294912
> uuid 9e693a37-fbd1-4891-aed2-e7fe64605045
>
> A design document can be found here:
> https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true
Please also turn it into a developer documentation file (in
btrfs-progs/Documentation/dev); it can follow the same structure.
>
> The user-space part of this series can be found here:
> https://lore.kernel.org/linux-btrfs/20230215143109.2721722-1-johannes.thumshirn@wdc.com
>
> Changes to v8:
> - Changed tracepoints according to David's comments
> - Mark on-disk structures as packed
> - Got rid of __DECLARE_FLEX_ARRAY
> - Rebase onto misc-next
> - Split out helpers for new btrfs_load_block_group_zone_info RAID cases
> - Constify declarations where possible
> - Initialise variables before use
> - Lower scope of variables
> - Remove btrfs_stripe_root() helper
> - Pick different BTRFS_RAID_STRIPE_KEY constant
> - Reorder on-disk encoding types to match the raid_index
> - And possibly more, please git range-diff the versions
> - Link to v8: https://lore.kernel.org/r/20230911-raid-stripe-tree-v8-0-647676fa852c@wdc.com
v9 will be added as a topic branch to for-next. I made several style
changes, so please send any updates as incremental patches if needed.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v9 01/11] btrfs: add raid stripe tree definitions
2023-09-14 16:06 ` [PATCH v9 01/11] btrfs: add raid stripe tree definitions Johannes Thumshirn
@ 2023-09-15 0:22 ` Qu Wenruo
2023-09-15 0:26 ` Qu Wenruo
0 siblings, 1 reply; 29+ messages in thread
From: Qu Wenruo @ 2023-09-15 0:22 UTC (permalink / raw)
To: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Damien Le Moal, linux-btrfs,
linux-kernel
On 2023/9/15 01:36, Johannes Thumshirn wrote:
> Add definitions for the raid stripe tree. This tree will hold information
> about the on-disk layout of the stripes in a RAID set.
>
> Each stripe extent has a 1:1 relationship with an on-disk extent item and
> is doing the logical to per-drive physical address translation for the
> extent item in question.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/accessors.h | 10 ++++++++++
> fs/btrfs/locking.c | 1 +
> include/uapi/linux/btrfs_tree.h | 31 +++++++++++++++++++++++++++++++
> 3 files changed, 42 insertions(+)
>
> diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
> index f958eccff477..977ff160a024 100644
> --- a/fs/btrfs/accessors.h
> +++ b/fs/btrfs/accessors.h
> @@ -306,6 +306,16 @@ BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
> BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
> BTRFS_SETGET_STACK_FUNCS(stack_timespec_nsec, struct btrfs_timespec, nsec, 32);
>
> +BTRFS_SETGET_FUNCS(stripe_extent_encoding, struct btrfs_stripe_extent, encoding, 8);
> +BTRFS_SETGET_FUNCS(raid_stride_devid, struct btrfs_raid_stride, devid, 64);
> +BTRFS_SETGET_FUNCS(raid_stride_physical, struct btrfs_raid_stride, physical, 64);
> +BTRFS_SETGET_FUNCS(raid_stride_length, struct btrfs_raid_stride, length, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_stripe_extent_encoding,
> + struct btrfs_stripe_extent, encoding, 8);
> +BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_devid, struct btrfs_raid_stride, devid, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_physical, struct btrfs_raid_stride, physical, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_length, struct btrfs_raid_stride, length, 64);
> +
> /* struct btrfs_dev_extent */
> BTRFS_SETGET_FUNCS(dev_extent_chunk_tree, struct btrfs_dev_extent, chunk_tree, 64);
> BTRFS_SETGET_FUNCS(dev_extent_chunk_objectid, struct btrfs_dev_extent,
> diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
> index 6ac4fd8cc8dc..74d8e2003f58 100644
> --- a/fs/btrfs/locking.c
> +++ b/fs/btrfs/locking.c
> @@ -74,6 +74,7 @@ static struct btrfs_lockdep_keyset {
> { .id = BTRFS_UUID_TREE_OBJECTID, DEFINE_NAME("uuid") },
> { .id = BTRFS_FREE_SPACE_TREE_OBJECTID, DEFINE_NAME("free-space") },
> { .id = BTRFS_BLOCK_GROUP_TREE_OBJECTID, DEFINE_NAME("block-group") },
> + { .id = BTRFS_RAID_STRIPE_TREE_OBJECTID, DEFINE_NAME("raid-stripe") },
> { .id = 0, DEFINE_NAME("tree") },
> };
>
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index fc3c32186d7e..6d9c43416b6e 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -73,6 +73,9 @@
> /* Holds the block group items for extent tree v2. */
> #define BTRFS_BLOCK_GROUP_TREE_OBJECTID 11ULL
>
> +/* Tracks RAID stripes in block groups. */
> +#define BTRFS_RAID_STRIPE_TREE_OBJECTID 12ULL
> +
> /* device stats in the device tree */
> #define BTRFS_DEV_STATS_OBJECTID 0ULL
>
> @@ -261,6 +264,8 @@
> #define BTRFS_DEV_ITEM_KEY 216
> #define BTRFS_CHUNK_ITEM_KEY 228
>
> +#define BTRFS_RAID_STRIPE_KEY 230
> +
> /*
> * Records the overall state of the qgroups.
> * There's only one instance of this key present,
> @@ -719,6 +724,32 @@ struct btrfs_free_space_header {
> __le64 num_bitmaps;
> } __attribute__ ((__packed__));
>
> +struct btrfs_raid_stride {
> + /* The btrfs device-id this raid extent lives on */
> + __le64 devid;
> + /* The physical location on disk */
> + __le64 physical;
> + /* The length of stride on this disk */
> + __le64 length;
> +} __attribute__ ((__packed__));
> +
> +/* The stripe_extent::encoding, 1:1 mapping of enum btrfs_raid_types */
> +#define BTRFS_STRIPE_RAID0 1
> +#define BTRFS_STRIPE_RAID1 2
> +#define BTRFS_STRIPE_DUP 3
> +#define BTRFS_STRIPE_RAID10 4
> +#define BTRFS_STRIPE_RAID5 5
> +#define BTRFS_STRIPE_RAID6 6
> +#define BTRFS_STRIPE_RAID1C3 7
> +#define BTRFS_STRIPE_RAID1C4 8
> +
> +struct btrfs_stripe_extent {
> + __u8 encoding;
Considering the encoding is, for now, a 1:1 mapping of btrfs_raid_types, and
we normally use a variable like @raid_index for such usage, have you
considered renaming it to "raid_index" or "profile_index" instead?
Another thing is, you may want to add extra tree-checker code to verify
the btrfs_stripe_extent members.
The encoding should always be one of the known values, and the item size
should be checked for alignment.
The same goes for physical/length alignment checks.
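A rough userspace sketch of the checks suggested above, not actual tree-checker code. The struct names, the helper check_stripe_extent() and the host-endian fields are assumptions for illustration; only the layout and the encoding range 1..8 are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the on-disk layout from the patch (host-endian here). */
struct raid_stride {
	uint64_t devid;
	uint64_t physical;
	uint64_t length;
} __attribute__((__packed__));

struct stripe_extent {
	uint8_t encoding;
	uint8_t reserved[7];
	struct raid_stride strides[];
} __attribute__((__packed__));

/* First and last encoding values as defined in the patch. */
#define STRIPE_RAID0	1
#define STRIPE_RAID1C4	8

/* Hypothetical checker: returns true if the item passes all checks. */
static bool check_stripe_extent(const struct stripe_extent *se,
				size_t item_size, uint32_t sectorsize)
{
	size_t nr_strides;

	assert(se != NULL);
	assert(sectorsize != 0);

	/* The item must hold the header plus a whole number (>= 1) of strides. */
	if (item_size < sizeof(*se) + sizeof(struct raid_stride))
		return false;
	if ((item_size - sizeof(*se)) % sizeof(struct raid_stride))
		return false;

	/* The encoding must be one of the known values. */
	if (se->encoding < STRIPE_RAID0 || se->encoding > STRIPE_RAID1C4)
		return false;

	nr_strides = (item_size - sizeof(*se)) / sizeof(struct raid_stride);
	for (size_t i = 0; i < nr_strides; i++) {
		/* Physical start and length must be sector aligned. */
		if (se->strides[i].physical % sectorsize)
			return false;
		if (se->strides[i].length % sectorsize)
			return false;
	}
	return true;
}
```

With this layout, a two-stride item is 8 + 2 * 24 = 56 bytes, which matches the itemsize shown in the RAID0 dump in the cover letter.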
Thanks,
Qu
> + __u8 reserved[7];
> + /* An array of raid strides this stripe is composed of */
> + struct btrfs_raid_stride strides[];
> +} __attribute__ ((__packed__));
> +
> #define BTRFS_HEADER_FLAG_WRITTEN (1ULL << 0)
> #define BTRFS_HEADER_FLAG_RELOC (1ULL << 1)
>
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v9 01/11] btrfs: add raid stripe tree definitions
2023-09-15 0:22 ` Qu Wenruo
@ 2023-09-15 0:26 ` Qu Wenruo
2023-09-15 9:55 ` Johannes Thumshirn
0 siblings, 1 reply; 29+ messages in thread
From: Qu Wenruo @ 2023-09-15 0:26 UTC (permalink / raw)
To: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Damien Le Moal, linux-btrfs,
linux-kernel
On 2023/9/15 09:52, Qu Wenruo wrote:
>
>
> On 2023/9/15 01:36, Johannes Thumshirn wrote:
>> Add definitions for the raid stripe tree. This tree will hold information
>> about the on-disk layout of the stripes in a RAID set.
>>
>> Each stripe extent has a 1:1 relationship with an on-disk extent item and
>> is doing the logical to per-drive physical address translation for the
>> extent item in question.
>>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> ---
>> fs/btrfs/accessors.h | 10 ++++++++++
>> fs/btrfs/locking.c | 1 +
>> include/uapi/linux/btrfs_tree.h | 31 +++++++++++++++++++++++++++++++
>> 3 files changed, 42 insertions(+)
>>
>> diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
>> index f958eccff477..977ff160a024 100644
>> --- a/fs/btrfs/accessors.h
>> +++ b/fs/btrfs/accessors.h
>> @@ -306,6 +306,16 @@ BTRFS_SETGET_FUNCS(timespec_nsec, struct
>> btrfs_timespec, nsec, 32);
>> BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec,
>> sec, 64);
>> BTRFS_SETGET_STACK_FUNCS(stack_timespec_nsec, struct btrfs_timespec,
>> nsec, 32);
>> +BTRFS_SETGET_FUNCS(stripe_extent_encoding, struct
>> btrfs_stripe_extent, encoding, 8);
>> +BTRFS_SETGET_FUNCS(raid_stride_devid, struct btrfs_raid_stride,
>> devid, 64);
>> +BTRFS_SETGET_FUNCS(raid_stride_physical, struct btrfs_raid_stride,
>> physical, 64);
>> +BTRFS_SETGET_FUNCS(raid_stride_length, struct btrfs_raid_stride,
>> length, 64);
>> +BTRFS_SETGET_STACK_FUNCS(stack_stripe_extent_encoding,
>> + struct btrfs_stripe_extent, encoding, 8);
>> +BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_devid, struct
>> btrfs_raid_stride, devid, 64);
>> +BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_physical, struct
>> btrfs_raid_stride, physical, 64);
>> +BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_length, struct
>> btrfs_raid_stride, length, 64);
>> +
>> /* struct btrfs_dev_extent */
>> BTRFS_SETGET_FUNCS(dev_extent_chunk_tree, struct btrfs_dev_extent,
>> chunk_tree, 64);
>> BTRFS_SETGET_FUNCS(dev_extent_chunk_objectid, struct btrfs_dev_extent,
>> diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
>> index 6ac4fd8cc8dc..74d8e2003f58 100644
>> --- a/fs/btrfs/locking.c
>> +++ b/fs/btrfs/locking.c
>> @@ -74,6 +74,7 @@ static struct btrfs_lockdep_keyset {
>> { .id = BTRFS_UUID_TREE_OBJECTID, DEFINE_NAME("uuid") },
>> { .id = BTRFS_FREE_SPACE_TREE_OBJECTID,
>> DEFINE_NAME("free-space") },
>> { .id = BTRFS_BLOCK_GROUP_TREE_OBJECTID,
>> DEFINE_NAME("block-group") },
>> + { .id = BTRFS_RAID_STRIPE_TREE_OBJECTID,
>> DEFINE_NAME("raid-stripe") },
>> { .id = 0, DEFINE_NAME("tree") },
>> };
>> diff --git a/include/uapi/linux/btrfs_tree.h
>> b/include/uapi/linux/btrfs_tree.h
>> index fc3c32186d7e..6d9c43416b6e 100644
>> --- a/include/uapi/linux/btrfs_tree.h
>> +++ b/include/uapi/linux/btrfs_tree.h
>> @@ -73,6 +73,9 @@
>> /* Holds the block group items for extent tree v2. */
>> #define BTRFS_BLOCK_GROUP_TREE_OBJECTID 11ULL
>> +/* Tracks RAID stripes in block groups. */
>> +#define BTRFS_RAID_STRIPE_TREE_OBJECTID 12ULL
>> +
>> /* device stats in the device tree */
>> #define BTRFS_DEV_STATS_OBJECTID 0ULL
>> @@ -261,6 +264,8 @@
>> #define BTRFS_DEV_ITEM_KEY 216
>> #define BTRFS_CHUNK_ITEM_KEY 228
>> +#define BTRFS_RAID_STRIPE_KEY 230
>> +
>> /*
>> * Records the overall state of the qgroups.
>> * There's only one instance of this key present,
>> @@ -719,6 +724,32 @@ struct btrfs_free_space_header {
>> __le64 num_bitmaps;
>> } __attribute__ ((__packed__));
>> +struct btrfs_raid_stride {
>> + /* The btrfs device-id this raid extent lives on */
>> + __le64 devid;
>> + /* The physical location on disk */
>> + __le64 physical;
>> + /* The length of stride on this disk */
>> + __le64 length;
Forgot to mention: for the btrfs_stripe_extent structure, its key is
(PHYSICAL, RAID_STRIPE_KEY, LENGTH), right?
If so, isn't the length in btrfs_raid_stride duplicated, meaning we could
save 8 bytes?
Thanks,
Qu
>> +} __attribute__ ((__packed__));
>> +
>> +/* The stripe_extent::encoding, 1:1 mapping of enum btrfs_raid_types */
>> +#define BTRFS_STRIPE_RAID0 1
>> +#define BTRFS_STRIPE_RAID1 2
>> +#define BTRFS_STRIPE_DUP 3
>> +#define BTRFS_STRIPE_RAID10 4
>> +#define BTRFS_STRIPE_RAID5 5
>> +#define BTRFS_STRIPE_RAID6 6
>> +#define BTRFS_STRIPE_RAID1C3 7
>> +#define BTRFS_STRIPE_RAID1C4 8
>> +
>> +struct btrfs_stripe_extent {
>> + __u8 encoding;
>
> Considering the encoding is, for now, a 1:1 mapping of btrfs_raid_types, and
> we normally use a variable like @raid_index for such usage, have you
> considered renaming it to "raid_index" or "profile_index" instead?
>
> Another thing is, you may want to add extra tree-checker code to verify
> the btrfs_stripe_extent members.
>
> The encoding should always be one of the known values, and the item size
> should be checked for alignment.
>
> The same goes for physical/length alignment checks.
>
> Thanks,
> Qu
>> + __u8 reserved[7];
>> + /* An array of raid strides this stripe is composed of */
>> + struct btrfs_raid_stride strides[];
>> +} __attribute__ ((__packed__));
>> +
>> #define BTRFS_HEADER_FLAG_WRITTEN (1ULL << 0)
>> #define BTRFS_HEADER_FLAG_RELOC (1ULL << 1)
>>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v9 03/11] btrfs: add support for inserting raid stripe extents
2023-09-14 16:06 ` [PATCH v9 03/11] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
2023-09-14 18:07 ` David Sterba
2023-09-14 18:10 ` David Sterba
@ 2023-09-15 0:55 ` Qu Wenruo
2023-09-19 12:13 ` Johannes Thumshirn
2 siblings, 1 reply; 29+ messages in thread
From: Qu Wenruo @ 2023-09-15 0:55 UTC (permalink / raw)
To: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel
On 2023/9/15 01:36, Johannes Thumshirn wrote:
> Add support for inserting stripe extents into the raid stripe tree on
> completion of every write that needs an extra logical-to-physical
> translation when using RAID.
>
> Inserting the stripe extents happens after the data I/O has completed,
> this is done to a) support zone-append and b) rule out the possibility of
> a RAID-write-hole.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/Makefile | 2 +-
> fs/btrfs/bio.c | 23 +++++
> fs/btrfs/extent-tree.c | 1 +
> fs/btrfs/inode.c | 8 +-
> fs/btrfs/ordered-data.c | 1 +
> fs/btrfs/ordered-data.h | 2 +
> fs/btrfs/raid-stripe-tree.c | 245 ++++++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/raid-stripe-tree.h | 34 ++++++
> fs/btrfs/volumes.c | 4 +-
> fs/btrfs/volumes.h | 15 +--
> 10 files changed, 326 insertions(+), 9 deletions(-)
>
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index c57d80729d4f..525af975f61c 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -33,7 +33,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
> uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
> block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
> subpage.o tree-mod-log.o extent-io-tree.o fs.o messages.o bio.o \
> - lru_cache.o
> + lru_cache.o raid-stripe-tree.o
>
> btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
> btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 31ff36990404..ddbe6f8d4ea2 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -14,6 +14,7 @@
> #include "rcu-string.h"
> #include "zoned.h"
> #include "file-item.h"
> +#include "raid-stripe-tree.h"
>
> static struct bio_set btrfs_bioset;
> static struct bio_set btrfs_clone_bioset;
> @@ -415,6 +416,9 @@ static void btrfs_orig_write_end_io(struct bio *bio)
> else
> bio->bi_status = BLK_STS_OK;
>
> + if (bio_op(bio) == REQ_OP_ZONE_APPEND && !bio->bi_status)
> + stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> +
> btrfs_orig_bbio_end_io(bbio);
> btrfs_put_bioc(bioc);
> }
> @@ -426,6 +430,8 @@ static void btrfs_clone_write_end_io(struct bio *bio)
> if (bio->bi_status) {
> atomic_inc(&stripe->bioc->error);
> btrfs_log_dev_io_error(bio, stripe->dev);
> + } else if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> + stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> }
>
> /* Pass on control to the original bio this one was cloned from */
> @@ -487,6 +493,7 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
> bio->bi_private = &bioc->stripes[dev_nr];
> bio->bi_iter.bi_sector = bioc->stripes[dev_nr].physical >> SECTOR_SHIFT;
> bioc->stripes[dev_nr].bioc = bioc;
> + bioc->size = bio->bi_iter.bi_size;
> btrfs_submit_dev_bio(bioc->stripes[dev_nr].dev, bio);
> }
>
> @@ -496,6 +503,8 @@ static void __btrfs_submit_bio(struct bio *bio, struct btrfs_io_context *bioc,
> if (!bioc) {
> /* Single mirror read/write fast path. */
> btrfs_bio(bio)->mirror_num = mirror_num;
> + if (bio_op(bio) != REQ_OP_READ)
> + btrfs_bio(bio)->orig_physical = smap->physical;
> bio->bi_iter.bi_sector = smap->physical >> SECTOR_SHIFT;
> if (bio_op(bio) != REQ_OP_READ)
> btrfs_bio(bio)->orig_physical = smap->physical;
> @@ -688,6 +697,20 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
> bio->bi_opf |= REQ_OP_ZONE_APPEND;
> }
>
> + if (is_data_bbio(bbio) && bioc &&
> + btrfs_need_stripe_tree_update(bioc->fs_info,
> + bioc->map_type)) {
> + /*
> + * No locking for the list update, as we only add to
> + * the list in the I/O submission path, and list
> + * iteration only happens in the completion path,
> + * which can't happen until after the last submission.
> + */
> + btrfs_get_bioc(bioc);
> + list_add_tail(&bioc->ordered_entry,
> + &bbio->ordered->bioc_list);
> + }
> +
> /*
> * Csum items for reloc roots have already been cloned at this
> * point, so they are handled as part of the no-checksum case.
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index cb12bfb047e7..959d7449ea0d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -42,6 +42,7 @@
> #include "file-item.h"
> #include "orphan.h"
> #include "tree-checker.h"
> +#include "raid-stripe-tree.h"
>
> #undef SCRAMBLE_DELAYED_REFS
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e02a5ba5b533..b5e0ed3a36f7 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -71,6 +71,7 @@
> #include "super.h"
> #include "orphan.h"
> #include "backref.h"
> +#include "raid-stripe-tree.h"
>
> struct btrfs_iget_args {
> u64 ino;
> @@ -3091,6 +3092,10 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
>
> trans->block_rsv = &inode->block_rsv;
>
> + ret = btrfs_insert_raid_extent(trans, ordered_extent);
> + if (ret)
> + goto out;
> +
> if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered_extent->flags))
> compress_type = ordered_extent->compress_type;
> if (test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags)) {
> @@ -3224,7 +3229,8 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
> int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered)
> {
> if (btrfs_is_zoned(btrfs_sb(ordered->inode->i_sb)) &&
> - !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags))
> + !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags) &&
> + list_empty(&ordered->bioc_list))
> btrfs_finish_ordered_zoned(ordered);
> return btrfs_finish_one_ordered(ordered);
> }
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 345c449d588c..55c7d5543265 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -191,6 +191,7 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> INIT_LIST_HEAD(&entry->log_list);
> INIT_LIST_HEAD(&entry->root_extent_list);
> INIT_LIST_HEAD(&entry->work_list);
> + INIT_LIST_HEAD(&entry->bioc_list);
> init_completion(&entry->completion);
>
> /*
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index 173bd5c5df26..1c51ac57e5df 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -151,6 +151,8 @@ struct btrfs_ordered_extent {
> struct completion completion;
> struct btrfs_work flush_work;
> struct list_head work_list;
> +
> + struct list_head bioc_list;
> };
>
> static inline void
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> new file mode 100644
> index 000000000000..7cdcc45a8796
> --- /dev/null
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -0,0 +1,245 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2023 Western Digital Corporation or its affiliates.
> + */
> +
> +#include <linux/btrfs_tree.h>
> +
> +#include "ctree.h"
> +#include "fs.h"
> +#include "accessors.h"
> +#include "transaction.h"
> +#include "disk-io.h"
> +#include "raid-stripe-tree.h"
> +#include "volumes.h"
> +#include "misc.h"
> +#include "print-tree.h"
> +
> +static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
> + int num_stripes,
> + struct btrfs_io_context *bioc)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_key stripe_key;
> + struct btrfs_root *stripe_root = fs_info->stripe_root;
> + u8 encoding = btrfs_bg_flags_to_raid_index(bioc->map_type);
> + struct btrfs_stripe_extent *stripe_extent;
> + const size_t item_size = struct_size(stripe_extent, strides, num_stripes);
> + int ret;
> +
> + stripe_extent = kzalloc(item_size, GFP_NOFS);
> + if (!stripe_extent) {
> + btrfs_abort_transaction(trans, -ENOMEM);
> + btrfs_end_transaction(trans);
> + return -ENOMEM;
> + }
> +
> + btrfs_set_stack_stripe_extent_encoding(stripe_extent, encoding);
> + for (int i = 0; i < num_stripes; i++) {
> + u64 devid = bioc->stripes[i].dev->devid;
> + u64 physical = bioc->stripes[i].physical;
> + u64 length = bioc->stripes[i].length;
> + struct btrfs_raid_stride *raid_stride =
> + &stripe_extent->strides[i];
> +
> + if (length == 0)
> + length = bioc->size;
> +
> + btrfs_set_stack_raid_stride_devid(raid_stride, devid);
> + btrfs_set_stack_raid_stride_physical(raid_stride, physical);
> + btrfs_set_stack_raid_stride_length(raid_stride, length);
> + }
> +
> + stripe_key.objectid = bioc->logical;
> + stripe_key.type = BTRFS_RAID_STRIPE_KEY;
> + stripe_key.offset = bioc->size;
> +
> + ret = btrfs_insert_item(trans, stripe_root, &stripe_key, stripe_extent,
> + item_size);
> + if (ret)
> + btrfs_abort_transaction(trans, ret);
> +
> + kfree(stripe_extent);
> +
> + return ret;
> +}
> +
> +static int btrfs_insert_mirrored_raid_extents(struct btrfs_trans_handle *trans,
> + struct btrfs_ordered_extent *ordered,
> + u64 map_type)
> +{
> + int num_stripes = btrfs_bg_type_to_factor(map_type);
> + struct btrfs_io_context *bioc;
> + int ret;
> +
> + list_for_each_entry(bioc, &ordered->bioc_list, ordered_entry) {
> + ret = btrfs_insert_one_raid_extent(trans, num_stripes, bioc);
> + if (ret)
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +static int btrfs_insert_striped_mirrored_raid_extents(
> + struct btrfs_trans_handle *trans,
> + struct btrfs_ordered_extent *ordered,
> + u64 map_type)
> +{
> + struct btrfs_io_context *bioc;
> + struct btrfs_io_context *rbioc;
> + const int nstripes = list_count_nodes(&ordered->bioc_list);
> + const int index = btrfs_bg_flags_to_raid_index(map_type);
> + const int substripes = btrfs_raid_array[index].sub_stripes;
> + const int max_stripes =
> + trans->fs_info->fs_devices->rw_devices / substripes;
> + int left = nstripes;
> + int i;
> + int ret = 0;
> + u64 stripe_end;
> + u64 prev_end;
> +
> + if (nstripes == 1)
> + return btrfs_insert_mirrored_raid_extents(trans, ordered, map_type);
> +
> + rbioc = kzalloc(struct_size(rbioc, stripes, nstripes * substripes),
> + GFP_NOFS);
> + if (!rbioc)
> + return -ENOMEM;
> +
> + rbioc->map_type = map_type;
> + rbioc->logical = list_first_entry(&ordered->bioc_list, typeof(*rbioc),
> + ordered_entry)->logical;
> +
> + stripe_end = rbioc->logical;
> + prev_end = stripe_end;
> + i = 0;
> + list_for_each_entry(bioc, &ordered->bioc_list, ordered_entry) {
> +
> + rbioc->size += bioc->size;
> + for (int j = 0; j < substripes; j++) {
> + int stripe = i + j;
> + rbioc->stripes[stripe].dev = bioc->stripes[j].dev;
> + rbioc->stripes[stripe].physical = bioc->stripes[j].physical;
> + rbioc->stripes[stripe].length = bioc->size;
> + }
> +
> + stripe_end += rbioc->size;
> + if (i >= nstripes ||
> + (stripe_end - prev_end >= max_stripes * BTRFS_STRIPE_LEN)) {
> + ret = btrfs_insert_one_raid_extent(trans,
> + nstripes * substripes,
> + rbioc);
> + if (ret)
> + goto out;
> +
> + left -= nstripes;
> + i = 0;
> + rbioc->logical += rbioc->size;
> + rbioc->size = 0;
> + } else {
> + i += substripes;
> + prev_end = stripe_end;
> + }
> + }
> +
> + if (left) {
> + bioc = list_prev_entry(bioc, ordered_entry);
> + ret = btrfs_insert_one_raid_extent(trans, substripes, bioc);
> + }
> +
> +out:
> + kfree(rbioc);
> + return ret;
> +}
> +
> +static int btrfs_insert_striped_raid_extents(struct btrfs_trans_handle *trans,
> + struct btrfs_ordered_extent *ordered,
> + u64 map_type)
> +{
> + struct btrfs_io_context *bioc;
> + struct btrfs_io_context *rbioc;
> + const int nstripes = list_count_nodes(&ordered->bioc_list);
> + int i;
> + int ret = 0;
> +
> + rbioc = kzalloc(struct_size(rbioc, stripes, nstripes), GFP_NOFS);
> + if (!rbioc)
> + return -ENOMEM;
> + rbioc->map_type = map_type;
> + rbioc->logical = list_first_entry(&ordered->bioc_list, typeof(*rbioc),
> + ordered_entry)->logical;
> +
> + i = 0;
> + list_for_each_entry(bioc, &ordered->bioc_list, ordered_entry) {
> + rbioc->size += bioc->size;
> + rbioc->stripes[i].dev = bioc->stripes[0].dev;
> + rbioc->stripes[i].physical = bioc->stripes[0].physical;
> + rbioc->stripes[i].length = bioc->size;
> +
> + if (i == nstripes - 1) {
> + ret = btrfs_insert_one_raid_extent(trans, nstripes, rbioc);
> + if (ret)
> + goto out;
> +
> + i = 0;
> + rbioc->logical += rbioc->size;
> + rbioc->size = 0;
> + } else {
> + i++;
> + }
> + }
> +
> + if (i && i < nstripes - 1)
> + ret = btrfs_insert_one_raid_extent(trans, i, rbioc);
> +
> +out:
> + kfree(rbioc);
> + return ret;
> +}
> +
> +int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
> + struct btrfs_ordered_extent *ordered_extent)
> +{
> + struct btrfs_io_context *bioc;
> + u64 map_type;
> + int ret;
> +
> + if (!trans->fs_info->stripe_root)
> + return 0;
> +
> + map_type = list_first_entry(&ordered_extent->bioc_list, typeof(*bioc),
> + ordered_entry)->map_type;
> +
> + switch (map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
> + case BTRFS_BLOCK_GROUP_DUP:
> + case BTRFS_BLOCK_GROUP_RAID1:
> + case BTRFS_BLOCK_GROUP_RAID1C3:
> + case BTRFS_BLOCK_GROUP_RAID1C4:
> + ret = btrfs_insert_mirrored_raid_extents(trans, ordered_extent,
> + map_type);
> + break;
> + case BTRFS_BLOCK_GROUP_RAID0:
> + ret = btrfs_insert_striped_raid_extents(trans, ordered_extent,
> + map_type);
> + break;
> + case BTRFS_BLOCK_GROUP_RAID10:
> + ret = btrfs_insert_striped_mirrored_raid_extents(trans, ordered_extent, map_type);
> + break;
> + default:
> + btrfs_err(trans->fs_info, "unknown block-group profile %lld",
> + map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
> + ASSERT(0);
> + ret = -EINVAL;
> + break;
> + }
> +
> + while (!list_empty(&ordered_extent->bioc_list)) {
> + bioc = list_first_entry(&ordered_extent->bioc_list,
> + typeof(*bioc), ordered_entry);
> + list_del(&bioc->ordered_entry);
> + btrfs_put_bioc(bioc);
> + }
> +
> + return ret;
> +}
> diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
> new file mode 100644
> index 000000000000..884f0e99d5e8
> --- /dev/null
> +++ b/fs/btrfs/raid-stripe-tree.h
> @@ -0,0 +1,34 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2023 Western Digital Corporation or its affiliates.
> + */
> +
> +#ifndef BTRFS_RAID_STRIPE_TREE_H
> +#define BTRFS_RAID_STRIPE_TREE_H
> +
> +struct btrfs_io_context;
> +struct btrfs_io_stripe;
> +struct btrfs_ordered_extent;
> +struct btrfs_trans_handle;
> +
> +int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
> + struct btrfs_ordered_extent *ordered_extent);
> +
> +static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
> + u64 map_type)
> +{
> + u64 type = map_type & BTRFS_BLOCK_GROUP_TYPE_MASK;
> + u64 profile = map_type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
> +
> + if (!fs_info->stripe_root)
> + return false;
I found a corner case where this can be problematic.
If we have a fs with a corrupted RST root tree node/leaf, mounted with
rescue=ibadroots, then fs_info->stripe_root would be NULL, and in the
5th patch, inside set_io_stripe(), we would just fall back to the regular
non-RST path.
This would mostly return incorrect data (and can be very problematic
for nodatacsum files).
Thus stripe_root itself is not a reliable way to determine whether we're in
the RST routine; I'd say only the super incompat flags are reliable.
And fs_info->stripe_root should only be checked in functions that do
RST tree operations, which should return -EIO properly if it's not
initialized.
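To make the suggestion concrete, here is a minimal userspace model of it, not kernel code. The struct fs_info_model, the flag name INCOMPAT_RAID_STRIPE_TREE with its bit value, and both helpers are hypothetical stand-ins; the point is only that the decision comes from the on-disk feature flag, while a missing root turns into -EIO instead of a silent fallback.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit standing in for the RST incompat flag. */
#define INCOMPAT_RAID_STRIPE_TREE	(1ULL << 14)

/* Hypothetical, heavily reduced model of struct btrfs_fs_info. */
struct fs_info_model {
	uint64_t incompat_flags;	/* as read from the superblock */
	void *stripe_root;		/* NULL if the RST root failed to load */
};

/* Decide by the on-disk feature flag, not by whether the root got loaded. */
static bool fs_uses_rst(const struct fs_info_model *fs)
{
	return (fs->incompat_flags & INCOMPAT_RAID_STRIPE_TREE) != 0;
}

/*
 * Returns 0 if the regular (non-RST) mapping applies, 1 if an RST lookup
 * must be done, and -EIO if the fs uses RST but the root is unavailable
 * (e.g. skipped because of rescue=ibadroots).
 */
static int rst_lookup_mode(const struct fs_info_model *fs)
{
	assert(fs != NULL);

	if (!fs_uses_rst(fs))
		return 0;
	if (!fs->stripe_root)
		return -EIO;
	return 1;
}
```

In this model, the corrupted-root case returns an error to the caller rather than mapping through the chunk tree alone, which is what would otherwise hand back wrong data.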
> +
> + if (type != BTRFS_BLOCK_GROUP_DATA)
> + return false;
> +
> + if (profile & BTRFS_BLOCK_GROUP_RAID1_MASK)
> + return true;
Just a stupid question: RAID0 DATA doesn't need RST purely because it is
the same as SINGLE, so we only update the file items to the real
written logical address, and there is no need for the extra mapping?
Thus only profiles with duplication rely on RST, right?
If so, then I guess DUP should also be covered by RST.
> +
> + return false;
> +}
> +#endif
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index a1eae8b5b412..c2bac87912c7 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5984,6 +5984,7 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
> }
>
> static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
> + u64 logical,
> u16 total_stripes)
> {
> struct btrfs_io_context *bioc;
> @@ -6003,6 +6004,7 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
> bioc->fs_info = fs_info;
> bioc->replace_stripe_src = -1;
> bioc->full_stripe_logical = (u64)-1;
> + bioc->logical = logical;
>
> return bioc;
> }
> @@ -6537,7 +6539,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> goto out;
> }
>
> - bioc = alloc_btrfs_io_context(fs_info, num_alloc_stripes);
> + bioc = alloc_btrfs_io_context(fs_info, logical, num_alloc_stripes);
> if (!bioc) {
> ret = -ENOMEM;
> goto out;
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 26397adc8706..2043aff6e966 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -390,12 +390,11 @@ struct btrfs_fs_devices {
>
> struct btrfs_io_stripe {
> struct btrfs_device *dev;
> - union {
> - /* Block mapping */
> - u64 physical;
> - /* For the endio handler */
> - struct btrfs_io_context *bioc;
> - };
> + /* Block mapping */
> + u64 physical;
> + u64 length;
> + /* For the endio handler */
> + struct btrfs_io_context *bioc;
> };
>
> struct btrfs_discard_stripe {
> @@ -428,6 +427,10 @@ struct btrfs_io_context {
> atomic_t error;
> u16 max_errors;
>
> + u64 logical;
> + u64 size;
> + struct list_head ordered_entry;
Considering this is only utilized by RST, can we rename it to be more
specific?
Like rst_ordered_entry?
Otherwise I'm pretty sure that just weeks later I would need to dig to see
what this list is used for.
Thanks,
Qu
> +
> /*
> * The total number of stripes, including the extra duplicated
> * stripe for replace.
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v9 06/11] btrfs: implement RST version of scrub
2023-09-14 16:07 ` [PATCH v9 06/11] btrfs: implement RST version of scrub Johannes Thumshirn
@ 2023-09-15 0:58 ` Qu Wenruo
2023-09-15 14:11 ` David Sterba
0 siblings, 1 reply; 29+ messages in thread
From: Qu Wenruo @ 2023-09-15 0:58 UTC (permalink / raw)
To: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel
On 2023/9/15 01:37, Johannes Thumshirn wrote:
> A filesystem that uses the RAID stripe tree for logical to physical
> address translation can't use the regular scrub path, that reads all
> stripes and then checks if a sector is unused afterwards.
>
> When using the RAID stripe tree, this will result in lookup errors, as the
> stripe tree doesn't know the requested logical addresses.
>
> Instead, look up stripes that are backed by the extent bitmap.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/bio.c | 2 ++
> fs/btrfs/raid-stripe-tree.c | 8 ++++++-
> fs/btrfs/scrub.c | 53 +++++++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/volumes.h | 1 +
> 4 files changed, 63 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index ddbe6f8d4ea2..bdb6e3effdbb 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -663,6 +663,8 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
> blk_status_t ret;
> int error;
>
> + smap.is_scrub = !bbio->inode;
> +
> btrfs_bio_counter_inc_blocked(fs_info);
> error = btrfs_map_block(fs_info, btrfs_op(bio), logical, &map_length,
> &bioc, &smap, &mirror_num, 1);
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index 697a6e1fd255..63bf62c33436 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -334,6 +334,11 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
> if (!path)
> return -ENOMEM;
>
> + if (stripe->is_scrub) {
> + path->skip_locking = 1;
> + path->search_commit_root = 1;
> + }
> +
> ret = btrfs_search_slot(NULL, stripe_root, &stripe_key, path, 0, 0);
> if (ret < 0)
> goto free_path;
> @@ -420,7 +425,8 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
> out:
> if (ret > 0)
> ret = -ENOENT;
> - if (ret && ret != -EIO) {
> + if (ret && ret != -EIO && !stripe->is_scrub) {
> +
One extra newline.
And why doesn't the scrub path need the warning?
IIRC if our RST doesn't match the extent tree, it can be a problem and we
need some error messages.
Thanks,
Qu
> if (IS_ENABLED(CONFIG_BTRFS_DEBUG))
> btrfs_print_tree(leaf, 1);
> btrfs_err(fs_info,
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index f16220ce5fba..42948b66d4be 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -23,6 +23,7 @@
> #include "accessors.h"
> #include "file-item.h"
> #include "scrub.h"
> +#include "raid-stripe-tree.h"
>
> /*
> * This is only the first step towards a full-features scrub. It reads all
> @@ -1634,6 +1635,53 @@ static void scrub_reset_stripe(struct scrub_stripe *stripe)
> }
> }
>
> +static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
> + struct scrub_stripe *stripe)
> +{
> + struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
> + struct btrfs_bio *bbio = NULL;
> + int mirror = stripe->mirror_num;
> + int i;
> +
> + atomic_inc(&stripe->pending_io);
> +
> + for_each_set_bit(i, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
> + struct page *page = scrub_stripe_get_page(stripe, i);
> + unsigned int pgoff = scrub_stripe_get_page_offset(stripe, i);
> +
> + /* The current sector cannot be merged, submit the bio. */
> + if (bbio &&
> + ((i > 0 && !test_bit(i - 1, &stripe->extent_sector_bitmap)) ||
> + bbio->bio.bi_iter.bi_size >= BTRFS_STRIPE_LEN)) {
> + ASSERT(bbio->bio.bi_iter.bi_size);
> + atomic_inc(&stripe->pending_io);
> + btrfs_submit_bio(bbio, mirror);
> + bbio = NULL;
> + }
> +
> + if (!bbio) {
> + bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
> + fs_info, scrub_read_endio, stripe);
> + bbio->bio.bi_iter.bi_sector = (stripe->logical +
> + (i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT;
> + }
> +
> + __bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff);
> + }
> +
> + if (bbio) {
> + ASSERT(bbio->bio.bi_iter.bi_size);
> + atomic_inc(&stripe->pending_io);
> + btrfs_submit_bio(bbio, mirror);
> + }
> +
> + if (atomic_dec_and_test(&stripe->pending_io)) {
> + wake_up(&stripe->io_wait);
> + INIT_WORK(&stripe->work, scrub_stripe_read_repair_worker);
> + queue_work(stripe->bg->fs_info->scrub_workers, &stripe->work);
> + }
> +}
> +
> static void scrub_submit_initial_read(struct scrub_ctx *sctx,
> struct scrub_stripe *stripe)
> {
> @@ -1645,6 +1693,11 @@ static void scrub_submit_initial_read(struct scrub_ctx *sctx,
> ASSERT(stripe->mirror_num > 0);
> ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state));
>
> + if (btrfs_need_stripe_tree_update(fs_info, stripe->bg->flags)) {
> + scrub_submit_extent_sector_read(sctx, stripe);
> + return;
> + }
> +
> bbio = btrfs_bio_alloc(SCRUB_STRIPE_PAGES, REQ_OP_READ, fs_info,
> scrub_read_endio, stripe);
>
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 2043aff6e966..067859de8f4c 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -393,6 +393,7 @@ struct btrfs_io_stripe {
> /* Block mapping */
> u64 physical;
> u64 length;
> + bool is_scrub;
> /* For the endio handler */
> struct btrfs_io_context *bioc;
> };
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v9 01/11] btrfs: add raid stripe tree definitions
2023-09-15 0:26 ` Qu Wenruo
@ 2023-09-15 9:55 ` Johannes Thumshirn
2023-09-15 10:33 ` Qu Wenruo
0 siblings, 1 reply; 29+ messages in thread
From: Johannes Thumshirn @ 2023-09-15 9:55 UTC (permalink / raw)
To: Qu Wenruo, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Damien Le Moal,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 15.09.23 02:27, Qu Wenruo wrote:
>>> /*
>>> * Records the overall state of the qgroups.
>>> * There's only one instance of this key present,
>>> @@ -719,6 +724,32 @@ struct btrfs_free_space_header {
>>> __le64 num_bitmaps;
>>> } __attribute__ ((__packed__));
>>> +struct btrfs_raid_stride {
>>> + /* The btrfs device-id this raid extent lives on */
>>> + __le64 devid;
>>> + /* The physical location on disk */
>>> + __le64 physical;
>>> + /* The length of stride on this disk */
>>> + __le64 length;
>
> Forgot to mention, for btrfs_stripe_extent structure, its key is
> (PHYSICAL, RAID_STRIPE_KEY, LENGTH) right?
>
> So is the length in the btrfs_raid_stride duplicated and we can save 8
> bytes?
Nope. The length in the key is the stripe length. The length in the
stride is the stride length.
Here's an example for why this is needed:
wrote 32768/32768 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 131072/131072 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 8192/8192 bytes at offset 65536
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
[snip]
item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 32
encoding: RAID0
stripe 0 devid 1 physical XXXXXXXXX length 32768
item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX
itemsize 80
encoding: RAID0
stripe 0 devid 1 physical XXXXXXXXX length 32768
stripe 1 devid 2 physical XXXXXXXXX length 65536
stripe 2 devid 1 physical XXXXXXXXX length 32768
item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 32
encoding: RAID0
stripe 0 devid 1 physical XXXXXXXXX length 8192
Without the length in the stride, we don't know when to select the next
stride in item 1 above.
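To make that concrete, here is a minimal userspace sketch (not the kernel's lookup code) of how a reader could walk the strides of an item like item 1. The names `struct stride` and `find_stride()` are made up for illustration, and the physical addresses are left as placeholders because they are elided in the dump above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory copy of one stride; mirrors the on-disk fields. */
struct stride {
	uint64_t devid;
	uint64_t physical;
	uint64_t length;
};

/*
 * Return the index of the stride that covers byte @offset of the stripe
 * extent, or -1 if @offset lies past the end of the item. On success
 * @offset_in_stride receives the offset relative to that stride's start.
 */
static int find_stride(const struct stride *strides, size_t nr,
		       uint64_t offset, uint64_t *offset_in_stride)
{
	assert(strides && offset_in_stride);

	for (size_t i = 0; i < nr; i++) {
		/* Without a per-stride length this comparison is impossible. */
		if (offset < strides[i].length) {
			*offset_in_stride = offset;
			return (int)i;
		}
		offset -= strides[i].length;
	}
	return -1;
}
```

With the three strides of item 1 (32768, 65536 and 32768 bytes), offset 32768 resolves to stride 1 on devid 2 and offset 98304 to stride 2 on devid 1. The key length of 131072 alone cannot express either boundary.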
* Re: [PATCH v9 03/11] btrfs: add support for inserting raid stripe extents
2023-09-14 18:07 ` David Sterba
@ 2023-09-15 10:03 ` Geert Uytterhoeven
0 siblings, 0 replies; 29+ messages in thread
From: Geert Uytterhoeven @ 2023-09-15 10:03 UTC (permalink / raw)
To: David Sterba
Cc: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba,
Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel
Hi David,
On Thu, 14 Sep 2023, David Sterba wrote:
> On Thu, Sep 14, 2023 at 09:06:58AM -0700, Johannes Thumshirn wrote:
>> Add support for inserting stripe extents into the raid stripe tree on
>> completion of every write that needs an extra logical-to-physical
>> translation when using RAID.
>>
>> Inserting the stripe extents happens after the data I/O has completed,
>> this is done to a) support zone-append and b) rule out the possibility of
>> a RAID-write-hole.
>>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> --- /dev/null
>> +++ b/fs/btrfs/raid-stripe-tree.c
>> +static int btrfs_insert_striped_mirrored_raid_extents(
>> + struct btrfs_trans_handle *trans,
>> + struct btrfs_ordered_extent *ordered,
>> + u64 map_type)
>> +{
>> + struct btrfs_io_context *bioc;
>> + struct btrfs_io_context *rbioc;
>> + const int nstripes = list_count_nodes(&ordered->bioc_list);
>> + const int index = btrfs_bg_flags_to_raid_index(map_type);
>> + const int substripes = btrfs_raid_array[index].sub_stripes;
>> + const int max_stripes =
>> + trans->fs_info->fs_devices->rw_devices / substripes;
>
> This will probably warn due to u64/u32 division.
Worse, it causes link failures in linux-next, as e.g. reported by
noreply@ellerman.id.au:
ERROR: modpost: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!
So despite being aware of the issue, you still queued it?
The use of "int" for almost all variables is also a red flag:
- list_count_nodes() returns size_t,
- btrfs_bg_flags_to_raid_index() returns an enum,
- btrfs_raid_array[index].sub_stripes is u8,
- The result of the division may not fit in 32-bit.
Thanks for fixing, soon! ;-)
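For readers who have not hit this before: a plain `/` with a 64-bit dividend makes the compiler emit a call to the libgcc helper __udivdi3 on some 32-bit targets, and that symbol is what modpost reports as undefined. The usual kernel answer is div_u64() from <linux/math64.h>, which takes a u64 dividend and a u32 divisor. The block below is only a userspace sketch of that typing: `div_u64_sketch()` and `max_stripes_for()` are hypothetical names, and the stand-in itself uses `/` because userspace has libgcc available.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's div_u64(u64 dividend, u32 divisor). */
static uint64_t div_u64_sketch(uint64_t dividend, uint32_t divisor)
{
	assert(divisor != 0);
	return dividend / divisor;
}

/*
 * Hypothetical helper: number of stripes available, given the count of
 * writable devices (a u64) and the profile's sub_stripes (a u8). The
 * result stays 64-bit so nothing is silently truncated into an int.
 */
static uint64_t max_stripes_for(uint64_t rw_devices, uint8_t sub_stripes)
{
	return div_u64_sketch(rw_devices, sub_stripes);
}
```

Keeping the result in a u64 (or checking the range before narrowing) also covers the last point in the list above.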
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
* Re: [PATCH v9 01/11] btrfs: add raid stripe tree definitions
2023-09-15 9:55 ` Johannes Thumshirn
@ 2023-09-15 10:33 ` Qu Wenruo
2023-09-15 10:46 ` Johannes Thumshirn
2023-10-02 9:32 ` Johannes Thumshirn
0 siblings, 2 replies; 29+ messages in thread
From: Qu Wenruo @ 2023-09-15 10:33 UTC (permalink / raw)
To: Johannes Thumshirn, Qu Wenruo, Chris Mason, Josef Bacik,
David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Damien Le Moal,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 2023/9/15 19:25, Johannes Thumshirn wrote:
> On 15.09.23 02:27, Qu Wenruo wrote:
>>>> /*
>>>> * Records the overall state of the qgroups.
>>>> * There's only one instance of this key present,
>>>> @@ -719,6 +724,32 @@ struct btrfs_free_space_header {
>>>> __le64 num_bitmaps;
>>>> } __attribute__ ((__packed__));
>>>> +struct btrfs_raid_stride {
>>>> + /* The btrfs device-id this raid extent lives on */
>>>> + __le64 devid;
>>>> + /* The physical location on disk */
>>>> + __le64 physical;
>>>> + /* The length of stride on this disk */
>>>> + __le64 length;
>>
>> Forgot to mention, for btrfs_stripe_extent structure, its key is
>> (PHYSICAL, RAID_STRIPE_KEY, LENGTH) right?
>>
>> So is the length in the btrfs_raid_stride duplicated and we can save 8
>> bytes?
>
> Nope. The length in the key is the stripe length. The length in the
> stride is the stride length.
>
> Here's an example for why this is needed:
>
> wrote 32768/32768 bytes at offset 0
> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> wrote 131072/131072 bytes at offset 0
> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> wrote 8192/8192 bytes at offset 65536
> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>
> [snip]
>
> item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 32
> encoding: RAID0
> stripe 0 devid 1 physical XXXXXXXXX length 32768
> item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX
> itemsize 80
Maybe you want to put the whole RAID_STRIPE_KEY definition into the headers.
In fact my initial assumption of such case would be something like this:
item 0 key (X+0 RAID_STRIPE 32K)
stripe 0 devid 1 physical XXXXX len 32K
item 1 key (X+32K RAID_STRIPE 32K)
stripe 0 devid 1 physical XXXXX + 32K len 32K
item 2 key (X+64K RAID_STRIPE 64K)
stripe 0 devid 2 physical YYYYY len 64K
item 3 key (X+128K RAID_STRIPE 32K)
stripe 0 devid 1 physical XXXXX + 64K len 32K
...
AKA, each RAID_STRIPE_KEY would only contain a contiguous physical stripe.
And in the above case, item 0 and item 1 can be easily merged, and the
length can be removed.
And this explains why the lookup code is more complex than I initially
thought.
BTW, would the above layout make the code a little easier?
Or is there any special reason for the existing layout?
Thanks,
Qu
> encoding: RAID0
> stripe 0 devid 1 physical XXXXXXXXX length 32768
> stripe 1 devid 2 physical XXXXXXXXX length 65536
> stripe 2 devid 1 physical XXXXXXXXX length 32768
This current layout has another problem.
For RAID10 the interpretation of the RAID_STRIPE item can be very complex.
While
> item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 32
> encoding: RAID0
> stripe 0 devid 1 physical XXXXXXXXX length 8192
>
> Without the length in the stride, we don't know when to select the next
> stride in item 1 above.
* Re: [PATCH v9 01/11] btrfs: add raid stripe tree definitions
2023-09-15 10:33 ` Qu Wenruo
@ 2023-09-15 10:46 ` Johannes Thumshirn
2023-10-02 9:32 ` Johannes Thumshirn
1 sibling, 0 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2023-09-15 10:46 UTC (permalink / raw)
To: Qu Wenruo, Qu Wenruo, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Damien Le Moal,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 15.09.23 12:34, Qu Wenruo wrote:
>> [snip]
>>
>> item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 32
>> encoding: RAID0
>> stripe 0 devid 1 physical XXXXXXXXX length 32768
>> item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX
>> itemsize 80
> Maybe you want to put the whole RAID_STRIPE_KEY definition into the headers.
>
> In fact my initial assumption of such case would be something like this:
>
> item 0 key (X+0 RAID_STRIPE 32K)
> stripe 0 devid 1 physical XXXXX len 32K
> item 1 key (X+32K RAID_STRIPE 32K)
> stripe 0 devid 1 physical XXXXX + 32K len 32K
> item 2 key (X+64K RAID_STRIPE 64K)
> stripe 0 devid 2 physical YYYYY len 64K
> item 3 key (X+128K RAID_STRIPE 32K)
> stripe 0 devid 1 physical XXXXX + 64K len 32K
> ...
>
> AKA, each RAID_STRIPE_KEY would only contain a contiguous physical stripe.
> And in the above case, item 0 and item 1 can be easily merged, and the
> length can be removed.
>
> And this explains why the lookup code is more complex than I initially
> thought.
>
> BTW, would the above layout make the code a little easier?
> Or is there any special reason for the existing layout?
It would definitely make the code easier, at the cost of more items. But
of course the items would be smaller, as we can get rid of the stride length.
Let me think about it.
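As a rough back-of-the-envelope sketch of that trade-off: the dump shows itemsize 32 for one stride and 80 for three, i.e. an 8-byte header plus 24 bytes per stride. The numbers below additionally assume a 25-byte leaf item header per item and that the 8-byte header stays in the suggested layout; both are assumptions for illustration, not something decided in this thread, and the function names are made up.

```c
#include <assert.h>
#include <stddef.h>

enum {
	ITEM_HEADER = 25,	/* assumed per-item header size in a leaf */
	EXTENT_HEADER = 8,	/* itemsize 32 minus one 24-byte stride */
	STRIDE_WITH_LEN = 24,	/* devid + physical + length */
	STRIDE_NO_LEN = 16,	/* devid + physical */
};

/* Leaf bytes for one item holding @nr_strides strides (current layout). */
static size_t bytes_current(size_t nr_strides)
{
	assert(nr_strides > 0);
	return ITEM_HEADER + EXTENT_HEADER + nr_strides * STRIDE_WITH_LEN;
}

/* Leaf bytes for @nr_strides single-stride items (suggested layout). */
static size_t bytes_suggested(size_t nr_strides)
{
	assert(nr_strides > 0);
	return nr_strides * (ITEM_HEADER + EXTENT_HEADER + STRIDE_NO_LEN);
}
```

Under these assumptions the three-stride item 1 costs 105 bytes today versus 147 bytes when split into three items, while a single-stride item shrinks from 57 to 49 bytes.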
* Re: [PATCH v9 06/11] btrfs: implement RST version of scrub
2023-09-15 0:58 ` Qu Wenruo
@ 2023-09-15 14:11 ` David Sterba
0 siblings, 0 replies; 29+ messages in thread
From: David Sterba @ 2023-09-15 14:11 UTC (permalink / raw)
To: Qu Wenruo
Cc: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba,
Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel
On Fri, Sep 15, 2023 at 10:28:50AM +0930, Qu Wenruo wrote:
>
>
> On 2023/9/15 01:37, Johannes Thumshirn wrote:
> > A filesystem that uses the RAID stripe tree for logical to physical
> > address translation can't use the regular scrub path, that reads all
> > stripes and then checks if a sector is unused afterwards.
> >
> > When using the RAID stripe tree, this will result in lookup errors, as the
> > stripe tree doesn't know the requested logical addresses.
> >
> > Instead, look up stripes that are backed by the extent bitmap.
> >
> > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> > ---
> > fs/btrfs/bio.c | 2 ++
> > fs/btrfs/raid-stripe-tree.c | 8 ++++++-
> > fs/btrfs/scrub.c | 53 +++++++++++++++++++++++++++++++++++++++++++++
> > fs/btrfs/volumes.h | 1 +
> > 4 files changed, 63 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> > index ddbe6f8d4ea2..bdb6e3effdbb 100644
> > --- a/fs/btrfs/bio.c
> > +++ b/fs/btrfs/bio.c
> > @@ -663,6 +663,8 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
> > blk_status_t ret;
> > int error;
> >
> > + smap.is_scrub = !bbio->inode;
> > +
> > btrfs_bio_counter_inc_blocked(fs_info);
> > error = btrfs_map_block(fs_info, btrfs_op(bio), logical, &map_length,
> > &bioc, &smap, &mirror_num, 1);
> > diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> > index 697a6e1fd255..63bf62c33436 100644
> > --- a/fs/btrfs/raid-stripe-tree.c
> > +++ b/fs/btrfs/raid-stripe-tree.c
> > @@ -334,6 +334,11 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
> > if (!path)
> > return -ENOMEM;
> >
> > + if (stripe->is_scrub) {
> > + path->skip_locking = 1;
> > + path->search_commit_root = 1;
> > + }
> > +
> > ret = btrfs_search_slot(NULL, stripe_root, &stripe_key, path, 0, 0);
> > if (ret < 0)
> > goto free_path;
> > @@ -420,7 +425,8 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
> > out:
> > if (ret > 0)
> > ret = -ENOENT;
> > - if (ret && ret != -EIO) {
> > + if (ret && ret != -EIO && !stripe->is_scrub) {
> > +
>
> One extra newline.
There were way more stray newlines. You don't have to point them out
in reviews; I'll fix them once we have a version that is not going to
change much.
* Re: [PATCH v9 03/11] btrfs: add support for inserting raid stripe extents
2023-09-15 0:55 ` Qu Wenruo
@ 2023-09-19 12:13 ` Johannes Thumshirn
0 siblings, 0 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2023-09-19 12:13 UTC (permalink / raw)
To: Qu Wenruo, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 15.09.23 02:55, Qu Wenruo wrote:
>
> I found a corner case where this can be problematic.
>
> If we have a fs with RST root tree node/leaf corrupted, mounted with
> rescue=ibadroots, then fs_info->stripe_root would be NULL, and in the
> 5th patch inside set_io_stripe() we just fall back to regular non-RST path.
> This would bring us mostly incorrect data (and can be very problematic
> for nodatacsum files).
>
> Thus stripe_root itself is not a reliable way to determine if we're at
> RST routine, I'd say only super incompat flags is reliable.
Fixed.
>
> And fs_info->stripe_root should only be checked for functions that do
> RST tree operations, and return -EIO properly if it's not initialized.
>> +
>> + if (type != BTRFS_BLOCK_GROUP_DATA)
>> + return false;
>> +
>> + if (profile & BTRFS_BLOCK_GROUP_RAID1_MASK)
>> + return true;
>
> Just a stupid question, RAID0 DATA doesn't need RST purely because they are
> the same as SINGLE, thus we only update the file items to the real
> written logical address, and no need for the extra mapping?
Yes but there can still be discrepancies between the assumed physical
address and the real one due to ZONE_APPEND operations. RST backed file
systems don't go the "normal" zoned btrfs logical rewrite path but have
their own.
Also I prefer to keep the stripes together.
> Thus only profiles with duplication relies on RST, right?
> If so, then I guess DUP should also be covered by RST.
>
Later in this series, DUP, RAID0 and RAID10 will get added as well.
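For illustration, the two review points above (decide by the incompat flag rather than by whether stripe_root is set, and return -EIO from tree operations when the root is missing) could be sketched as below. This is a userspace sketch and not the actual fix: `struct fs_sketch`, the `SK_*` flag values and both function names are hypothetical, and only the RAID1 case from the quoted hunk is covered.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag values only; not the on-disk constants. */
#define SK_BG_DATA	(1ULL << 0)
#define SK_BG_RAID1	(1ULL << 1)

/* Hypothetical, pared-down stand-in for struct btrfs_fs_info. */
struct fs_sketch {
	bool incompat_rst;	/* super block advertises the RST feature */
	void *stripe_root;	/* NULL if the tree could not be read */
};

/* Decide from the feature flag, never from stripe_root being non-NULL. */
static bool need_stripe_tree_update_sketch(const struct fs_sketch *fs,
					   uint64_t type, uint64_t profile)
{
	assert(fs);
	if (!fs->incompat_rst)
		return false;
	if (type != SK_BG_DATA)
		return false;
	return (profile & SK_BG_RAID1) != 0;
}

/* Tree operations fail with -EIO when the root is missing. */
static int stripe_tree_op_sketch(const struct fs_sketch *fs)
{
	assert(fs);
	if (fs->incompat_rst && !fs->stripe_root)
		return -EIO;
	return 0;
}
```

With a filesystem mounted via rescue=ibadroots and a corrupted RST root, the first helper still reports that the RST path is required, and the second one fails the operation instead of silently falling back to the non-RST mapping.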
* Re: [PATCH v9 00/11] btrfs: introduce RAID stripe tree
2023-09-14 18:25 ` [PATCH v9 00/11] btrfs: introduce RAID stripe tree David Sterba
@ 2023-09-20 16:23 ` David Sterba
0 siblings, 0 replies; 29+ messages in thread
From: David Sterba @ 2023-09-20 16:23 UTC (permalink / raw)
To: David Sterba
Cc: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba,
Christoph Hellwig, Naohiro Aota, Qu Wenruo, Damien Le Moal,
linux-btrfs, linux-kernel
On Thu, Sep 14, 2023 at 08:25:34PM +0200, David Sterba wrote:
> On Thu, Sep 14, 2023 at 09:06:55AM -0700, Johannes Thumshirn wrote:
> > Updates of the raid-stripe-tree are done at ordered extent write time to save
> > on bandwidth while for reading we do the stripe-tree lookup on bio mapping
> > time, i.e. when the logical to physical translation happens for regular btrfs
> > RAID as well.
> >
> > The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
> > its contents are the respective physical device id and position.
> >
> > For an example 1M write (split into 126K segments due to zone-append)
> > rapido2:/home/johannes/src/fstests# xfs_io -fdc "pwrite -b 1M 0 1M" -c fsync /mnt/test/test
> > wrote 1048576/1048576 bytes at offset 0
> > 1 MiB, 1 ops; 0.0065 sec (151.538 MiB/sec and 151.5381 ops/sec)
> >
> > The tree will look as follows (both 128k buffered writes to a ZNS drive):
> >
> > RAID0 case:
> > bash-5.2# btrfs inspect-internal dump-tree -t raid_stripe /dev/nvme0n1
> > btrfs-progs v6.3
> > raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
> > leaf 805535744 items 1 free space 16218 generation 8 owner RAID_STRIPE_TREE
> > leaf 805535744 flags 0x1(WRITTEN) backref revision 1
> > checksum stored 2d2d2262
> > checksum calced 2d2d2262
> > fs uuid ab05cfc6-9859-404e-970d-3999b1cb5438
> > chunk uuid c9470ba2-49ac-4d46-8856-438a18e6bd23
> > item 0 key (1073741824 RAID_STRIPE_KEY 131072) itemoff 16243 itemsize 56
> > encoding: RAID0
> > stripe 0 devid 1 offset 805306368 length 131072
> > stripe 1 devid 2 offset 536870912 length 131072
> > total bytes 42949672960
> > bytes used 294912
> > uuid ab05cfc6-9859-404e-970d-3999b1cb5438
> >
> > RAID1 case:
> > bash-5.2# btrfs inspect-internal dump-tree -t raid_stripe /dev/nvme0n1
> > btrfs-progs v6.3
> > raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
> > leaf 805535744 items 1 free space 16218 generation 8 owner RAID_STRIPE_TREE
> > leaf 805535744 flags 0x1(WRITTEN) backref revision 1
> > checksum stored 56199539
> > checksum calced 56199539
> > fs uuid 9e693a37-fbd1-4891-aed2-e7fe64605045
> > chunk uuid 691874fc-1b9c-469b-bd7f-05e0e6ba88c4
> > item 0 key (939524096 RAID_STRIPE_KEY 131072) itemoff 16243 itemsize 56
> > encoding: RAID1
> > stripe 0 devid 1 offset 939524096 length 65536
> > stripe 1 devid 2 offset 536870912 length 65536
> > total bytes 42949672960
> > bytes used 294912
> > uuid 9e693a37-fbd1-4891-aed2-e7fe64605045
> >
> > A design document can be found here:
> > https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true
>
> Please also turn it to developer documentation file (in
> btrfs-progs/Documentation/dev), it can follow the same structure.
>
> >
> > The user-space part of this series can be found here:
> > https://lore.kernel.org/linux-btrfs/20230215143109.2721722-1-johannes.thumshirn@wdc.com
> >
> > Changes to v8:
> > - Changed tracepoints according to David's comments
> > - Mark on-disk structures as packed
> > - Got rid of __DECLARE_FLEX_ARRAY
> > - Rebase onto misc-next
> > - Split out helpers for new btrfs_load_block_group_zone_info RAID cases
> > - Constify declarations where possible
> > - Initialise variables before use
> > - Lower scope of variables
> > - Remove btrfs_stripe_root() helper
> > - Pick different BTRFS_RAID_STRIPE_KEY constant
> > - Reorder on-disk encoding types to match the raid_index
> > - And possibly more, please git range-diff the versions
> > - Link to v8: https://lore.kernel.org/r/20230911-raid-stripe-tree-v8-0-647676fa852c@wdc.com
>
> v9 will be added as topic branch to for-next, I did several style
> changes so please send any updates as incrementals if needed.
Moved to misc-next. I'll do a minor release of btrfs-progs soon so we
get the tool support for testing.
* Re: [PATCH v9 01/11] btrfs: add raid stripe tree definitions
2023-09-15 10:33 ` Qu Wenruo
2023-09-15 10:46 ` Johannes Thumshirn
@ 2023-10-02 9:32 ` Johannes Thumshirn
1 sibling, 0 replies; 29+ messages in thread
From: Johannes Thumshirn @ 2023-10-02 9:32 UTC (permalink / raw)
To: Qu Wenruo, Qu Wenruo, Chris Mason, Josef Bacik, David Sterba
Cc: Christoph Hellwig, Naohiro Aota, Damien Le Moal,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On 15.09.23 12:34, Qu Wenruo wrote:
>
>
>
> On 2023/9/15 19:25, Johannes Thumshirn wrote:
>> On 15.09.23 02:27, Qu Wenruo wrote:
>>>>> /*
>>>>> * Records the overall state of the qgroups.
>>>>> * There's only one instance of this key present,
>>>>> @@ -719,6 +724,32 @@ struct btrfs_free_space_header {
>>>>> __le64 num_bitmaps;
>>>>> } __attribute__ ((__packed__));
>>>>> +struct btrfs_raid_stride {
>>>>> + /* The btrfs device-id this raid extent lives on */
>>>>> + __le64 devid;
>>>>> + /* The physical location on disk */
>>>>> + __le64 physical;
>>>>> + /* The length of stride on this disk */
>>>>> + __le64 length;
>>>
>>> Forgot to mention, for btrfs_stripe_extent structure, its key is
>>> (PHYSICAL, RAID_STRIPE_KEY, LENGTH) right?
>>>
>>> So is the length in the btrfs_raid_stride duplicated and we can save 8
>>> bytes?
>>
>> Nope. The length in the key is the stripe length. The length in the
>> stride is the stride length.
>>
>> Here's an example for why this is needed:
>>
>> wrote 32768/32768 bytes at offset 0
>> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>> wrote 131072/131072 bytes at offset 0
>> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>> wrote 8192/8192 bytes at offset 65536
>> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>
>> [snip]
>>
>> item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 32
>> encoding: RAID0
>> stripe 0 devid 1 physical XXXXXXXXX length 32768
>> item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX
>> itemsize 80
>
> Maybe you want to put the whole RAID_STRIPE_KEY definition into the headers.
>
> In fact my initial assumption of such case would be something like this:
>
> item 0 key (X+0 RAID_STRIPE 32K)
> stripe 0 devid 1 physical XXXXX len 32K
> item 1 key (X+32K RAID_STRIPE 32K)
> stripe 0 devid 1 physical XXXXX + 32K len 32K
> item 2 key (X+64K RAID_STRIPE 64K)
> stripe 0 devid 2 physical YYYYY len 64K
> item 3 key (X+128K RAID_STRIPE 32K)
> stripe 0 devid 1 physical XXXXX + 64K len 32K
> ...
>
> AKA, each RAID_STRIPE_KEY would only contain a contiguous physical stripe.
> And in the above case, item 0 and item 1 can be easily merged, and the
> length can be removed.
>
> And this explains why the lookup code is more complex than I initially
> thought.
>
> BTW, would the above layout make the code a little easier?
> Or is there any special reason for the existing layout?
>
> Thanks,
> Qu
>
>
>> encoding: RAID0
>> stripe 0 devid 1 physical XXXXXXXXX length 32768
>> stripe 1 devid 2 physical XXXXXXXXX length 65536
>> stripe 2 devid 1 physical XXXXXXXXX length 32768
>
> This current layout has another problem.
> For RAID10 the interpretation of the RAID_STRIPE item can be very complex.
> While
>
>> item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 32
>> encoding: RAID0
>> stripe 0 devid 1 physical XXXXXXXXX length 8192
>>
>> Without the length in the stride, we don't know when to select the next
>> stride in item 1 above.
>
JFYI preliminary tests for your suggestion look reasonably good. I'll
give it some more testing and code cleanup but it actually seems
sensible to do.
end of thread, other threads:[~2023-10-02 9:32 UTC | newest]
Thread overview: 29+ messages
2023-09-14 16:06 [PATCH v9 00/11] btrfs: introduce RAID stripe tree Johannes Thumshirn
2023-09-14 16:06 ` [PATCH v9 01/11] btrfs: add raid stripe tree definitions Johannes Thumshirn
2023-09-15 0:22 ` Qu Wenruo
2023-09-15 0:26 ` Qu Wenruo
2023-09-15 9:55 ` Johannes Thumshirn
2023-09-15 10:33 ` Qu Wenruo
2023-09-15 10:46 ` Johannes Thumshirn
2023-10-02 9:32 ` Johannes Thumshirn
2023-09-14 16:06 ` [PATCH v9 02/11] btrfs: read raid-stripe-tree from disk Johannes Thumshirn
2023-09-14 16:06 ` [PATCH v9 03/11] btrfs: add support for inserting raid stripe extents Johannes Thumshirn
2023-09-14 18:07 ` David Sterba
2023-09-15 10:03 ` Geert Uytterhoeven
2023-09-14 18:10 ` David Sterba
2023-09-15 0:55 ` Qu Wenruo
2023-09-19 12:13 ` Johannes Thumshirn
2023-09-14 16:06 ` [PATCH v9 04/11] btrfs: delete stripe extent on extent deletion Johannes Thumshirn
2023-09-14 16:07 ` [PATCH v9 05/11] btrfs: lookup physical address from stripe extent Johannes Thumshirn
2023-09-14 17:57 ` David Sterba
2023-09-14 16:07 ` [PATCH v9 06/11] btrfs: implement RST version of scrub Johannes Thumshirn
2023-09-15 0:58 ` Qu Wenruo
2023-09-15 14:11 ` David Sterba
2023-09-14 16:07 ` [PATCH v9 07/11] btrfs: zoned: allow zoned RAID Johannes Thumshirn
2023-09-14 17:59 ` David Sterba
2023-09-14 16:07 ` [PATCH v9 08/11] btrfs: add raid stripe tree pretty printer Johannes Thumshirn
2023-09-14 16:07 ` [PATCH v9 09/11] btrfs: announce presence of raid-stripe-tree in sysfs Johannes Thumshirn
2023-09-14 16:07 ` [PATCH v9 10/11] btrfs: add trace events for RST Johannes Thumshirn
2023-09-14 16:07 ` [PATCH v9 11/11] btrfs: add raid-stripe-tree to features enabled with debug Johannes Thumshirn
2023-09-14 18:25 ` [PATCH v9 00/11] btrfs: introduce RAID stripe tree David Sterba
2023-09-20 16:23 ` David Sterba