* [PATCH] btrfs: scrub: skip PREALLOC extents on RAID stripe-tree
@ 2024-09-12 14:33 Johannes Thumshirn
2024-09-12 21:32 ` Qu Wenruo
0 siblings, 1 reply; 12+ messages in thread
From: Johannes Thumshirn @ 2024-09-12 14:33 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba,
open list:BTRFS FILE SYSTEM, open list
Cc: Johannes Thumshirn, Qu Wenru
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
When scrubbing a RAID stripe-tree based filesystem with preallocated
extents, the btrfs_map_block() called by
scrub_submit_extent_sector_read() will return ENOENT, because there is
no RAID stripe-tree entry for preallocated extents. This then causes
the sector to be marked as a sector with an I/O error.
To prevent this false alert don't mark secotors for that
btrfs_map_block() returned an ENOENT as I/O errors but skip them.
This results for example in errors in fstests' btrfs/060 .. btrfs/074
which all perform fsstress and scrub operations. Whit this fix, these
errors are gone and the tests pass again.
Cc: Qu Wenru <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/scrub.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 3a3427428074..b195c41676e3 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1657,7 +1657,7 @@ static u32 stripe_length(const struct scrub_stripe *stripe)
}
static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
- struct scrub_stripe *stripe)
+ struct scrub_stripe *stripe)
{
struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
struct btrfs_bio *bbio = NULL;
@@ -1703,10 +1703,21 @@ static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
err = btrfs_map_block(fs_info, BTRFS_MAP_READ, logical,
&stripe_len, &bioc, &io_stripe, &mirror);
btrfs_put_bioc(bioc);
- if (err < 0) {
+ if (err < 0 && err != -ENOENT) {
set_bit(i, &stripe->io_error_bitmap);
set_bit(i, &stripe->error_bitmap);
continue;
+ } else if (err == -ENOENT) {
+ /*
+ * btrfs_map_block() returns -ENOENT if it can't
+ * find the logical address in the RAID stripe
+ * tree. This can happens on PREALLOC extents.
+ * As we know the extent tree has an extent
+ * recorded there, we can be sure this is a
+ * PREALLOC extent, so skip this sector and
+ * continue to the next.
+ */
+ continue;
}
bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
--
2.43.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH] btrfs: scrub: skip PREALLOC extents on RAID stripe-tree
2024-09-12 14:33 [PATCH] btrfs: scrub: skip PREALLOC extents on RAID stripe-tree Johannes Thumshirn
@ 2024-09-12 21:32 ` Qu Wenruo
2024-09-13 5:42 ` Johannes Thumshirn
0 siblings, 1 reply; 12+ messages in thread
From: Qu Wenruo @ 2024-09-12 21:32 UTC (permalink / raw)
To: Johannes Thumshirn, Chris Mason, Josef Bacik, David Sterba,
open list:BTRFS FILE SYSTEM, open list
Cc: Johannes Thumshirn, Qu Wenru
在 2024/9/13 00:03, Johannes Thumshirn 写道:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> When scrubbing a RAID stripe-tree based filesystem with preallocated
> extents, the btrfs_map_block() called by
> scrub_submit_extent_sector_read() will return ENOENT, because there is
> no RAID stripe-tree entry for preallocated extents. This then causes
> the sector to be marked as a sector with an I/O error.
>
> To prevent this false alert don't mark secotors for that
> btrfs_map_block() returned an ENOENT as I/O errors but skip them.
>
> This results for example in errors in fstests' btrfs/060 .. btrfs/074
> which all perform fsstress and scrub operations. Whit this fix, these
> errors are gone and the tests pass again.
>
> Cc: Qu Wenru <wqu@suse.com>
My concern is, ENOENT can be some real problems other than PREALLOC.
I'd prefer this to be the last-resort method.
Would it be possible to create an RST entry for preallocated operations
manually? E.g. without creating a dummy OE, but just insert the needed
RST entries into RST tree at fallocate time?
Thanks,
Qu
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/scrub.c | 15 +++++++++++++--
> 1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 3a3427428074..b195c41676e3 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -1657,7 +1657,7 @@ static u32 stripe_length(const struct scrub_stripe *stripe)
> }
>
> static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
> - struct scrub_stripe *stripe)
> + struct scrub_stripe *stripe)
> {
> struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
> struct btrfs_bio *bbio = NULL;
> @@ -1703,10 +1703,21 @@ static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
> err = btrfs_map_block(fs_info, BTRFS_MAP_READ, logical,
> &stripe_len, &bioc, &io_stripe, &mirror);
> btrfs_put_bioc(bioc);
> - if (err < 0) {
> + if (err < 0 && err != -ENOENT) {
> set_bit(i, &stripe->io_error_bitmap);
> set_bit(i, &stripe->error_bitmap);
> continue;
> + } else if (err == -ENOENT) {
> + /*
> + * btrfs_map_block() returns -ENOENT if it can't
> + * find the logical address in the RAID stripe
> + * tree. This can happens on PREALLOC extents.
> + * As we know the extent tree has an extent
> + * recorded there, we can be sure this is a
> + * PREALLOC extent, so skip this sector and
> + * continue to the next.
> + */
> + continue;
> }
>
> bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] btrfs: scrub: skip PREALLOC extents on RAID stripe-tree
2024-09-12 21:32 ` Qu Wenruo
@ 2024-09-13 5:42 ` Johannes Thumshirn
2024-09-13 5:47 ` Qu Wenruo
0 siblings, 1 reply; 12+ messages in thread
From: Johannes Thumshirn @ 2024-09-13 5:42 UTC (permalink / raw)
To: Qu Wenruo, Johannes Thumshirn, Chris Mason, Josef Bacik,
David Sterba, open list:BTRFS FILE SYSTEM, open list
Cc: WenRuo Qu
On 12.09.24 23:32, Qu Wenruo wrote:
>
>
> 在 2024/9/13 00:03, Johannes Thumshirn 写道:
>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> When scrubbing a RAID stripe-tree based filesystem with preallocated
>> extents, the btrfs_map_block() called by
>> scrub_submit_extent_sector_read() will return ENOENT, because there is
>> no RAID stripe-tree entry for preallocated extents. This then causes
>> the sector to be marked as a sector with an I/O error.
>>
>> To prevent this false alert don't mark secotors for that
>> btrfs_map_block() returned an ENOENT as I/O errors but skip them.
>>
>> This results for example in errors in fstests' btrfs/060 .. btrfs/074
>> which all perform fsstress and scrub operations. Whit this fix, these
>> errors are gone and the tests pass again.
>>
>> Cc: Qu Wenru <wqu@suse.com>
>
> My concern is, ENOENT can be some real problems other than PREALLOC.
> I'd prefer this to be the last-resort method.
Hm but what else could create an entry in the extent tree without having
it in the stripe tree? I can't really think of a situation creating this
layout.
> Would it be possible to create an RST entry for preallocated operations
> manually? E.g. without creating a dummy OE, but just insert the needed
> RST entries into RST tree at fallocate time?
Let me give it a try. But I'm a bit less happy to do so, as RST already
increases the write amplification.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] btrfs: scrub: skip PREALLOC extents on RAID stripe-tree
2024-09-13 5:42 ` Johannes Thumshirn
@ 2024-09-13 5:47 ` Qu Wenruo
2024-09-18 14:08 ` [RFC 0/2] Add dummy RAID stripe tree entries for PREALLOC data Johannes Thumshirn
0 siblings, 1 reply; 12+ messages in thread
From: Qu Wenruo @ 2024-09-13 5:47 UTC (permalink / raw)
To: Johannes Thumshirn, Johannes Thumshirn, Chris Mason, Josef Bacik,
David Sterba, open list:BTRFS FILE SYSTEM, open list
Cc: WenRuo Qu
在 2024/9/13 15:12, Johannes Thumshirn 写道:
> On 12.09.24 23:32, Qu Wenruo wrote:
>>
>>
>> 在 2024/9/13 00:03, Johannes Thumshirn 写道:
>>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>
>>> When scrubbing a RAID stripe-tree based filesystem with preallocated
>>> extents, the btrfs_map_block() called by
>>> scrub_submit_extent_sector_read() will return ENOENT, because there is
>>> no RAID stripe-tree entry for preallocated extents. This then causes
>>> the sector to be marked as a sector with an I/O error.
>>>
>>> To prevent this false alert don't mark secotors for that
>>> btrfs_map_block() returned an ENOENT as I/O errors but skip them.
>>>
>>> This results for example in errors in fstests' btrfs/060 .. btrfs/074
>>> which all perform fsstress and scrub operations. Whit this fix, these
>>> errors are gone and the tests pass again.
>>>
>>> Cc: Qu Wenru <wqu@suse.com>
>>
>> My concern is, ENOENT can be some real problems other than PREALLOC.
>> I'd prefer this to be the last-resort method.
>
> Hm but what else could create an entry in the extent tree without having
> it in the stripe tree? I can't really think of a situation creating this
> layout.
My concern is that, if by some other bug that certain writes didn't
create needed RST entry, we will always treat them as preallocated
during scrub.
Thus it may be better to have a way to distinguish a real missing entry
and preallocated extents.
>
>
>> Would it be possible to create an RST entry for preallocated operations
>> manually? E.g. without creating a dummy OE, but just insert the needed
>> RST entries into RST tree at fallocate time?
>
> Let me give it a try. But I'm a bit less happy to do so, as RST already
> increases the write amplification.
Well, write amplification is always a big problem for btrfs...
Thanks,
Qu
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RFC 0/2] Add dummy RAID stripe tree entries for PREALLOC data
2024-09-13 5:47 ` Qu Wenruo
@ 2024-09-18 14:08 ` Johannes Thumshirn
2024-09-18 14:08 ` [RFC 1/2] btrfs: introduce dummy RAID stripes for preallocated data Johannes Thumshirn
` (4 more replies)
0 siblings, 5 replies; 12+ messages in thread
From: Johannes Thumshirn @ 2024-09-18 14:08 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs, Johannes Thumshirn
This is an RFC implementation of Qu's request to be able to
distinguish preallocated extents in the stripe tree for scrub.
It's not 100% working yet but only showing the basic "how it's going to
look like".
I'm not really sure this is a better solution than returning ENOENT
and ignoring it in scrub.
A third possibility would be to do a full backref walk on
btrfs_map_block() error and then check if it's a preallocated extent.
Johannes Thumshirn (2):
btrfs: introduce dummy RAID stripes for preallocated data
btrfs: insert dummy RAID stripe extents for preallocated data
fs/btrfs/inode.c | 47 +++++++++++++++++++++++++++++++++
fs/btrfs/raid-stripe-tree.c | 47 +++++++++++++++++++++++++++++++++
fs/btrfs/raid-stripe-tree.h | 2 ++
include/uapi/linux/btrfs_tree.h | 1 +
4 files changed, 97 insertions(+)
--
2.43.0
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RFC 1/2] btrfs: introduce dummy RAID stripes for preallocated data
2024-09-18 14:08 ` [RFC 0/2] Add dummy RAID stripe tree entries for PREALLOC data Johannes Thumshirn
@ 2024-09-18 14:08 ` Johannes Thumshirn
2024-09-18 14:08 ` [RFC 2/2] btrfs: insert dummy RAID stripe extents " Johannes Thumshirn
` (3 subsequent siblings)
4 siblings, 0 replies; 12+ messages in thread
From: Johannes Thumshirn @ 2024-09-18 14:08 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Introduce BTRFS_RAID_STRIPE_DUMMY_KEY which describes a range in the
RAID stripe-tree that is backed by meta-data but no data. For instance
for preallocated extents.
On read, we can simply ignore the contents of the stripe_extent as
we're not interested in the physical address and device id of these
stripe_extents.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/raid-stripe-tree.c | 43 +++++++++++++++++++++++++++++++++
fs/btrfs/raid-stripe-tree.h | 2 ++
include/uapi/linux/btrfs_tree.h | 1 +
3 files changed, 46 insertions(+)
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 075fecd08d87..bbe0689b1d17 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -158,6 +158,40 @@ static int update_raid_extent_item(struct btrfs_trans_handle *trans,
return ret;
}
+int btrfs_insert_dummy_raid_extent(struct btrfs_trans_handle *trans,
+ u64 logical, u64 length)
+{
+ struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct btrfs_key stripe_key;
+ struct btrfs_root *stripe_root = fs_info->stripe_root;
+ struct btrfs_path *path;
+ int ret;
+
+ path = btrfs_alloc_path();
+ if (!path)
+ return -ENOMEM;
+
+ stripe_key.objectid = logical;
+ stripe_key.type = BTRFS_RAID_STRIPE_DUMMY_KEY;
+ stripe_key.offset = length;
+
+ ret = btrfs_insert_empty_item(trans, stripe_root, path, &stripe_key, 0);
+ if (ret == -EEXIST) {
+ struct extent_buffer *leaf = path->nodes[0];
+ int slot = path->slots[0];
+ struct btrfs_key found_key;
+
+ btrfs_item_key_to_cpu(leaf, &found_key, slot);
+ found_key.objectid = logical;
+ found_key.offset = length;
+ btrfs_mark_buffer_dirty(trans, leaf);
+ ret = 0;
+ }
+ btrfs_free_path(path);
+
+ return ret;
+}
+
static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
struct btrfs_io_context *bioc)
{
@@ -305,6 +339,15 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
if (end > found_end)
*length -= end - found_end;
+ /*
+ * If we have a BTRFS_RAID_STRIPE_DUMMY_KEY it means we've hit
+ * an entry for a preallocated extent. Noone will ever check
+ * the physical, only logical and length, so we're good to
+ * bail out from here.
+ */
+ if (found_key.type == BTRFS_RAID_STRIPE_DUMMY_KEY)
+ goto out;
+
num_stripes = btrfs_num_raid_stripes(btrfs_item_size(leaf, slot));
stripe_extent = btrfs_item_ptr(leaf, slot, struct btrfs_stripe_extent);
diff --git a/fs/btrfs/raid-stripe-tree.h b/fs/btrfs/raid-stripe-tree.h
index 1ac1c21aac2f..dbe52789f201 100644
--- a/fs/btrfs/raid-stripe-tree.h
+++ b/fs/btrfs/raid-stripe-tree.h
@@ -27,6 +27,8 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
u32 stripe_index, struct btrfs_io_stripe *stripe);
int btrfs_insert_raid_extent(struct btrfs_trans_handle *trans,
struct btrfs_ordered_extent *ordered_extent);
+int btrfs_insert_dummy_raid_extent(struct btrfs_trans_handle *trans,
+ u64 logical, u64 length);
static inline bool btrfs_need_stripe_tree_update(struct btrfs_fs_info *fs_info,
u64 map_type)
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index fc29d273845d..76b18013b394 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -281,6 +281,7 @@
#define BTRFS_CHUNK_ITEM_KEY 228
#define BTRFS_RAID_STRIPE_KEY 230
+#define BTRFS_RAID_STRIPE_DUMMY_KEY 231
/*
* Records the overall state of the qgroups.
--
2.43.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [RFC 2/2] btrfs: insert dummy RAID stripe extents for preallocated data
2024-09-18 14:08 ` [RFC 0/2] Add dummy RAID stripe tree entries for PREALLOC data Johannes Thumshirn
2024-09-18 14:08 ` [RFC 1/2] btrfs: introduce dummy RAID stripes for preallocated data Johannes Thumshirn
@ 2024-09-18 14:08 ` Johannes Thumshirn
2024-09-18 23:45 ` [RFC 0/2] Add dummy RAID stripe tree entries for PREALLOC data Qu Wenruo
` (2 subsequent siblings)
4 siblings, 0 replies; 12+ messages in thread
From: Johannes Thumshirn @ 2024-09-18 14:08 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Preallocated extents are not backed by a RAID stripe-tree entry (in
case the filesystem is using the RAID stripe-tree), because there is
no real logical to physical mapping needed for them.
But for instance scrub is performing RAID stripe-tree lookups for all
extents in the extent-tree, even for preallocated ones, so
the stripe-tree lookup code will return -ENOENT for them. This is
causing scrub to mark these extents as I/O errors.
To solve this issue, add a dummy RAID stripe-tree entry for these
extents, so the block mapping performed for scrub operations doesn't fail.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/inode.c | 47 +++++++++++++++++++++++++++++++++++++
fs/btrfs/raid-stripe-tree.c | 4 ++++
2 files changed, 51 insertions(+)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index edac499fd83d..a8e119809670 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -71,6 +71,7 @@
#include "backref.h"
#include "raid-stripe-tree.h"
#include "fiemap.h"
+#include "volumes.h"
struct btrfs_iget_args {
u64 ino;
@@ -8679,6 +8680,44 @@ static int btrfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
return err;
}
+static int insert_prealloc_rst_entry(struct btrfs_fs_info *fs_info,
+ struct btrfs_trans_handle *trans_in,
+ u64 start, u64 len)
+{
+ struct btrfs_trans_handle *trans;
+ struct btrfs_chunk_map *map;
+ u64 map_type;
+ int ret;
+
+ if (!btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE))
+ return 0;
+
+ if (trans_in)
+ trans = trans_in;
+ else
+ trans = btrfs_join_transaction(fs_info->stripe_root);
+
+ map = btrfs_find_chunk_map(fs_info, start, len);
+ if (!map)
+ return -ENOENT;
+
+ map_type = map->type;
+ btrfs_free_chunk_map(map);
+
+ if (!btrfs_need_stripe_tree_update(fs_info, map_type))
+ return 0;
+ ret = btrfs_insert_dummy_raid_extent(trans, start, len);
+ if (ret) {
+ btrfs_abort_transaction(trans, ret);
+ return ret;
+ }
+
+ if (trans != trans_in)
+ btrfs_end_transaction(trans);
+
+ return 0;
+}
+
static struct btrfs_trans_handle *insert_prealloc_file_extent(
struct btrfs_trans_handle *trans_in,
struct btrfs_inode *inode,
@@ -8817,6 +8856,14 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
break;
}
+ ret = insert_prealloc_rst_entry(fs_info, trans, ins.objectid,
+ cur_offset);
+ if (ret) {
+ btrfs_free_reserved_extent(fs_info, ins.objectid,
+ ins.offset, 0);
+ break;
+ }
+
em = alloc_extent_map();
if (!em) {
btrfs_drop_extent_map_range(BTRFS_I(inode), cur_offset,
diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index bbe0689b1d17..f559ea14976f 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -61,6 +61,9 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
trace_btrfs_raid_extent_delete(fs_info, start, end,
found_start, found_end);
+ if (key.type == BTRFS_RAID_STRIPE_DUMMY_KEY)
+ goto delete;
+
if (found_start < start) {
struct btrfs_key prev;
u64 diff = start - found_start;
@@ -112,6 +115,7 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
break;
}
+delete:
ret = btrfs_del_item(trans, stripe_root, path);
if (ret)
break;
--
2.43.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [RFC 0/2] Add dummy RAID stripe tree entries for PREALLOC data
2024-09-18 14:08 ` [RFC 0/2] Add dummy RAID stripe tree entries for PREALLOC data Johannes Thumshirn
2024-09-18 14:08 ` [RFC 1/2] btrfs: introduce dummy RAID stripes for preallocated data Johannes Thumshirn
2024-09-18 14:08 ` [RFC 2/2] btrfs: insert dummy RAID stripe extents " Johannes Thumshirn
@ 2024-09-18 23:45 ` Qu Wenruo
2024-09-19 15:42 ` Johannes Thumshirn
2024-09-19 16:57 ` Gerhard Wiesinger
2024-09-23 15:27 ` Josef Bacik
4 siblings, 1 reply; 12+ messages in thread
From: Qu Wenruo @ 2024-09-18 23:45 UTC (permalink / raw)
To: Johannes Thumshirn, Qu Wenruo; +Cc: linux-btrfs
在 2024/9/18 23:38, Johannes Thumshirn 写道:
> This is an RFC implementation of Qu's request to be able to
> distinguish preallocated extents in the stripe tree for scrub.
>
> It's not 100% working yet but only showing the basic "how it's going to
> look like".
>
> I'm not really sure this is a better solution than returning ENOENT
> and ignoring it in scrub.
If RST without zoned supports preallocation and NOCOW, then I think we
should not just insert a dummy, but a real RST entries.
So that NOCOW/preallocated writes can just re-use the existing RST
entry, without any new RST updates.
At least logically it makes more sense.
At least a quick glance into the handling shows, NOCOW writes doesn't
trigger any RST updates, so I guess it should also apply to PREALLOCATED
writes too.
Or did I miss something else?
Thanks,
Qu
>
> A third possibility would be to do a full backref walk on
> btrfs_map_block() error and then check if it's a preallocated extent.
>
> Johannes Thumshirn (2):
> btrfs: introduce dummy RAID stripes for preallocated data
> btrfs: insert dummy RAID stripe extents for preallocated data
>
> fs/btrfs/inode.c | 47 +++++++++++++++++++++++++++++++++
> fs/btrfs/raid-stripe-tree.c | 47 +++++++++++++++++++++++++++++++++
> fs/btrfs/raid-stripe-tree.h | 2 ++
> include/uapi/linux/btrfs_tree.h | 1 +
> 4 files changed, 97 insertions(+)
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC 0/2] Add dummy RAID stripe tree entries for PREALLOC data
2024-09-18 23:45 ` [RFC 0/2] Add dummy RAID stripe tree entries for PREALLOC data Qu Wenruo
@ 2024-09-19 15:42 ` Johannes Thumshirn
0 siblings, 0 replies; 12+ messages in thread
From: Johannes Thumshirn @ 2024-09-19 15:42 UTC (permalink / raw)
To: Qu Wenruo, Johannes Thumshirn, WenRuo Qu; +Cc: linux-btrfs@vger.kernel.org
On 19.09.24 01:46, Qu Wenruo wrote:
> 在 2024/9/18 23:38, Johannes Thumshirn 写道:
>> This is an RFC implementation of Qu's request to be able to
>> distinguish preallocated extents in the stripe tree for scrub.
>>
>> It's not 100% working yet but only showing the basic "how it's going to
>> look like".
>>
>> I'm not really sure this is a better solution than returning ENOENT
>> and ignoring it in scrub.
>
> If RST without zoned supports preallocation and NOCOW, then I think we
> should not just insert a dummy, but a real RST entries.
>
> So that NOCOW/preallocated writes can just re-use the existing RST
> entry, without any new RST updates.
>
> At least logically it makes more sense.
>
> At least a quick glance into the handling shows, NOCOW writes doesn't
> trigger any RST updates, so I guess it should also apply to PREALLOCATED
> writes too.
Yup, got a fix for NOCOW. That again was a stupid mistake because I only
had zoned in mind.
For prealloc I don't this I should add RST entries, because there's
nothing on disk for these ranges. It's pre-allocated after all.
Once we write into the range, RST entries will be created:
virtme-scsi:/mnt # xfs_io -fc "falloc 0 1M" test
virtme-scsi:/mnt # btrfs ins dump-tree -t extent /dev/sda | grep -A 1
EXTENT_ITEM
item 13 key (298844160 EXTENT_ITEM 1048576) itemoff 15828
itemsize 53
refs 1 gen 8 flags DATA
virtme-scsi:/mnt # btrfs ins dump-tree -t fs /dev/sda | grep -A 3
EXTENT_DATA
item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
generation 8 type 2 (prealloc)
prealloc data disk byte 298844160 nr 1048576
prealloc data offset 0 nr 1048576
virtme-scsi:/mnt # btrfs ins dump-tree -t raid-stripe /dev/sda
btrfs-progs v6.9
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 5259264 items 0 free space 16283 generation 3 owner RAID_STRIPE_TREE
leaf 5259264 flags 0x1(WRITTEN) backref revision 1
checksum stored f54ade94
checksum calced f54ade94
fs uuid bebf1755-9379-4ac8-a623-ad0dc52641cf
chunk uuid 463f7b1d-c0b8-4373-8334-52f5bf83475e
total bytes 21474836480
bytes used 1212416
uuid bebf1755-9379-4ac8-a623-ad0dc52641cf
virtme-scsi:/mnt # xfs_io -fc "pwrite 64k 64k" test
wrote 65536/65536 bytes at offset 65536
64 KiB, 16 ops; 0.0003 sec (183.284 MiB/sec and 46920.8211 ops/sec)
virtme-scsi:/mnt # btrfs ins dump-tree -t extent /dev/sda | grep -A 1
EXTENT_ITEM
item 13 key (298844160 EXTENT_ITEM 1048576) itemoff 15828
itemsize 53
refs 3 gen 8 flags DATA
virtme-scsi:/mnt # btrfs ins dump-tree -t fs /dev/sda | grep -A 3
EXTENT_DATA
item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
generation 9 type 2 (prealloc)
prealloc data disk byte 298844160 nr 1048576
prealloc data offset 0 nr 65536
item 7 key (257 EXTENT_DATA 65536) itemoff 15763 itemsize 53
generation 9 type 1 (regular)
extent data disk byte 298844160 nr 1048576
extent data offset 65536 nr 65536 ram 1048576
--
item 8 key (257 EXTENT_DATA 131072) itemoff 15710 itemsize 53
generation 9 type 2 (prealloc)
prealloc data disk byte 298844160 nr 1048576
prealloc data offset 131072 nr 917504
virtme-scsi:/mnt # btrfs ins dump-tree -t raid-stripe /dev/sda
btrfs-progs v6.9
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 30638080 items 1 free space 16226 generation 9 owner RAID_STRIPE_TREE
leaf 30638080 flags 0x1(WRITTEN) backref revision 1
checksum stored 9ed94b4d
checksum calced 9ed94b4d
fs uuid bebf1755-9379-4ac8-a623-ad0dc52641cf
chunk uuid 463f7b1d-c0b8-4373-8334-52f5bf83475e
item 0 key (298909696 RAID_STRIPE 65536) itemoff 16251 itemsize 32
stripe 0 devid 1 physical 298909696
stripe 1 devid 2 physical 277938176
total bytes 21474836480
bytes used 1212416
uuid bebf1755-9379-4ac8-a623-ad0dc52641cf
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC 0/2] Add dummy RAID stripe tree entries for PREALLOC data
2024-09-18 14:08 ` [RFC 0/2] Add dummy RAID stripe tree entries for PREALLOC data Johannes Thumshirn
` (2 preceding siblings ...)
2024-09-18 23:45 ` [RFC 0/2] Add dummy RAID stripe tree entries for PREALLOC data Qu Wenruo
@ 2024-09-19 16:57 ` Gerhard Wiesinger
2024-09-20 9:50 ` Johannes Thumshirn
2024-09-23 15:27 ` Josef Bacik
4 siblings, 1 reply; 12+ messages in thread
From: Gerhard Wiesinger @ 2024-09-19 16:57 UTC (permalink / raw)
To: Johannes Thumshirn, Qu Wenruo; +Cc: linux-btrfs
On 18.09.2024 16:08, Johannes Thumshirn wrote:
> This is an RFC implementation of Qu's request to be able to
> distinguish preallocated extents in the stripe tree for scrub.
>
> It's not 100% working yet but only showing the basic "how it's going to
> look like".
>
> I'm not really sure this is a better solution than returning ENOENT
> and ignoring it in scrub.
>
> A third possibility would be to do a full backref walk on
> btrfs_map_block() error and then check if it's a preallocated extent.
>
> Johannes Thumshirn (2):
> btrfs: introduce dummy RAID stripes for preallocated data
> btrfs: insert dummy RAID stripe extents for preallocated data
>
> fs/btrfs/inode.c | 47 +++++++++++++++++++++++++++++++++
> fs/btrfs/raid-stripe-tree.c | 47 +++++++++++++++++++++++++++++++++
> fs/btrfs/raid-stripe-tree.h | 2 ++
> include/uapi/linux/btrfs_tree.h | 1 +
> 4 files changed, 97 insertions(+)
>
Will this also solve the compression topic (see subject "BTRFS doesn't
compress on the fly") for e.g. PostgreSQL (which preallocates)?
Thnx.
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC 0/2] Add dummy RAID stripe tree entries for PREALLOC data
2024-09-19 16:57 ` Gerhard Wiesinger
@ 2024-09-20 9:50 ` Johannes Thumshirn
0 siblings, 0 replies; 12+ messages in thread
From: Johannes Thumshirn @ 2024-09-20 9:50 UTC (permalink / raw)
To: Gerhard Wiesinger, Johannes Thumshirn, WenRuo Qu
Cc: linux-btrfs@vger.kernel.org
On 19.09.24 19:06, Gerhard Wiesinger wrote:
>> Johannes Thumshirn (2):
>> btrfs: introduce dummy RAID stripes for preallocated data
>> btrfs: insert dummy RAID stripe extents for preallocated data
>>
>> fs/btrfs/inode.c | 47 +++++++++++++++++++++++++++++++++
>> fs/btrfs/raid-stripe-tree.c | 47 +++++++++++++++++++++++++++++++++
>> fs/btrfs/raid-stripe-tree.h | 2 ++
>> include/uapi/linux/btrfs_tree.h | 1 +
>> 4 files changed, 97 insertions(+)
>>
> Will this also solve the compression topic (see subject "BTRFS doesn't
> compress on the fly") for e.g. PostgreSQL (which preallocates)?
No I'm afraid this is something completely different. This is about
creating RAID stripe-tree entries. which noone should be running in any
kind of a production environment anyways, as the feature is still very
experimental.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC 0/2] Add dummy RAID stripe tree entries for PREALLOC data
2024-09-18 14:08 ` [RFC 0/2] Add dummy RAID stripe tree entries for PREALLOC data Johannes Thumshirn
` (3 preceding siblings ...)
2024-09-19 16:57 ` Gerhard Wiesinger
@ 2024-09-23 15:27 ` Josef Bacik
4 siblings, 0 replies; 12+ messages in thread
From: Josef Bacik @ 2024-09-23 15:27 UTC (permalink / raw)
To: Johannes Thumshirn; +Cc: Qu Wenruo, linux-btrfs
On Wed, Sep 18, 2024 at 04:08:48PM +0200, Johannes Thumshirn wrote:
> This is an RFC implementation of Qu's request to be able to
> distinguish preallocated extents in the stripe tree for scrub.
>
> It's not 100% working yet but only showing the basic "how it's going to
> look like".
>
> I'm not really sure this is a better solution than returning ENOENT
> and ignoring it in scrub.
>
> A third possibility would be to do a full backref walk on
> btrfs_map_block() error and then check if it's a preallocated extent.
>
> Johannes Thumshirn (2):
> btrfs: introduce dummy RAID stripes for preallocated data
> btrfs: insert dummy RAID stripe extents for preallocated data
>
I don't like this approach, I'd rather have a RST represent actually written
bytes on disk rather than adding entries to work around this particular case.
I think that -ENOENT from btrfs_map_block() is liable to result in weird
problems wif we return -ENOENT for other operations. I think that you change it
to return -ENODATA so that we can for sure trace it back to RST, and use that to
indicate that we need to skip that extent. This way we have something that is
unique to RST, and we don't have all these other entries that are unrelated to
the RST's job. Thanks,
Josef
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2024-09-23 15:27 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-12 14:33 [PATCH] btrfs: scrub: skip PREALLOC extents on RAID stripe-tree Johannes Thumshirn
2024-09-12 21:32 ` Qu Wenruo
2024-09-13 5:42 ` Johannes Thumshirn
2024-09-13 5:47 ` Qu Wenruo
2024-09-18 14:08 ` [RFC 0/2] Add dummy RAID stripe tree entries for PREALLOC data Johannes Thumshirn
2024-09-18 14:08 ` [RFC 1/2] btrfs: introduce dummy RAID stripes for preallocated data Johannes Thumshirn
2024-09-18 14:08 ` [RFC 2/2] btrfs: insert dummy RAID stripe extents " Johannes Thumshirn
2024-09-18 23:45 ` [RFC 0/2] Add dummy RAID stripe tree entries for PREALLOC data Qu Wenruo
2024-09-19 15:42 ` Johannes Thumshirn
2024-09-19 16:57 ` Gerhard Wiesinger
2024-09-20 9:50 ` Johannes Thumshirn
2024-09-23 15:27 ` Josef Bacik
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).