* [PATCH v4] btrfs: zoned: reserve data_reloc block group on mount
From: Johannes Thumshirn @ 2025-06-03 6:14 UTC
To: linux-btrfs
Cc: Filipe Manana, Damien Le Moal, David Sterba, Naohiro Aota,
Josef Bacik, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Create a block group dedicated to data relocation on mount of a zoned
filesystem.
If more than one empty DATA block group already exists on mount, one of
them is picked as the data relocation block group instead of creating a
new one.
This is done to ensure that there is always space for performing garbage
collection and that the filesystem does not hit ENOSPC under heavy
overwrite workloads.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
---
Changes to v3:
- Use jump label to only have search loop once (Filipe)
fs/btrfs/disk-io.c | 1 +
fs/btrfs/zoned.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/zoned.h | 3 +++
3 files changed, 65 insertions(+)
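For readers following the description above, the first search pass can be
modelled in user space roughly as follows. This is only a sketch, not kernel
code: `struct model_bg` and `pick_reloc_bg()` are hypothetical names made up
for illustration. It mirrors the loop in btrfs_zoned_reserve_data_reloc_bg():
used block groups are skipped, the first empty one is passed over, and the
next empty one is chosen. The chunk allocation fallback is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct btrfs_block_group. */
struct model_bg {
	uint64_t start;	/* logical start address */
	uint64_t used;	/* bytes in use; 0 means the block group is empty */
};

/*
 * Return the index of the second empty block group, or -1 if fewer than
 * two block groups are empty. The first empty one is skipped, as in the
 * patch, so it is not consumed by the reservation.
 */
static int pick_reloc_bg(const struct model_bg *bgs, size_t nr)
{
	bool seen_empty = false;

	assert(bgs != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		if (bgs[i].used > 0)
			continue;
		if (!seen_empty) {
			seen_empty = true;
			continue;
		}
		return (int)i;
	}
	return -1;
}
```

A return value of -1 corresponds to the case where the patch falls through
to btrfs_chunk_alloc() and then searches again.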
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3def93016963..b211dc8cdb86 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3562,6 +3562,7 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 		goto fail_sysfs;
 	}
 
+	btrfs_zoned_reserve_data_reloc_bg(fs_info);
 	btrfs_free_zone_cache(fs_info);
 
 	btrfs_check_active_zone_reservation(fs_info);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 19710634d63f..a31aa129cb0f 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -17,6 +17,7 @@
 #include "fs.h"
 #include "accessors.h"
 #include "bio.h"
+#include "transaction.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES 4096
@@ -2443,6 +2444,66 @@ void btrfs_clear_data_reloc_bg(struct btrfs_block_group *bg)
 	spin_unlock(&fs_info->relocation_bg_lock);
 }
 
+void btrfs_zoned_reserve_data_reloc_bg(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_space_info *data_sinfo = fs_info->data_sinfo;
+	struct btrfs_space_info *space_info = data_sinfo->sub_group[0];
+	struct btrfs_trans_handle *trans;
+	struct btrfs_block_group *bg;
+	struct list_head *bg_list;
+	u64 alloc_flags;
+	bool initial = false;
+	bool did_chunk_alloc = false;
+	int index;
+	int ret;
+
+	if (!btrfs_is_zoned(fs_info))
+		return;
+
+	if (fs_info->data_reloc_bg)
+		return;
+
+	if (sb_rdonly(fs_info->sb))
+		return;
+
+	ASSERT(space_info->subgroup_id == BTRFS_SUB_GROUP_DATA_RELOC);
+	alloc_flags = btrfs_get_alloc_profile(fs_info, space_info->flags);
+	index = btrfs_bg_flags_to_raid_index(alloc_flags);
+
+	bg_list = &data_sinfo->block_groups[index];
+again:
+	list_for_each_entry(bg, bg_list, list) {
+		if (bg->used > 0)
+			continue;
+
+		if (!initial) {
+			initial = true;
+			continue;
+		}
+
+		fs_info->data_reloc_bg = bg->start;
+		set_bit(BLOCK_GROUP_FLAG_ZONED_DATA_RELOC, &bg->runtime_flags);
+		btrfs_zone_activate(bg);
+
+		return;
+	}
+
+	if (did_chunk_alloc)
+		return;
+
+	trans = btrfs_join_transaction(fs_info->tree_root);
+	if (IS_ERR(trans))
+		return;
+
+	ret = btrfs_chunk_alloc(trans, space_info, alloc_flags, CHUNK_ALLOC_FORCE);
+	btrfs_end_transaction(trans);
+	if (ret == 1) {
+		did_chunk_alloc = true;
+		bg_list = &space_info->block_groups[index];
+		goto again;
+	}
+}
+
 void btrfs_free_zone_cache(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 9672bf4c3335..6e11533b8e14 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -88,6 +88,7 @@ void btrfs_zone_finish_endio(struct btrfs_fs_info *fs_info, u64 logical,
 void btrfs_schedule_zone_finish_bg(struct btrfs_block_group *bg,
 				   struct extent_buffer *eb);
 void btrfs_clear_data_reloc_bg(struct btrfs_block_group *bg);
+void btrfs_zoned_reserve_data_reloc_bg(struct btrfs_fs_info *fs_info);
 void btrfs_free_zone_cache(struct btrfs_fs_info *fs_info);
 bool btrfs_zoned_should_reclaim(const struct btrfs_fs_info *fs_info);
 void btrfs_zoned_release_data_reloc_bg(struct btrfs_fs_info *fs_info, u64 logical,
@@ -241,6 +242,8 @@ static inline void btrfs_schedule_zone_finish_bg(struct btrfs_block_group *bg,
 
 static inline void btrfs_clear_data_reloc_bg(struct btrfs_block_group *bg) { }
 
+static inline void btrfs_zoned_reserve_data_reloc_bg(struct btrfs_fs_info *fs_info) { }
+
 static inline void btrfs_free_zone_cache(struct btrfs_fs_info *fs_info) { }
 
 static inline bool btrfs_zoned_should_reclaim(const struct btrfs_fs_info *fs_info)
--
2.49.0
* Re: [PATCH v4] btrfs: zoned: reserve data_reloc block group on mount
From: Johannes Thumshirn @ 2025-06-03 11:22 UTC
To: Johannes Thumshirn, linux-btrfs@vger.kernel.org
Cc: Filipe Manana, Damien Le Moal, David Sterba, Naohiro Aota,
Josef Bacik
On 03.06.25 08:14, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Create a block group dedicated for data relocation on mount of a zoned
> filesystem.
>
> If there is already more than one empty DATA block group on mount, this
> one is picked for the data relocation block group, instead of a newly
> created one.
>
> This is done to ensure, there is always space for performing garbage
> collection and the filesystem is not hitting ENOSPC under heavy overwrite
> workloads.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
Unfortunately, this can result in filesystem corruption if the
accompanying mkfs patch is not applied.
I think this is because I'm not waiting for the transaction to be written
out when we need to allocate a chunk. Therefore, DUP metadata can somehow
get out of sync when one copy is on a sequential zone and the other is on
a conventional zone.
> fs/btrfs/disk-io.c | 1 +
> fs/btrfs/zoned.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/zoned.h | 3 +++
> 3 files changed, 65 insertions(+)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 3def93016963..b211dc8cdb86 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3562,6 +3562,7 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
> goto fail_sysfs;
> }
>
> + btrfs_zoned_reserve_data_reloc_bg(fs_info);
> btrfs_free_zone_cache(fs_info);
>
> btrfs_check_active_zone_reservation(fs_info);
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 19710634d63f..a31aa129cb0f 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -17,6 +17,7 @@
> #include "fs.h"
> #include "accessors.h"
> #include "bio.h"
> +#include "transaction.h"
>
> /* Maximum number of zones to report per blkdev_report_zones() call */
> #define BTRFS_REPORT_NR_ZONES 4096
> @@ -2443,6 +2444,66 @@ void btrfs_clear_data_reloc_bg(struct btrfs_block_group *bg)
> spin_unlock(&fs_info->relocation_bg_lock);
> }
>
> +void btrfs_zoned_reserve_data_reloc_bg(struct btrfs_fs_info *fs_info)
> +{
> + struct btrfs_space_info *data_sinfo = fs_info->data_sinfo;
> + struct btrfs_space_info *space_info = data_sinfo->sub_group[0];
> + struct btrfs_trans_handle *trans;
> + struct btrfs_block_group *bg;
> + struct list_head *bg_list;
> + u64 alloc_flags;
> + bool initial = false;
> + bool did_chunk_alloc = false;
> + int index;
> + int ret;
> +
> + if (!btrfs_is_zoned(fs_info))
> + return;
> +
> + if (fs_info->data_reloc_bg)
> + return;
> +
> + if (sb_rdonly(fs_info->sb))
> + return;
> +
> + ASSERT(space_info->subgroup_id == BTRFS_SUB_GROUP_DATA_RELOC);
> + alloc_flags = btrfs_get_alloc_profile(fs_info, space_info->flags);
> + index = btrfs_bg_flags_to_raid_index(alloc_flags);
> +
> + bg_list = &data_sinfo->block_groups[index];
> +again:
> + list_for_each_entry(bg, bg_list, list) {
> + if (bg->used > 0)
> + continue;
> +
> + if (!initial) {
> + initial = true;
> + continue;
> + }
> +
> + fs_info->data_reloc_bg = bg->start;
> + set_bit(BLOCK_GROUP_FLAG_ZONED_DATA_RELOC, &bg->runtime_flags);
> + btrfs_zone_activate(bg);
> +
> + return;
> + }
> +
> + if (did_chunk_alloc)
> + return;
> +
> + trans = btrfs_join_transaction(fs_info->tree_root);
> + if (IS_ERR(trans))
> + return;
> +
> + ret = btrfs_chunk_alloc(trans, space_info, alloc_flags, CHUNK_ALLOC_FORCE);
> + btrfs_end_transaction(trans);
> + if (ret == 1) {
> + did_chunk_alloc = true;
> + bg_list = &space_info->block_groups[index];
> + goto again;
> + }
> +}
> +
> void btrfs_free_zone_cache(struct btrfs_fs_info *fs_info)
> {
> struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
> index 9672bf4c3335..6e11533b8e14 100644
> --- a/fs/btrfs/zoned.h
> +++ b/fs/btrfs/zoned.h
> @@ -88,6 +88,7 @@ void btrfs_zone_finish_endio(struct btrfs_fs_info *fs_info, u64 logical,
> void btrfs_schedule_zone_finish_bg(struct btrfs_block_group *bg,
> struct extent_buffer *eb);
> void btrfs_clear_data_reloc_bg(struct btrfs_block_group *bg);
> +void btrfs_zoned_reserve_data_reloc_bg(struct btrfs_fs_info *fs_info);
> void btrfs_free_zone_cache(struct btrfs_fs_info *fs_info);
> bool btrfs_zoned_should_reclaim(const struct btrfs_fs_info *fs_info);
> void btrfs_zoned_release_data_reloc_bg(struct btrfs_fs_info *fs_info, u64 logical,
> @@ -241,6 +242,8 @@ static inline void btrfs_schedule_zone_finish_bg(struct btrfs_block_group *bg,
>
> static inline void btrfs_clear_data_reloc_bg(struct btrfs_block_group *bg) { }
>
> +static inline void btrfs_zoned_reserve_data_reloc_bg(struct btrfs_fs_info *fs_info) { }
> +
> static inline void btrfs_free_zone_cache(struct btrfs_fs_info *fs_info) { }
>
> static inline bool btrfs_zoned_should_reclaim(const struct btrfs_fs_info *fs_info)
* Re: [PATCH v4] btrfs: zoned: reserve data_reloc block group on mount
From: Filipe Manana @ 2025-06-03 14:06 UTC
To: Johannes Thumshirn
Cc: Johannes Thumshirn, linux-btrfs@vger.kernel.org, Filipe Manana,
Damien Le Moal, David Sterba, Naohiro Aota, Josef Bacik
On Tue, Jun 3, 2025 at 12:23 PM Johannes Thumshirn
<Johannes.Thumshirn@wdc.com> wrote:
>
> On 03.06.25 08:14, Johannes Thumshirn wrote:
> > From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> >
> > Create a block group dedicated for data relocation on mount of a zoned
> > filesystem.
> >
> > If there is already more than one empty DATA block group on mount, this
> > one is picked for the data relocation block group, instead of a newly
> > created one.
> >
> > This is done to ensure, there is always space for performing garbage
> > collection and the filesystem is not hitting ENOSPC under heavy overwrite
> > workloads.
> >
> > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> > Reviewed-by: Filipe Manana <fdmanana@suse.com>
>
> Unfortunately this can result in a FS corruption if the accompanying
> mkfs patch is not applied.
>
> I think it is, because I'm not waiting for the transaction to be written
> out in case we need to allocate a chunk. Therefor metadata on DUP can
> get out of sync somehow when one copy is on a sequential zone and one on
> a conventional zone.
I'm not familiar with the zone-specific problems, but there's no need to
commit a transaction in order to use a new chunk.
And if for some weird reason that is a problem in the zoned case, how
about committing the transaction after allocating the chunk? Does that
still cause any issue?
>
>
> > fs/btrfs/disk-io.c | 1 +
> > fs/btrfs/zoned.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++
> > fs/btrfs/zoned.h | 3 +++
> > 3 files changed, 65 insertions(+)
> >
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 3def93016963..b211dc8cdb86 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -3562,6 +3562,7 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
> > goto fail_sysfs;
> > }
> >
> > + btrfs_zoned_reserve_data_reloc_bg(fs_info);
> > btrfs_free_zone_cache(fs_info);
> >
> > btrfs_check_active_zone_reservation(fs_info);
> > diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> > index 19710634d63f..a31aa129cb0f 100644
> > --- a/fs/btrfs/zoned.c
> > +++ b/fs/btrfs/zoned.c
> > @@ -17,6 +17,7 @@
> > #include "fs.h"
> > #include "accessors.h"
> > #include "bio.h"
> > +#include "transaction.h"
> >
> > /* Maximum number of zones to report per blkdev_report_zones() call */
> > #define BTRFS_REPORT_NR_ZONES 4096
> > @@ -2443,6 +2444,66 @@ void btrfs_clear_data_reloc_bg(struct btrfs_block_group *bg)
> > spin_unlock(&fs_info->relocation_bg_lock);
> > }
> >
> > +void btrfs_zoned_reserve_data_reloc_bg(struct btrfs_fs_info *fs_info)
> > +{
> > + struct btrfs_space_info *data_sinfo = fs_info->data_sinfo;
> > + struct btrfs_space_info *space_info = data_sinfo->sub_group[0];
> > + struct btrfs_trans_handle *trans;
> > + struct btrfs_block_group *bg;
> > + struct list_head *bg_list;
> > + u64 alloc_flags;
> > + bool initial = false;
> > + bool did_chunk_alloc = false;
> > + int index;
> > + int ret;
> > +
> > + if (!btrfs_is_zoned(fs_info))
> > + return;
> > +
> > + if (fs_info->data_reloc_bg)
> > + return;
> > +
> > + if (sb_rdonly(fs_info->sb))
> > + return;
> > +
> > + ASSERT(space_info->subgroup_id == BTRFS_SUB_GROUP_DATA_RELOC);
> > + alloc_flags = btrfs_get_alloc_profile(fs_info, space_info->flags);
> > + index = btrfs_bg_flags_to_raid_index(alloc_flags);
> > +
> > + bg_list = &data_sinfo->block_groups[index];
> > +again:
> > + list_for_each_entry(bg, bg_list, list) {
> > + if (bg->used > 0)
> > + continue;
> > +
> > + if (!initial) {
> > + initial = true;
> > + continue;
> > + }
> > +
> > + fs_info->data_reloc_bg = bg->start;
> > + set_bit(BLOCK_GROUP_FLAG_ZONED_DATA_RELOC, &bg->runtime_flags);
> > + btrfs_zone_activate(bg);
> > +
> > + return;
> > + }
> > +
> > + if (did_chunk_alloc)
> > + return;
> > +
> > + trans = btrfs_join_transaction(fs_info->tree_root);
> > + if (IS_ERR(trans))
> > + return;
> > +
> > + ret = btrfs_chunk_alloc(trans, space_info, alloc_flags, CHUNK_ALLOC_FORCE);
> > + btrfs_end_transaction(trans);
> > + if (ret == 1) {
> > + did_chunk_alloc = true;
> > + bg_list = &space_info->block_groups[index];
> > + goto again;
> > + }
> > +}
> > +
> > void btrfs_free_zone_cache(struct btrfs_fs_info *fs_info)
> > {
> > struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> > diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
> > index 9672bf4c3335..6e11533b8e14 100644
> > --- a/fs/btrfs/zoned.h
> > +++ b/fs/btrfs/zoned.h
> > @@ -88,6 +88,7 @@ void btrfs_zone_finish_endio(struct btrfs_fs_info *fs_info, u64 logical,
> > void btrfs_schedule_zone_finish_bg(struct btrfs_block_group *bg,
> > struct extent_buffer *eb);
> > void btrfs_clear_data_reloc_bg(struct btrfs_block_group *bg);
> > +void btrfs_zoned_reserve_data_reloc_bg(struct btrfs_fs_info *fs_info);
> > void btrfs_free_zone_cache(struct btrfs_fs_info *fs_info);
> > bool btrfs_zoned_should_reclaim(const struct btrfs_fs_info *fs_info);
> > void btrfs_zoned_release_data_reloc_bg(struct btrfs_fs_info *fs_info, u64 logical,
> > @@ -241,6 +242,8 @@ static inline void btrfs_schedule_zone_finish_bg(struct btrfs_block_group *bg,
> >
> > static inline void btrfs_clear_data_reloc_bg(struct btrfs_block_group *bg) { }
> >
> > +static inline void btrfs_zoned_reserve_data_reloc_bg(struct btrfs_fs_info *fs_info) { }
> > +
> > static inline void btrfs_free_zone_cache(struct btrfs_fs_info *fs_info) { }
> >
> > static inline bool btrfs_zoned_should_reclaim(const struct btrfs_fs_info *fs_info)
>
* Re: [PATCH v4] btrfs: zoned: reserve data_reloc block group on mount
From: Johannes Thumshirn @ 2025-06-03 18:17 UTC
To: Filipe Manana
Cc: Johannes Thumshirn, linux-btrfs@vger.kernel.org, Filipe Manana,
Damien Le Moal, David Sterba, Naohiro Aota, Josef Bacik
On 03.06.25 16:07, Filipe Manana wrote:
> On Tue, Jun 3, 2025 at 12:23 PM Johannes Thumshirn
> <Johannes.Thumshirn@wdc.com> wrote:
>>
>> On 03.06.25 08:14, Johannes Thumshirn wrote:
>>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>
>>> Create a block group dedicated for data relocation on mount of a zoned
>>> filesystem.
>>>
>>> If there is already more than one empty DATA block group on mount, this
>>> one is picked for the data relocation block group, instead of a newly
>>> created one.
>>>
>>> This is done to ensure, there is always space for performing garbage
>>> collection and the filesystem is not hitting ENOSPC under heavy overwrite
>>> workloads.
>>>
>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>>
>> Unfortunately this can result in a FS corruption if the accompanying
>> mkfs patch is not applied.
>>
>> I think it is, because I'm not waiting for the transaction to be written
>> out in case we need to allocate a chunk. Therefor metadata on DUP can
>> get out of sync somehow when one copy is on a sequential zone and one on
>> a conventional zone.
>
> Not familiar with the zone specific problems, but in order to use a
> new chunk, there's no need to commit a transaction.
> And if for some weird reason that is a problem for the zoned case, how
> about committing the transaction after allocating the chunk? Does it
> still cause any issue?
AFAICT yes. And it's a very weird case that only happens with DUP
metadata, where the metadata bg has to be backed by one conventional and
one sequential zone.
The problem is that every hypothesis I have for how this could happen is
invalidated by looking at the code. Basically, what must happen is that
more than one chunk is created backed by the same zone, which according
to the code can't happen.
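To make the suspected invariant violation concrete, here is a small
user-space sketch of the check I have in mind. It is not kernel code and it
is not how btrfs tracks this: `struct model_chunk`, `struct model_stripe`
and `chunks_share_zone()` are hypothetical names, and the model assumes a
fixed zone size and that a stripe's zone is simply its physical offset
divided by the zone size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_STRIPES 2	/* DUP keeps two copies */

/* Hypothetical model of one stripe: a device and a physical offset. */
struct model_stripe {
	uint64_t devid;
	uint64_t physical;
};

/* Hypothetical model of a chunk and the stripes backing it. */
struct model_chunk {
	uint64_t logical;
	int nr_stripes;
	struct model_stripe stripes[MODEL_MAX_STRIPES];
};

/* Two stripes collide if they sit on the same device in the same zone. */
static bool same_zone(const struct model_stripe *a,
		      const struct model_stripe *b, uint64_t zone_size)
{
	return a->devid == b->devid &&
	       a->physical / zone_size == b->physical / zone_size;
}

/* Return true if any two distinct chunks have a stripe in the same zone. */
static bool chunks_share_zone(const struct model_chunk *chunks, size_t nr,
			      uint64_t zone_size)
{
	assert(zone_size > 0);

	for (size_t i = 0; i < nr; i++)
		for (size_t j = i + 1; j < nr; j++)
			for (int s = 0; s < chunks[i].nr_stripes; s++)
				for (int t = 0; t < chunks[j].nr_stripes; t++)
					if (same_zone(&chunks[i].stripes[s],
						      &chunks[j].stripes[t],
						      zone_size))
						return true;
	return false;
}
```

If the corruption is what I think it is, a check along these lines run over
the chunk mapping would return true, even though the allocator should never
produce such a layout.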