* [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target
2025-06-27 9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
@ 2025-06-27 9:19 ` Johannes Thumshirn
2025-06-27 11:34 ` Christoph Hellwig
2025-06-27 9:19 ` [PATCH RFC 2/9] btrfs: zoned: get rid of relocation_bg_lock Johannes Thumshirn
` (7 subsequent siblings)
8 siblings, 1 reply; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-27 9:19 UTC (permalink / raw)
To: linux-btrfs
Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
Boris Burkov, Filipe Manana
From: Naohiro Aota <naohiro.aota@wdc.com>
We call btrfs_zone_finish_one_bg() to zone finish one block group and make
room to activate another block group. Currently, a metadata block group can
be chosen as the target. But since we now reserve an active metadata block
group, we no longer want to select a metadata block group. So, skip it in
the loop.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
fs/btrfs/zoned.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index bd987c90a05c..0d5d6db72b62 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2651,8 +2651,10 @@ int btrfs_zone_finish_one_bg(struct btrfs_fs_info *fs_info)
spin_lock(&block_group->lock);
if (block_group->reserved || block_group->alloc_offset == 0 ||
- (block_group->flags & BTRFS_BLOCK_GROUP_SYSTEM) ||
- test_bit(BLOCK_GROUP_FLAG_ZONED_DATA_RELOC, &block_group->runtime_flags)) {
+ (block_group->flags &
+ (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_SYSTEM)) ||
+ test_bit(BLOCK_GROUP_FLAG_ZONED_DATA_RELOC,
+ &block_group->runtime_flags)) {
spin_unlock(&block_group->lock);
continue;
}
--
2.49.0
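The skip condition in the hunk above can be read as a single predicate. The following is a minimal user-space sketch of it, not the kernel code: 'struct bg_model' and skip_as_finish_target() are hypothetical names introduced for illustration, the runtime flag bit is modelled as a plain bool, and locking is omitted. The type bit values are assumed to match the btrfs UAPI header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Block group type bits, assumed to match include/uapi/linux/btrfs_tree.h. */
#define BTRFS_BLOCK_GROUP_DATA     (1ULL << 0)
#define BTRFS_BLOCK_GROUP_SYSTEM   (1ULL << 1)
#define BTRFS_BLOCK_GROUP_METADATA (1ULL << 2)

/* Hypothetical, simplified stand-in for struct btrfs_block_group. */
struct bg_model {
	uint64_t flags;        /* block group type bits */
	uint64_t reserved;     /* bytes reserved but not yet written */
	uint64_t alloc_offset; /* 0 means nothing was ever allocated */
	bool zoned_data_reloc; /* models BLOCK_GROUP_FLAG_ZONED_DATA_RELOC */
};

/* Returns true when the loop would 'continue' past this block group. */
static bool skip_as_finish_target(const struct bg_model *bg)
{
	return bg->reserved || bg->alloc_offset == 0 ||
	       (bg->flags &
		(BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_SYSTEM)) ||
	       bg->zoned_data_reloc;
}
```

With this model, a partially written data block group remains a candidate, while any metadata or system block group is skipped regardless of its fill level.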
* Re: [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target
2025-06-27 9:19 ` [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target Johannes Thumshirn
@ 2025-06-27 11:34 ` Christoph Hellwig
2025-07-02 15:34 ` Naohiro Aota
0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2025-06-27 11:34 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
Josef Bacik, Boris Burkov, Filipe Manana
On Fri, Jun 27, 2025 at 11:19:06AM +0200, Johannes Thumshirn wrote:
> From: Naohiro Aota <naohiro.aota@wdc.com>
>
> We call btrfs_zone_finish_one_bg() to zone finish one block group and make
> a room to activate another block group. Currently, we can choose a metadata
> block group as a target. But, as we reserve an active metadata block group,
> we no longer want to select a metadata block group. So, skip it in the
> loop.
Q: why do you finish a currently open zone to start with? If you add
an extra zone's worth of over-provisioning, you have enough slack to
always be able to fill to the advertised capacity, and never need to
finish an open zone before it is fully filled. That simplifies the
implementation and reduces P/E cycles.
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
You'll also need to add your signoff here when sending the patch on.
* Re: [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target
2025-06-27 11:34 ` Christoph Hellwig
@ 2025-07-02 15:34 ` Naohiro Aota
0 siblings, 0 replies; 25+ messages in thread
From: Naohiro Aota @ 2025-07-02 15:34 UTC (permalink / raw)
To: hch@infradead.org, Johannes Thumshirn
Cc: linux-btrfs@vger.kernel.org, Damien Le Moal, Naohiro Aota,
David Sterba, Josef Bacik, Boris Burkov, Filipe Manana
On Fri Jun 27, 2025 at 8:34 PM JST, Christoph Hellwig wrote:
> On Fri, Jun 27, 2025 at 11:19:06AM +0200, Johannes Thumshirn wrote:
>> From: Naohiro Aota <naohiro.aota@wdc.com>
>>
>> We call btrfs_zone_finish_one_bg() to zone finish one block group and make
>> a room to activate another block group. Currently, we can choose a metadata
>> block group as a target. But, as we reserve an active metadata block group,
>> we no longer want to select a metadata block group. So, skip it in the
>> loop.
>
> Q: why do you finish a currently open zone to start with? If you add
> an extra zones worth of over provisioning, you have enough slack to
> always be able to fill to the advertized capacity, and never need to
> finish an open zone before it is fully filled. Which simplifies the
> implementation and reduces P/E cycles.
Basically, this is called when data extent allocation cannot activate a
new zone because the number of active zones == max active zones. In this
case, it first calls btrfs_zone_finish_one_bg() to try to finish the zone
with the minimum free space. If that succeeds, we can allocate a new block
group and allocate an extent from there. Otherwise, it retries the
allocation with a smaller size. So, it simply prefers finishing a zone over
filling it with a fragmented allocation.
Another use is when writing metadata. While we reserve zones for
metadata, this can be an escape hatch to finish some zones and make
room for the new write.
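The fallback order described above can be sketched as a small decision function. This is an illustrative user-space model only: 'struct zone_model', 'enum alloc_step', pick_finish_target() and next_alloc_step() are hypothetical names, and the real allocator tracks considerably more state than this.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome labels for the fallback described above. */
enum alloc_step {
	STEP_ACTIVATE_ZONE,        /* an active-zone slot is still free */
	STEP_FINISH_THEN_ACTIVATE, /* finish one zone, then activate a new one */
	STEP_RETRY_SMALLER,        /* nothing to finish: retry with a smaller size */
};

/* Hypothetical, simplified view of the zone behind an active block group. */
struct zone_model {
	uint64_t free_bytes; /* space left before the zone is full */
	bool can_finish;     /* false for block groups the loop would skip */
};

/* Index of the finishable zone with the least free space, or -1 if none. */
static int pick_finish_target(const struct zone_model *zones, size_t nr)
{
	int best = -1;

	for (size_t i = 0; i < nr; i++) {
		if (!zones[i].can_finish)
			continue;
		if (best < 0 || zones[i].free_bytes < zones[best].free_bytes)
			best = (int)i;
	}
	return best;
}

/* Decide the next step; *target is the zone to finish, or -1. */
static enum alloc_step next_alloc_step(size_t nr_active, size_t max_active,
				       const struct zone_model *zones,
				       size_t nr, int *target)
{
	*target = -1;
	if (nr_active < max_active)
		return STEP_ACTIVATE_ZONE;

	*target = pick_finish_target(zones, nr);
	return *target >= 0 ? STEP_FINISH_THEN_ACTIVATE : STEP_RETRY_SMALLER;
}
```

Choosing the zone with the least free space keeps the capacity given up by an early finish as small as possible, which is the trade-off against a fragmented allocation mentioned above.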
>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>
> You'll also need to add your signoff here when sending the patch on.
* [PATCH RFC 2/9] btrfs: zoned: get rid of relocation_bg_lock
2025-06-27 9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
2025-06-27 9:19 ` [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target Johannes Thumshirn
@ 2025-06-27 9:19 ` Johannes Thumshirn
2025-06-27 9:19 ` [PATCH RFC 3/9] btrfs: zoned: get rid of treelog_bg_lock Johannes Thumshirn
` (6 subsequent siblings)
8 siblings, 0 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-27 9:19 UTC (permalink / raw)
To: linux-btrfs
Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
Boris Burkov, Filipe Manana, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Lockstat analysis of benchmark workloads shows very high contention on
the relocation_bg_lock. But the relocation_bg_lock only protects a single
field in 'struct btrfs_fs_info', namely 'u64 data_reloc_bg'.
Use READ_ONCE()/WRITE_ONCE() to access 'btrfs_fs_info::data_reloc_bg'.
This is safe in the allocator path, as relocation I/O is only going to
block groups in the relocation sub-space_info and at the moment, there is
only one relocation block group in this space info.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/disk-io.c | 1 -
fs/btrfs/extent-tree.c | 28 +++++++++++-----------------
fs/btrfs/fs.h | 6 +-----
fs/btrfs/zoned.c | 11 +++++------
4 files changed, 17 insertions(+), 29 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6ac5be02dce7..9a13f5b1ed43 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2791,7 +2791,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
spin_lock_init(&fs_info->unused_bgs_lock);
spin_lock_init(&fs_info->treelog_bg_lock);
spin_lock_init(&fs_info->zone_active_bgs_lock);
- spin_lock_init(&fs_info->relocation_bg_lock);
rwlock_init(&fs_info->tree_mod_log_lock);
rwlock_init(&fs_info->global_root_lock);
mutex_init(&fs_info->unused_bg_unpin_mutex);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 10f50c725313..a9bda68a1883 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3865,14 +3865,10 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
* Do not allow non-relocation blocks in the dedicated relocation block
* group, and vice versa.
*/
- spin_lock(&fs_info->relocation_bg_lock);
- data_reloc_bytenr = fs_info->data_reloc_bg;
+ data_reloc_bytenr = READ_ONCE(fs_info->data_reloc_bg);
if (data_reloc_bytenr &&
((ffe_ctl->for_data_reloc && bytenr != data_reloc_bytenr) ||
(!ffe_ctl->for_data_reloc && bytenr == data_reloc_bytenr)))
- skip = true;
- spin_unlock(&fs_info->relocation_bg_lock);
- if (skip)
return 1;
/* Check RO and no space case before trying to activate it */
@@ -3899,7 +3895,6 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
spin_lock(&space_info->lock);
spin_lock(&block_group->lock);
spin_lock(&fs_info->treelog_bg_lock);
- spin_lock(&fs_info->relocation_bg_lock);
if (ret)
goto out;
@@ -3908,8 +3903,8 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
block_group->start == fs_info->treelog_bg ||
fs_info->treelog_bg == 0);
ASSERT(!ffe_ctl->for_data_reloc ||
- block_group->start == fs_info->data_reloc_bg ||
- fs_info->data_reloc_bg == 0);
+ block_group->start == data_reloc_bytenr ||
+ data_reloc_bytenr == 0);
if (block_group->ro ||
(!ffe_ctl->for_data_reloc &&
@@ -3932,7 +3927,7 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
* Do not allow currently used block group to be the data relocation
* dedicated block group.
*/
- if (ffe_ctl->for_data_reloc && !fs_info->data_reloc_bg &&
+ if (ffe_ctl->for_data_reloc && data_reloc_bytenr == 0 &&
(block_group->used || block_group->reserved)) {
ret = 1;
goto out;
@@ -3957,8 +3952,8 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
fs_info->treelog_bg = block_group->start;
if (ffe_ctl->for_data_reloc) {
- if (!fs_info->data_reloc_bg)
- fs_info->data_reloc_bg = block_group->start;
+ if (READ_ONCE(fs_info->data_reloc_bg) == 0)
+ WRITE_ONCE(fs_info->data_reloc_bg, block_group->start);
/*
* Do not allow allocations from this block group, unless it is
* for data relocation. Compared to increasing the ->ro, setting
@@ -3994,8 +3989,7 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
if (ret && ffe_ctl->for_treelog)
fs_info->treelog_bg = 0;
if (ret && ffe_ctl->for_data_reloc)
- fs_info->data_reloc_bg = 0;
- spin_unlock(&fs_info->relocation_bg_lock);
+ WRITE_ONCE(fs_info->data_reloc_bg, 0);
spin_unlock(&fs_info->treelog_bg_lock);
spin_unlock(&block_group->lock);
spin_unlock(&space_info->lock);
@@ -4304,10 +4298,10 @@ static int prepare_allocation_zoned(struct btrfs_fs_info *fs_info,
ffe_ctl->hint_byte = fs_info->treelog_bg;
spin_unlock(&fs_info->treelog_bg_lock);
} else if (ffe_ctl->for_data_reloc) {
- spin_lock(&fs_info->relocation_bg_lock);
- if (fs_info->data_reloc_bg)
- ffe_ctl->hint_byte = fs_info->data_reloc_bg;
- spin_unlock(&fs_info->relocation_bg_lock);
+ u64 data_reloc_bg = READ_ONCE(fs_info->data_reloc_bg);
+
+ if (data_reloc_bg)
+ ffe_ctl->hint_byte = data_reloc_bg;
} else if (ffe_ctl->flags & BTRFS_BLOCK_GROUP_DATA) {
struct btrfs_block_group *block_group;
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index b239e4b8421c..570f4b85096c 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -849,11 +849,7 @@ struct btrfs_fs_info {
spinlock_t treelog_bg_lock;
u64 treelog_bg;
- /*
- * Start of the dedicated data relocation block group, protected by
- * relocation_bg_lock.
- */
- spinlock_t relocation_bg_lock;
+ /* Start of the dedicated data relocation block group */
u64 data_reloc_bg;
struct mutex zoned_data_reloc_io_lock;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 0d5d6db72b62..388c277a84d3 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2495,11 +2495,10 @@ void btrfs_schedule_zone_finish_bg(struct btrfs_block_group *bg,
void btrfs_clear_data_reloc_bg(struct btrfs_block_group *bg)
{
struct btrfs_fs_info *fs_info = bg->fs_info;
+ u64 data_reloc_bg = READ_ONCE(fs_info->data_reloc_bg);
- spin_lock(&fs_info->relocation_bg_lock);
- if (fs_info->data_reloc_bg == bg->start)
- fs_info->data_reloc_bg = 0;
- spin_unlock(&fs_info->relocation_bg_lock);
+ if (data_reloc_bg == bg->start)
+ WRITE_ONCE(fs_info->data_reloc_bg, 0);
}
void btrfs_zoned_reserve_data_reloc_bg(struct btrfs_fs_info *fs_info)
@@ -2518,7 +2517,7 @@ void btrfs_zoned_reserve_data_reloc_bg(struct btrfs_fs_info *fs_info)
if (!btrfs_is_zoned(fs_info))
return;
- if (fs_info->data_reloc_bg)
+ if (READ_ONCE(fs_info->data_reloc_bg))
return;
if (sb_rdonly(fs_info->sb))
@@ -2539,7 +2538,7 @@ void btrfs_zoned_reserve_data_reloc_bg(struct btrfs_fs_info *fs_info)
continue;
}
- fs_info->data_reloc_bg = bg->start;
+ WRITE_ONCE(fs_info->data_reloc_bg, bg->start);
set_bit(BLOCK_GROUP_FLAG_ZONED_DATA_RELOC, &bg->runtime_flags);
btrfs_zone_activate(bg);
--
2.49.0
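For readers unfamiliar with the pattern used in this patch, here is a rough user-space approximation of it. The READ_ONCE()/WRITE_ONCE() macros below are simplified volatile-cast stand-ins and not the kernel definitions (they rely on the GNU '__typeof__' extension), and 'struct fs_info_model' and reloc_bg_mismatch() are hypothetical names. The point of the sketch is that the field is sampled once into a local and every later comparison uses that snapshot.

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel macros (assumption: GCC or Clang). */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical model holding only the field the patch is about. */
struct fs_info_model {
	uint64_t data_reloc_bg; /* start of the relocation block group, 0 if unset */
};

/*
 * Models the early check in do_allocation_zoned(): returns 1 when
 * relocation and non-relocation allocations would be mixed.
 */
static int reloc_bg_mismatch(struct fs_info_model *fs, uint64_t bytenr,
			     int for_data_reloc)
{
	uint64_t snap = READ_ONCE(fs->data_reloc_bg);

	return snap && ((for_data_reloc && bytenr != snap) ||
			(!for_data_reloc && bytenr == snap));
}

/* Models btrfs_clear_data_reloc_bg(): clear only if it is still ours. */
static void clear_data_reloc_bg(struct fs_info_model *fs, uint64_t bg_start)
{
	if (READ_ONCE(fs->data_reloc_bg) == bg_start)
		WRITE_ONCE(fs->data_reloc_bg, 0);
}
```

Note that the check-then-write in clear_data_reloc_bg() is not atomic as a whole. The commit message argues this is acceptable because only one relocation block group exists in the sub-space_info at a time; the sketch does not attempt to prove that.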
* [PATCH RFC 3/9] btrfs: zoned: get rid of treelog_bg_lock
2025-06-27 9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
2025-06-27 9:19 ` [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target Johannes Thumshirn
2025-06-27 9:19 ` [PATCH RFC 2/9] btrfs: zoned: get rid of relocation_bg_lock Johannes Thumshirn
@ 2025-06-27 9:19 ` Johannes Thumshirn
2025-06-27 9:19 ` [PATCH RFC 4/9] btrfs: zoned: don't hold space_info lock on zoned allocation Johannes Thumshirn
` (5 subsequent siblings)
8 siblings, 0 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-27 9:19 UTC (permalink / raw)
To: linux-btrfs
Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
Boris Burkov, Filipe Manana, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Lockstat analysis of benchmark workloads shows very high contention on
the treelog_bg_lock. But the treelog_bg_lock only protects a single
field in 'struct btrfs_fs_info', namely 'u64 treelog_bg'.
Use READ_ONCE()/WRITE_ONCE() to access 'btrfs_fs_info::treelog_bg'.
This is safe in the allocator path, as treelog I/O is only going to block
groups in the treelog sub-space_info and at the moment, there is only one
treelog block group in this space info.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/disk-io.c | 1 -
fs/btrfs/extent-tree.c | 45 +++++++++++-------------------------------
fs/btrfs/fs.h | 1 -
fs/btrfs/zoned.c | 2 +-
fs/btrfs/zoned.h | 7 +++----
5 files changed, 15 insertions(+), 41 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9a13f5b1ed43..35cd38de7727 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2789,7 +2789,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
spin_lock_init(&fs_info->defrag_inodes_lock);
spin_lock_init(&fs_info->super_lock);
spin_lock_init(&fs_info->unused_bgs_lock);
- spin_lock_init(&fs_info->treelog_bg_lock);
spin_lock_init(&fs_info->zone_active_bgs_lock);
rwlock_init(&fs_info->tree_mod_log_lock);
rwlock_init(&fs_info->global_root_lock);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a9bda68a1883..46358a555f78 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3809,22 +3809,6 @@ static int do_allocation_clustered(struct btrfs_block_group *block_group,
return find_free_extent_unclustered(block_group, ffe_ctl);
}
-/*
- * Tree-log block group locking
- * ============================
- *
- * fs_info::treelog_bg_lock protects the fs_info::treelog_bg which
- * indicates the starting address of a block group, which is reserved only
- * for tree-log metadata.
- *
- * Lock nesting
- * ============
- *
- * space_info::lock
- * block_group::lock
- * fs_info::treelog_bg_lock
- */
-
/*
* Simple allocator for sequential-only block group. It only allows sequential
* allocation. No need to play with trees. This function also reserves the
@@ -3844,7 +3828,6 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
u64 log_bytenr;
u64 data_reloc_bytenr;
int ret = 0;
- bool skip = false;
ASSERT(btrfs_is_zoned(block_group->fs_info));
@@ -3852,13 +3835,9 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
* Do not allow non-tree-log blocks in the dedicated tree-log block
* group, and vice versa.
*/
- spin_lock(&fs_info->treelog_bg_lock);
- log_bytenr = fs_info->treelog_bg;
+ log_bytenr = READ_ONCE(fs_info->treelog_bg);
if (log_bytenr && ((ffe_ctl->for_treelog && bytenr != log_bytenr) ||
(!ffe_ctl->for_treelog && bytenr == log_bytenr)))
- skip = true;
- spin_unlock(&fs_info->treelog_bg_lock);
- if (skip)
return 1;
/*
@@ -3894,14 +3873,13 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
spin_lock(&space_info->lock);
spin_lock(&block_group->lock);
- spin_lock(&fs_info->treelog_bg_lock);
if (ret)
goto out;
ASSERT(!ffe_ctl->for_treelog ||
- block_group->start == fs_info->treelog_bg ||
- fs_info->treelog_bg == 0);
+ block_group->start == log_bytenr ||
+ log_bytenr == 0);
ASSERT(!ffe_ctl->for_data_reloc ||
block_group->start == data_reloc_bytenr ||
data_reloc_bytenr == 0);
@@ -3917,7 +3895,7 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
* Do not allow currently using block group to be tree-log dedicated
* block group.
*/
- if (ffe_ctl->for_treelog && !fs_info->treelog_bg &&
+ if (ffe_ctl->for_treelog && log_bytenr == 0 &&
(block_group->used || block_group->reserved)) {
ret = 1;
goto out;
@@ -3948,8 +3926,8 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
goto out;
}
- if (ffe_ctl->for_treelog && !fs_info->treelog_bg)
- fs_info->treelog_bg = block_group->start;
+ if (ffe_ctl->for_treelog && READ_ONCE(fs_info->treelog_bg) == 0)
+ WRITE_ONCE(fs_info->treelog_bg, block_group->start);
if (ffe_ctl->for_data_reloc) {
if (READ_ONCE(fs_info->data_reloc_bg) == 0)
@@ -3987,10 +3965,9 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
out:
if (ret && ffe_ctl->for_treelog)
- fs_info->treelog_bg = 0;
+ WRITE_ONCE(fs_info->treelog_bg, 0);
if (ret && ffe_ctl->for_data_reloc)
WRITE_ONCE(fs_info->data_reloc_bg, 0);
- spin_unlock(&fs_info->treelog_bg_lock);
spin_unlock(&block_group->lock);
spin_unlock(&space_info->lock);
return ret;
@@ -4293,10 +4270,10 @@ static int prepare_allocation_zoned(struct btrfs_fs_info *fs_info,
struct find_free_extent_ctl *ffe_ctl)
{
if (ffe_ctl->for_treelog) {
- spin_lock(&fs_info->treelog_bg_lock);
- if (fs_info->treelog_bg)
- ffe_ctl->hint_byte = fs_info->treelog_bg;
- spin_unlock(&fs_info->treelog_bg_lock);
+ u64 treelog_bg = READ_ONCE(fs_info->treelog_bg);
+
+ if (treelog_bg)
+ ffe_ctl->hint_byte = treelog_bg;
} else if (ffe_ctl->for_data_reloc) {
u64 data_reloc_bg = READ_ONCE(fs_info->data_reloc_bg);
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 570f4b85096c..a388af40a251 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -846,7 +846,6 @@ struct btrfs_fs_info {
u64 max_zone_append_size;
struct mutex zoned_meta_io_lock;
- spinlock_t treelog_bg_lock;
u64 treelog_bg;
/* Start of the dedicated data relocation block group */
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 388c277a84d3..c89f846af6dd 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1948,7 +1948,7 @@ static bool check_bg_is_active(struct btrfs_eb_write_context *ctx,
if (test_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &block_group->runtime_flags))
return true;
- if (fs_info->treelog_bg == block_group->start) {
+ if (READ_ONCE(fs_info->treelog_bg) == block_group->start) {
if (!btrfs_zone_activate(block_group)) {
int ret_fin = btrfs_zone_finish_one_bg(fs_info);
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 6e11533b8e14..c1b3a5c3a799 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -383,14 +383,13 @@ static inline void btrfs_zoned_meta_io_unlock(struct btrfs_fs_info *fs_info)
static inline void btrfs_clear_treelog_bg(struct btrfs_block_group *bg)
{
struct btrfs_fs_info *fs_info = bg->fs_info;
+ u64 treelog_bg = READ_ONCE(fs_info->treelog_bg);
if (!btrfs_is_zoned(fs_info))
return;
- spin_lock(&fs_info->treelog_bg_lock);
- if (fs_info->treelog_bg == bg->start)
- fs_info->treelog_bg = 0;
- spin_unlock(&fs_info->treelog_bg_lock);
+ if (treelog_bg == bg->start)
+ WRITE_ONCE(fs_info->treelog_bg, 0);
}
static inline void btrfs_zoned_data_reloc_lock(struct btrfs_inode *inode)
--
2.49.0
* [PATCH RFC 4/9] btrfs: zoned: don't hold space_info lock on zoned allocation
2025-06-27 9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
` (2 preceding siblings ...)
2025-06-27 9:19 ` [PATCH RFC 3/9] btrfs: zoned: get rid of treelog_bg_lock Johannes Thumshirn
@ 2025-06-27 9:19 ` Johannes Thumshirn
2025-06-27 9:19 ` [PATCH RFC 5/9] btrfs: remove delalloc_root_mutex Johannes Thumshirn
` (4 subsequent siblings)
8 siblings, 0 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-27 9:19 UTC (permalink / raw)
To: linux-btrfs
Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
Boris Burkov, Filipe Manana, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
The zoned extent allocator holds 'struct btrfs_space_info::lock' for nearly
the entire allocation process, but nothing in do_allocation_zoned()
actually accesses fields of 'struct btrfs_space_info'.
Furthermore, lock_stat snapshots taken during performance testing always
show space_info::lock as the most contended lock in the entire system.
Stop taking the space_info lock in do_allocation_zoned() to reduce
lock contention.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/extent-tree.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 46358a555f78..da731f6d4dad 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3819,7 +3819,6 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
struct btrfs_block_group **bg_ret)
{
struct btrfs_fs_info *fs_info = block_group->fs_info;
- struct btrfs_space_info *space_info = block_group->space_info;
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
u64 start = block_group->start;
u64 num_bytes = ffe_ctl->num_bytes;
@@ -3871,7 +3870,6 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
*/
}
- spin_lock(&space_info->lock);
spin_lock(&block_group->lock);
if (ret)
@@ -3969,7 +3967,6 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
if (ret && ffe_ctl->for_data_reloc)
WRITE_ONCE(fs_info->data_reloc_bg, 0);
spin_unlock(&block_group->lock);
- spin_unlock(&space_info->lock);
return ret;
}
--
2.49.0
* [PATCH RFC 5/9] btrfs: remove delalloc_root_mutex
2025-06-27 9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
` (3 preceding siblings ...)
2025-06-27 9:19 ` [PATCH RFC 4/9] btrfs: zoned: don't hold space_info lock on zoned allocation Johannes Thumshirn
@ 2025-06-27 9:19 ` Johannes Thumshirn
2025-06-27 12:42 ` Filipe Manana
2025-06-27 9:19 ` [PATCH RFC 6/9] btrfs: remove btrfs_root's delalloc_mutex Johannes Thumshirn
` (3 subsequent siblings)
8 siblings, 1 reply; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-27 9:19 UTC (permalink / raw)
To: linux-btrfs
Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
Boris Burkov, Filipe Manana, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
When benchmarking garbage collection on zoned BTRFS filesystems on ZNS
drives, we regularly observe hung_task messages like the following:
INFO: task kworker/u132:2:297 blocked for more than 122 seconds.
Not tainted 6.16.0-rc1+ #1225
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u132:2 state:D stack:0 pid:297 tgid:297 ppid:2 task_flags:0x4208060 flags:0x00004000
Workqueue: events_unbound btrfs_preempt_reclaim_metadata_space
Call Trace:
<TASK>
__schedule+0x2f9/0x7b0
schedule+0x27/0x80
schedule_preempt_disabled+0x15/0x30
__mutex_lock.constprop.0+0x4af/0x890
? srso_return_thunk+0x5/0x5f
btrfs_start_delalloc_roots+0x8a/0x290
? timerqueue_del+0x2e/0x60
shrink_delalloc+0x10c/0x2d0
? srso_return_thunk+0x5/0x5f
? psi_group_change+0x19e/0x460
? srso_return_thunk+0x5/0x5f
? btrfs_reduce_alloc_profile+0x9a/0x1d0
flush_space+0x202/0x280
? srso_return_thunk+0x5/0x5f
? need_preemptive_reclaim+0xaa/0x190
btrfs_preempt_reclaim_metadata_space+0xe7/0x340
process_one_work+0x192/0x350
worker_thread+0x25a/0x3a0
? __pfx_worker_thread+0x10/0x10
kthread+0xfc/0x240
? __pfx_kthread+0x10/0x10
? __pfx_kthread+0x10/0x10
ret_from_fork+0x152/0x180
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
INFO: task kworker/u132:2:297 is blocked on a mutex likely owned by task kworker/u129:0:2359.
task:kworker/u129:0 state:R running task stack:0 pid:2359 tgid:2359 ppid:2
The affected tasks are blocked on 'struct btrfs_fs_info::delalloc_root_mutex',
a global lock that serializes entry into btrfs_start_delalloc_roots().
This lock was introduced in commit 573bfb72f760 ("Btrfs: fix possible
empty list access when flushing the delalloc inodes") but without a
clear justification for its necessity.
However, the condition it was meant to protect against—a possibly empty
list access—is already safely handled by 'list_splice_init()', which
does nothing when the source list is empty.
There are no known concurrency issues in btrfs_start_delalloc_roots()
that require serialization via this mutex. All critical regions are
either covered by per-root locking or operate on safely isolated lists.
Removing the lock eliminates the observed hangs and improves metadata
GC throughput, particularly on systems with high concurrency like
ZNS-based deployments.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/disk-io.c | 1 -
fs/btrfs/fs.h | 1 -
fs/btrfs/inode.c | 2 --
3 files changed, 4 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 35cd38de7727..929f39886b0e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2795,7 +2795,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
mutex_init(&fs_info->unused_bg_unpin_mutex);
mutex_init(&fs_info->reclaim_bgs_lock);
mutex_init(&fs_info->reloc_mutex);
- mutex_init(&fs_info->delalloc_root_mutex);
mutex_init(&fs_info->zoned_meta_io_lock);
mutex_init(&fs_info->zoned_data_reloc_io_lock);
seqlock_init(&fs_info->profiles_lock);
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index a388af40a251..04ebc976f841 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -606,7 +606,6 @@ struct btrfs_fs_info {
*/
struct list_head ordered_roots;
- struct mutex delalloc_root_mutex;
spinlock_t delalloc_root_lock;
/* All fs/file tree roots that have delalloc inodes. */
struct list_head delalloc_roots;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 80c72c594b19..d68f4ef61c43 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8766,7 +8766,6 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
if (BTRFS_FS_ERROR(fs_info))
return -EROFS;
- mutex_lock(&fs_info->delalloc_root_mutex);
spin_lock(&fs_info->delalloc_root_lock);
list_splice_init(&fs_info->delalloc_roots, &splice);
while (!list_empty(&splice)) {
@@ -8800,7 +8799,6 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
list_splice_tail(&splice, &fs_info->delalloc_roots);
spin_unlock(&fs_info->delalloc_root_lock);
}
- mutex_unlock(&fs_info->delalloc_root_mutex);
return ret;
}
--
2.49.0
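The claim that list_splice_init() does nothing for an empty source list can be illustrated with a minimal user-space re-implementation of the relevant list helpers. This is a sketch written for this note and not a copy of include/linux/list.h; it only mirrors the behaviour that matters here, namely the early return on an empty source and the re-initialisation of the source after a splice.

```c
/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Insert 'entry' before 'head', i.e. at the tail of the list. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Move all entries of 'list' to the front of 'head' and reset 'list'. */
static void list_splice_init(struct list_head *list, struct list_head *head)
{
	struct list_head *first, *last, *at;

	if (list_empty(list))
		return; /* empty source: 'head' is left untouched */

	first = list->next;
	last = list->prev;
	at = head->next;

	first->prev = head;
	head->next = first;
	last->next = at;
	at->prev = last;

	INIT_LIST_HEAD(list);
}
```

This only covers the empty-list argument from the commit message. It says nothing about whether concurrent callers still observe every root, which is a separate question.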
* Re: [PATCH RFC 5/9] btrfs: remove delalloc_root_mutex
2025-06-27 9:19 ` [PATCH RFC 5/9] btrfs: remove delalloc_root_mutex Johannes Thumshirn
@ 2025-06-27 12:42 ` Filipe Manana
0 siblings, 0 replies; 25+ messages in thread
From: Filipe Manana @ 2025-06-27 12:42 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn
On Fri, Jun 27, 2025 at 10:23 AM Johannes Thumshirn <jth@kernel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> When benchmarking garbage collection on zoned BTRFS filesystems on ZNS
> drives, we regularly observe hung_task messages like the following:
>
> INFO: task kworker/u132:2:297 blocked for more than 122 seconds.
> Not tainted 6.16.0-rc1+ #1225
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/u132:2 state:D stack:0 pid:297 tgid:297 ppid:2 task_flags:0x4208060 flags:0x00004000
> Workqueue: events_unbound btrfs_preempt_reclaim_metadata_space
> Call Trace:
> <TASK>
> __schedule+0x2f9/0x7b0
> schedule+0x27/0x80
> schedule_preempt_disabled+0x15/0x30
> __mutex_lock.constprop.0+0x4af/0x890
> ? srso_return_thunk+0x5/0x5f
> btrfs_start_delalloc_roots+0x8a/0x290
> ? timerqueue_del+0x2e/0x60
> shrink_delalloc+0x10c/0x2d0
> ? srso_return_thunk+0x5/0x5f
> ? psi_group_change+0x19e/0x460
> ? srso_return_thunk+0x5/0x5f
> ? btrfs_reduce_alloc_profile+0x9a/0x1d0
> flush_space+0x202/0x280
> ? srso_return_thunk+0x5/0x5f
> ? need_preemptive_reclaim+0xaa/0x190
> btrfs_preempt_reclaim_metadata_space+0xe7/0x340
> process_one_work+0x192/0x350
> worker_thread+0x25a/0x3a0
> ? __pfx_worker_thread+0x10/0x10
> kthread+0xfc/0x240
> ? __pfx_kthread+0x10/0x10
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x152/0x180
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
> INFO: task kworker/u132:2:297 is blocked on a mutex likely owned by task kworker/u129:0:2359.
> task:kworker/u129:0 state:R running task stack:0 pid:2359 tgid:2359 ppid:2
>
> The affected tasks are blocked on 'struct btrfs_fs_info::delalloc_root_mutex',
> a global lock that serializes entry into btrfs_start_delalloc_roots().
> This lock was introduced in commit 573bfb72f760 ("Btrfs: fix possible
> empty list access when flushing the delalloc inodes") but without a
> clear justification for its necessity.
>
> However, the condition it was meant to protect against—a possibly empty
> list access—is already safely handled by 'list_splice_init()', which
> does nothing when the source list is empty.
>
> There are no known concurrency issues in btrfs_start_delalloc_roots()
> that require serialization via this mutex. All critical regions are
> either covered by per-root locking or operate on safely isolated lists.
Nope... see comments further below.
>
> Removing the lock eliminates the observed hangs and improves metadata
> GC throughput, particularly on systems with high concurrency like
> ZNS-based deployments.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/disk-io.c | 1 -
> fs/btrfs/fs.h | 1 -
> fs/btrfs/inode.c | 2 --
> 3 files changed, 4 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 35cd38de7727..929f39886b0e 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2795,7 +2795,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
> mutex_init(&fs_info->unused_bg_unpin_mutex);
> mutex_init(&fs_info->reclaim_bgs_lock);
> mutex_init(&fs_info->reloc_mutex);
> - mutex_init(&fs_info->delalloc_root_mutex);
> mutex_init(&fs_info->zoned_meta_io_lock);
> mutex_init(&fs_info->zoned_data_reloc_io_lock);
> seqlock_init(&fs_info->profiles_lock);
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index a388af40a251..04ebc976f841 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -606,7 +606,6 @@ struct btrfs_fs_info {
> */
> struct list_head ordered_roots;
>
> - struct mutex delalloc_root_mutex;
> spinlock_t delalloc_root_lock;
> /* All fs/file tree roots that have delalloc inodes. */
> struct list_head delalloc_roots;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 80c72c594b19..d68f4ef61c43 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8766,7 +8766,6 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
> if (BTRFS_FS_ERROR(fs_info))
> return -EROFS;
>
> - mutex_lock(&fs_info->delalloc_root_mutex);
> spin_lock(&fs_info->delalloc_root_lock);
> list_splice_init(&fs_info->delalloc_roots, &splice);
> while (!list_empty(&splice)) {
> @@ -8800,7 +8799,6 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
> list_splice_tail(&splice, &fs_info->delalloc_roots);
> spin_unlock(&fs_info->delalloc_root_lock);
> }
> - mutex_unlock(&fs_info->delalloc_root_mutex);
The lock is useful and exists to make sure two tasks calling this
function wait for all dealloc to be flushed and ordered extents to
complete for all inodes from all roots.
The problem is similar to the one I pointed out for the next patch,
but perhaps a bit more subtle.
So after applying this patch:
1) Task A enters btrfs_start_delalloc_roots() and takes the spinlock
fs_info->delalloc_root_lock;
2) Task A splices the fs_info->delalloc_roots list into the local
splice list. The list has two roots, root X and root Y;
3) Task A enters the first iteration of the while loop, extracts root
X from the splice list, grabs a reference for it, adds it back to the
fs_info->delalloc_roots list and unlocks fs_info->delalloc_root_lock;
4) Task B enters btrfs_start_delalloc_roots(), takes the
fs_info->delalloc_root_lock lock;
5) Task B splices the fs_info->delalloc_roots list into its local
splice list. The list only contains root X, because root Y is held only
in the splice list of task A.
As a consequence, task B will never wait for writeback and ordered
extents from inodes of root Y to complete, which breaks the expected
semantics of btrfs_start_delalloc_roots().
Thanks.
> return ret;
> }
>
> --
> 2.49.0
>
>
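The interleaving described in the reply above can be replayed with a small
userspace model. This is only a sketch under stated assumptions: struct
root_list, splice_init() and move_first_to_tail() are hypothetical stand-ins
for the kernel's list_head, list_splice_init() and list_move_tail(), roots are
reduced to single characters, and the two tasks are replayed sequentially in
the order of steps 1-5 rather than run as real threads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_ROOTS 8

/* Toy stand-in for a kernel list of roots; each entry is a root name. */
struct root_list {
	char items[MAX_ROOTS];
	size_t len;
};

/* Like list_splice_init(): append all of src to dst and leave src empty. */
static void splice_init(struct root_list *src, struct root_list *dst)
{
	assert(dst->len + src->len <= MAX_ROOTS);
	memcpy(dst->items + dst->len, src->items, src->len);
	dst->len += src->len;
	src->len = 0;
}

/* Like list_move_tail() on the first entry: pop src's head, append to dst. */
static char move_first_to_tail(struct root_list *src, struct root_list *dst)
{
	char root;

	assert(src->len > 0 && dst->len < MAX_ROOTS);
	root = src->items[0];
	memmove(src->items, src->items + 1, src->len - 1);
	src->len--;
	dst->items[dst->len++] = root;
	return root;
}

static bool contains(const struct root_list *l, char root)
{
	return memchr(l->items, root, l->len) != NULL;
}

/*
 * Replays steps 1-5 without the outer mutex and reports whether task B's
 * local splice list ends up containing root 'Y'.
 */
static bool task_b_sees_root_y(void)
{
	struct root_list delalloc_roots = { .items = { 'X', 'Y' }, .len = 2 };
	struct root_list splice_a = { .len = 0 };
	struct root_list splice_b = { .len = 0 };

	/* Steps 1-2: task A takes the spinlock and splices the global list. */
	splice_init(&delalloc_roots, &splice_a);
	/* Step 3: task A moves root X back to the global list, drops the lock. */
	move_first_to_tail(&splice_a, &delalloc_roots);
	/* Steps 4-5: task B takes the spinlock and splices what is left. */
	splice_init(&delalloc_roots, &splice_b);

	assert(contains(&splice_b, 'X'));	/* B will flush root X ... */
	assert(contains(&splice_a, 'Y'));	/* ... but Y is only on A's list */
	return contains(&splice_b, 'Y');
}
```

In this model task_b_sees_root_y() returns false: task B only ever sees root
X. With the outer mutex held across the whole function, task B could not reach
its splice until task A had finished with both roots.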
* [PATCH RFC 6/9] btrfs: remove btrfs_root's delalloc_mutex
2025-06-27 9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
` (4 preceding siblings ...)
2025-06-27 9:19 ` [PATCH RFC 5/9] btrfs: remove delalloc_root_mutex Johannes Thumshirn
@ 2025-06-27 9:19 ` Johannes Thumshirn
2025-06-27 12:30 ` Filipe Manana
2025-06-27 9:19 ` [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level Johannes Thumshirn
` (2 subsequent siblings)
8 siblings, 1 reply; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-27 9:19 UTC (permalink / raw)
To: linux-btrfs
Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
Boris Burkov, Filipe Manana, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
When running metadata space reclaim under high I/O concurrency, we observe
hung tasks caused by lock contention on `btrfs_root::delalloc_mutex`. For
example:
INFO: task kworker/u132:1:2177 blocked for more than 122 seconds.
Not tainted 6.16.0-rc3+ #1246
Workqueue: events_unbound btrfs_preempt_reclaim_metadata_space
Call Trace:
__schedule+0x2f9/0x7b0
schedule+0x27/0x80
__mutex_lock.constprop.0+0x4af/0x890
start_delalloc_inodes+0x6e/0x400
btrfs_start_delalloc_roots+0x162/0x270
shrink_delalloc+0x10c/0x2d0
flush_space+0x202/0x280
btrfs_preempt_reclaim_metadata_space+0xe7/0x340
The `delalloc_mutex` serializes delalloc flushing per root but is no
longer necessary. All critical paths (inode flushing, extent writing,
metadata updates) are already synchronized using finer-grained locking at
the inode, page, and tree levels. In particular, concurrent flushers
coordinate via inode locking, and no shared state requires global
serialization across the root.
Removing this mutex avoids unnecessary blocking in reclaim paths and
improves responsiveness under pressure, especially on systems with many
flushers or multi-queue SSDs/ZNS devices.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/ctree.h | 1 -
fs/btrfs/disk-io.c | 1 -
fs/btrfs/inode.c | 2 --
3 files changed, 4 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8a54a0b6e502..06c7742a5de0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -238,7 +238,6 @@ struct btrfs_root {
spinlock_t root_item_lock;
refcount_t refs;
- struct mutex delalloc_mutex;
spinlock_t delalloc_lock;
/*
* all of the inodes that have delalloc bytes. It is possible for
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 929f39886b0e..e39f5e893312 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -678,7 +678,6 @@ static struct btrfs_root *btrfs_alloc_root(struct btrfs_fs_info *fs_info,
mutex_init(&root->objectid_mutex);
mutex_init(&root->log_mutex);
mutex_init(&root->ordered_extent_mutex);
- mutex_init(&root->delalloc_mutex);
init_waitqueue_head(&root->qgroup_flush_wait);
init_waitqueue_head(&root->log_writer_wait);
init_waitqueue_head(&root->log_commit_wait[0]);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d68f4ef61c43..b9c52b9ea912 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8673,7 +8673,6 @@ static int start_delalloc_inodes(struct btrfs_root *root,
int ret = 0;
bool full_flush = wbc->nr_to_write == LONG_MAX;
- mutex_lock(&root->delalloc_mutex);
spin_lock(&root->delalloc_lock);
list_splice_init(&root->delalloc_inodes, &splice);
while (!list_empty(&splice)) {
@@ -8730,7 +8729,6 @@ static int start_delalloc_inodes(struct btrfs_root *root,
list_splice_tail(&splice, &root->delalloc_inodes);
spin_unlock(&root->delalloc_lock);
}
- mutex_unlock(&root->delalloc_mutex);
return ret;
}
--
2.49.0
* Re: [PATCH RFC 6/9] btrfs: remove btrfs_root's delalloc_mutex
2025-06-27 9:19 ` [PATCH RFC 6/9] btrfs: remove btrfs_root's delalloc_mutex Johannes Thumshirn
@ 2025-06-27 12:30 ` Filipe Manana
0 siblings, 0 replies; 25+ messages in thread
From: Filipe Manana @ 2025-06-27 12:30 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn
On Fri, Jun 27, 2025 at 10:23 AM Johannes Thumshirn <jth@kernel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> When running metadata space reclaim under high I/O concurrency, we observe
> hung tasks caused by lock contention on `btrfs_root::delalloc_mutex`. For
> example:
>
> INFO: task kworker/u132:1:2177 blocked for more than 122 seconds.
> Not tainted 6.16.0-rc3+ #1246
> Workqueue: events_unbound btrfs_preempt_reclaim_metadata_space
> Call Trace:
> __schedule+0x2f9/0x7b0
> schedule+0x27/0x80
> __mutex_lock.constprop.0+0x4af/0x890
> start_delalloc_inodes+0x6e/0x400
> btrfs_start_delalloc_roots+0x162/0x270
> shrink_delalloc+0x10c/0x2d0
> flush_space+0x202/0x280
> btrfs_preempt_reclaim_metadata_space+0xe7/0x340
>
> The `delalloc_mutex` serializes delalloc flushing per root but is no
> longer necessary. All critical paths (inode flushing, extent writing,
> metadata updates) are already synchronized using finer-grained locking at
> the inode, page, and tree levels. In particular, concurrent flushers
> coordinate via inode locking, and no shared state requires global
> serialization across the root.
Well that's not enough...
The mutex is there to ensure that if two (or more) tasks call
start_delalloc_inodes(), they only return after all IO is flushed and
ordered extents completed.
Without the mutex we break the semantics.
For example, after removing the mutex as you propose here, we have:
1) Task A enters start_delalloc_inodes() and takes the spinlock
root->delalloc_lock;
2) Task B also enters start_delalloc_inodes() and spins on
root->delalloc_lock because it's currently held by task A;
3) Task A extracts inode X from the local splice list, grabs a
reference for it and unlocks root->delalloc_lock;
4) Task B takes the root->delalloc_lock and sees an empty
root->delalloc_inodes list, so it will never wait for all the
writeback and ordered extents from inode X to complete, and return
before they complete.
You see, inode locks, page locks, etc, don't even enter the equation
here and are irrelevant.
>
> Removing this mutex avoids unnecessary blocking in reclaim paths and
> improves responsiveness under pressure, especially on systems with many
> flushers or multi-queue SSDs/ZNS devices.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Duplicated SoB tags.
Thanks.
> ---
> fs/btrfs/ctree.h | 1 -
> fs/btrfs/disk-io.c | 1 -
> fs/btrfs/inode.c | 2 --
> 3 files changed, 4 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 8a54a0b6e502..06c7742a5de0 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -238,7 +238,6 @@ struct btrfs_root {
> spinlock_t root_item_lock;
> refcount_t refs;
>
> - struct mutex delalloc_mutex;
> spinlock_t delalloc_lock;
> /*
> * all of the inodes that have delalloc bytes. It is possible for
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 929f39886b0e..e39f5e893312 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -678,7 +678,6 @@ static struct btrfs_root *btrfs_alloc_root(struct btrfs_fs_info *fs_info,
> mutex_init(&root->objectid_mutex);
> mutex_init(&root->log_mutex);
> mutex_init(&root->ordered_extent_mutex);
> - mutex_init(&root->delalloc_mutex);
> init_waitqueue_head(&root->qgroup_flush_wait);
> init_waitqueue_head(&root->log_writer_wait);
> init_waitqueue_head(&root->log_commit_wait[0]);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d68f4ef61c43..b9c52b9ea912 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8673,7 +8673,6 @@ static int start_delalloc_inodes(struct btrfs_root *root,
> int ret = 0;
> bool full_flush = wbc->nr_to_write == LONG_MAX;
>
> - mutex_lock(&root->delalloc_mutex);
> spin_lock(&root->delalloc_lock);
> list_splice_init(&root->delalloc_inodes, &splice);
> while (!list_empty(&splice)) {
> @@ -8730,7 +8729,6 @@ static int start_delalloc_inodes(struct btrfs_root *root,
> list_splice_tail(&splice, &root->delalloc_inodes);
> spin_unlock(&root->delalloc_lock);
> }
> - mutex_unlock(&root->delalloc_mutex);
> return ret;
> }
>
> --
> 2.49.0
>
>
* [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level
2025-06-27 9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
` (5 preceding siblings ...)
2025-06-27 9:19 ` [PATCH RFC 6/9] btrfs: remove btrfs_root's delalloc_mutex Johannes Thumshirn
@ 2025-06-27 9:19 ` Johannes Thumshirn
2025-06-27 11:35 ` Christoph Hellwig
2025-06-27 23:24 ` kernel test robot
2025-06-27 9:19 ` [PATCH RFC 8/9] btrfs: lower log level of relocation messages Johannes Thumshirn
2025-06-27 9:19 ` [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure Johannes Thumshirn
8 siblings, 2 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-27 9:19 UTC (permalink / raw)
To: linux-btrfs
Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
Boris Burkov, Filipe Manana, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
When running a system with automatic reclaim/balancing enabled, there are
lots of info level messages like the following in the kernel log:
BTRFS info (device nvme2n1): reclaiming chunk 1138166333440 with 10% used 0% reserved 89% unusable
Lower the log level to debug for these messages.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/block-group.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 00e567a4cd16..5e6aead653c4 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1963,7 +1963,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
reserved = bg->reserved;
spin_unlock(&bg->lock);
- btrfs_info(fs_info,
+ btrfs_debug(fs_info,
"reclaiming chunk %llu with %llu%% used %llu%% reserved %llu%% unusable",
bg->start,
div64_u64(used * 100, bg->length),
--
2.49.0
* Re: [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level
2025-06-27 9:19 ` [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level Johannes Thumshirn
@ 2025-06-27 11:35 ` Christoph Hellwig
2025-06-27 23:24 ` kernel test robot
1 sibling, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2025-06-27 11:35 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn
On Fri, Jun 27, 2025 at 11:19:12AM +0200, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> When running a system with automatic reclaim/balancing enabled, there are
> lots of info level messages like the following in the kernel log:
>
> BTRFS info (device nvme2n1): reclaiming chunk 1138166333440 with 10% used 0% reserved 89% unusable
>
> Lower the log level to debug for these messages.
Wouldn't a trace event be an even better choice for this?
* Re: [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level
2025-06-27 9:19 ` [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level Johannes Thumshirn
2025-06-27 11:35 ` Christoph Hellwig
@ 2025-06-27 23:24 ` kernel test robot
1 sibling, 0 replies; 25+ messages in thread
From: kernel test robot @ 2025-06-27 23:24 UTC (permalink / raw)
To: Johannes Thumshirn; +Cc: llvm, oe-kbuild-all
Hi Johannes,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:
[auto build test WARNING on next-20250626]
[cannot apply to kdave/for-next v6.16-rc3 v6.16-rc2 v6.16-rc1 linus/master v6.16-rc3]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Johannes-Thumshirn/btrfs-zoned-do-not-select-metadata-BG-as-finish-target/20250627-172551
base: next-20250626
patch link: https://lore.kernel.org/r/20250627091914.100715-8-jth%40kernel.org
patch subject: [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level
config: x86_64-buildonly-randconfig-002-20250628 (https://download.01.org/0day-ci/archive/20250628/202506280733.ut2JoUtS-lkp@intel.com/config)
compiler: clang version 20.1.7 (https://github.com/llvm/llvm-project 6146a88f60492b520a36f8f8f3231e15f3cc6082)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250628/202506280733.ut2JoUtS-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506280733.ut2JoUtS-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> fs/btrfs/block-group.c:1846:7: warning: variable 'zone_unusable' set but not used [-Wunused-but-set-variable]
1846 | u64 zone_unusable;
| ^
1 warning generated.
vim +/zone_unusable +1846 fs/btrfs/block-group.c
81531225e5bd50c Boris Burkov 2022-10-13 1803
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1804 void btrfs_reclaim_bgs_work(struct work_struct *work)
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1805 {
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1806 struct btrfs_fs_info *fs_info =
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1807 container_of(work, struct btrfs_fs_info, reclaim_bgs_work);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1808 struct btrfs_block_group *bg;
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1809 struct btrfs_space_info *space_info;
4eb4e85c4f81849 Boris Burkov 2024-06-07 1810 LIST_HEAD(retry_list);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1811
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1812 if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1813 return;
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1814
2f12741f81af638 Josef Bacik 2022-07-15 1815 if (btrfs_fs_closing(fs_info))
2f12741f81af638 Josef Bacik 2022-07-15 1816 return;
2f12741f81af638 Josef Bacik 2022-07-15 1817
3687fcb0752ac9c Johannes Thumshirn 2022-03-29 1818 if (!btrfs_should_reclaim(fs_info))
3687fcb0752ac9c Johannes Thumshirn 2022-03-29 1819 return;
3687fcb0752ac9c Johannes Thumshirn 2022-03-29 1820
ca5e4ea0beaec8b Naohiro Aota 2022-02-18 1821 sb_start_write(fs_info->sb);
ca5e4ea0beaec8b Naohiro Aota 2022-02-18 1822
ca5e4ea0beaec8b Naohiro Aota 2022-02-18 1823 if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE)) {
ca5e4ea0beaec8b Naohiro Aota 2022-02-18 1824 sb_end_write(fs_info->sb);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1825 return;
ca5e4ea0beaec8b Naohiro Aota 2022-02-18 1826 }
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1827
9cc0b837e14ae91 Johannes Thumshirn 2021-07-06 1828 /*
9cc0b837e14ae91 Johannes Thumshirn 2021-07-06 1829 * Long running balances can keep us blocked here for eternity, so
9cc0b837e14ae91 Johannes Thumshirn 2021-07-06 1830 * simply skip reclaim if we're unable to get the mutex.
9cc0b837e14ae91 Johannes Thumshirn 2021-07-06 1831 */
9cc0b837e14ae91 Johannes Thumshirn 2021-07-06 1832 if (!mutex_trylock(&fs_info->reclaim_bgs_lock)) {
9cc0b837e14ae91 Johannes Thumshirn 2021-07-06 1833 btrfs_exclop_finish(fs_info);
ca5e4ea0beaec8b Naohiro Aota 2022-02-18 1834 sb_end_write(fs_info->sb);
9cc0b837e14ae91 Johannes Thumshirn 2021-07-06 1835 return;
9cc0b837e14ae91 Johannes Thumshirn 2021-07-06 1836 }
9cc0b837e14ae91 Johannes Thumshirn 2021-07-06 1837
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1838 spin_lock(&fs_info->unused_bgs_lock);
2ca0ec770c62b32 Johannes Thumshirn 2021-10-14 1839 /*
2ca0ec770c62b32 Johannes Thumshirn 2021-10-14 1840 * Sort happens under lock because we can't simply splice it and sort.
2ca0ec770c62b32 Johannes Thumshirn 2021-10-14 1841 * The block groups might still be in use and reachable via bg_list,
2ca0ec770c62b32 Johannes Thumshirn 2021-10-14 1842 * and their presence in the reclaim_bgs list must be preserved.
2ca0ec770c62b32 Johannes Thumshirn 2021-10-14 1843 */
2ca0ec770c62b32 Johannes Thumshirn 2021-10-14 1844 list_sort(NULL, &fs_info->reclaim_bgs, reclaim_bgs_cmp);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1845 while (!list_empty(&fs_info->reclaim_bgs)) {
5f93e776c6734ce Johannes Thumshirn 2021-06-29 @1846 u64 zone_unusable;
ba5d06440cae63e Filipe Manana 2025-02-24 1847 u64 used;
620768704326c9a Filipe Manana 2025-02-24 1848 u64 reserved;
1cea5cf0e664290 Filipe Manana 2021-06-21 1849 int ret = 0;
1cea5cf0e664290 Filipe Manana 2021-06-21 1850
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1851 bg = list_first_entry(&fs_info->reclaim_bgs,
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1852 struct btrfs_block_group,
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1853 bg_list);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1854 list_del_init(&bg->bg_list);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1855
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1856 space_info = bg->space_info;
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1857 spin_unlock(&fs_info->unused_bgs_lock);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1858
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1859 /* Don't race with allocators so take the groups_sem */
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1860 down_write(&space_info->groups_sem);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1861
f5ff64ccf7bb727 Boris Burkov 2024-02-02 1862 spin_lock(&space_info->lock);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1863 spin_lock(&bg->lock);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1864 if (bg->reserved || bg->pinned || bg->ro) {
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1865 /*
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1866 * We want to bail if we made new allocations or have
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1867 * outstanding allocations in this block group. We do
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1868 * the ro check in case balance is currently acting on
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1869 * this block group.
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1870 */
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1871 spin_unlock(&bg->lock);
f5ff64ccf7bb727 Boris Burkov 2024-02-02 1872 spin_unlock(&space_info->lock);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1873 up_write(&space_info->groups_sem);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1874 goto next;
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1875 }
cc4804bfd6392bc Boris Burkov 2022-10-13 1876 if (bg->used == 0) {
cc4804bfd6392bc Boris Burkov 2022-10-13 1877 /*
cc4804bfd6392bc Boris Burkov 2022-10-13 1878 * It is possible that we trigger relocation on a block
cc4804bfd6392bc Boris Burkov 2022-10-13 1879 * group as its extents are deleted and it first goes
cc4804bfd6392bc Boris Burkov 2022-10-13 1880 * below the threshold, then shortly after goes empty.
cc4804bfd6392bc Boris Burkov 2022-10-13 1881 *
cc4804bfd6392bc Boris Burkov 2022-10-13 1882 * In this case, relocating it does delete it, but has
cc4804bfd6392bc Boris Burkov 2022-10-13 1883 * some overhead in relocation specific metadata, looking
cc4804bfd6392bc Boris Burkov 2022-10-13 1884 * for the non-existent extents and running some extra
cc4804bfd6392bc Boris Burkov 2022-10-13 1885 * transactions, which we can avoid by using one of the
cc4804bfd6392bc Boris Burkov 2022-10-13 1886 * other mechanisms for dealing with empty block groups.
cc4804bfd6392bc Boris Burkov 2022-10-13 1887 */
cc4804bfd6392bc Boris Burkov 2022-10-13 1888 if (!btrfs_test_opt(fs_info, DISCARD_ASYNC))
cc4804bfd6392bc Boris Burkov 2022-10-13 1889 btrfs_mark_bg_unused(bg);
cc4804bfd6392bc Boris Burkov 2022-10-13 1890 spin_unlock(&bg->lock);
f5ff64ccf7bb727 Boris Burkov 2024-02-02 1891 spin_unlock(&space_info->lock);
cc4804bfd6392bc Boris Burkov 2022-10-13 1892 up_write(&space_info->groups_sem);
cc4804bfd6392bc Boris Burkov 2022-10-13 1893 goto next;
81531225e5bd50c Boris Burkov 2022-10-13 1894
81531225e5bd50c Boris Burkov 2022-10-13 1895 }
81531225e5bd50c Boris Burkov 2022-10-13 1896 /*
81531225e5bd50c Boris Burkov 2022-10-13 1897 * The block group might no longer meet the reclaim condition by
81531225e5bd50c Boris Burkov 2022-10-13 1898 * the time we get around to reclaiming it, so to avoid
81531225e5bd50c Boris Burkov 2022-10-13 1899 * reclaiming overly full block_groups, skip reclaiming them.
81531225e5bd50c Boris Burkov 2022-10-13 1900 *
81531225e5bd50c Boris Burkov 2022-10-13 1901 * Since the decision making process also depends on the amount
81531225e5bd50c Boris Burkov 2022-10-13 1902 * being freed, pass in a fake giant value to skip that extra
81531225e5bd50c Boris Burkov 2022-10-13 1903 * check, which is more meaningful when adding to the list in
81531225e5bd50c Boris Burkov 2022-10-13 1904 * the first place.
81531225e5bd50c Boris Burkov 2022-10-13 1905 */
81531225e5bd50c Boris Burkov 2022-10-13 1906 if (!should_reclaim_block_group(bg, bg->length)) {
81531225e5bd50c Boris Burkov 2022-10-13 1907 spin_unlock(&bg->lock);
f5ff64ccf7bb727 Boris Burkov 2024-02-02 1908 spin_unlock(&space_info->lock);
81531225e5bd50c Boris Burkov 2022-10-13 1909 up_write(&space_info->groups_sem);
81531225e5bd50c Boris Burkov 2022-10-13 1910 goto next;
cc4804bfd6392bc Boris Burkov 2022-10-13 1911 }
1283b8c125a83bf Filipe Manana 2025-02-21 1912
1283b8c125a83bf Filipe Manana 2025-02-21 1913 /*
1283b8c125a83bf Filipe Manana 2025-02-21 1914 * Cache the zone_unusable value before turning the block group
1283b8c125a83bf Filipe Manana 2025-02-21 1915 * to read only. As soon as the block group is read only it's
1283b8c125a83bf Filipe Manana 2025-02-21 1916 * zone_unusable value gets moved to the block group's read-only
1283b8c125a83bf Filipe Manana 2025-02-21 1917 * bytes and isn't available for calculations anymore. We also
1283b8c125a83bf Filipe Manana 2025-02-21 1918 * cache it before unlocking the block group, to prevent races
1283b8c125a83bf Filipe Manana 2025-02-21 1919 * (reports from KCSAN and such tools) with tasks updating it.
1283b8c125a83bf Filipe Manana 2025-02-21 1920 */
1283b8c125a83bf Filipe Manana 2025-02-21 1921 zone_unusable = bg->zone_unusable;
1283b8c125a83bf Filipe Manana 2025-02-21 1922
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1923 spin_unlock(&bg->lock);
f5ff64ccf7bb727 Boris Burkov 2024-02-02 1924 spin_unlock(&space_info->lock);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1925
93463ff7b54626f Naohiro Aota 2023-06-06 1926 /*
93463ff7b54626f Naohiro Aota 2023-06-06 1927 * Get out fast, in case we're read-only or unmounting the
93463ff7b54626f Naohiro Aota 2023-06-06 1928 * filesystem. It is OK to drop block groups from the list even
93463ff7b54626f Naohiro Aota 2023-06-06 1929 * for the read-only case. As we did sb_start_write(),
93463ff7b54626f Naohiro Aota 2023-06-06 1930 * "mount -o remount,ro" won't happen and read-only filesystem
93463ff7b54626f Naohiro Aota 2023-06-06 1931 * means it is forced read-only due to a fatal error. So, it
93463ff7b54626f Naohiro Aota 2023-06-06 1932 * never gets back to read-write to let us reclaim again.
93463ff7b54626f Naohiro Aota 2023-06-06 1933 */
93463ff7b54626f Naohiro Aota 2023-06-06 1934 if (btrfs_need_cleaner_sleep(fs_info)) {
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1935 up_write(&space_info->groups_sem);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1936 goto next;
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1937 }
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1938
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1939 ret = inc_block_group_ro(bg, 0);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1940 up_write(&space_info->groups_sem);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1941 if (ret < 0)
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1942 goto next;
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1943
ba5d06440cae63e Filipe Manana 2025-02-24 1944 /*
620768704326c9a Filipe Manana 2025-02-24 1945 * The amount of bytes reclaimed corresponds to the sum of the
620768704326c9a Filipe Manana 2025-02-24 1946 * "used" and "reserved" counters. We have set the block group
620768704326c9a Filipe Manana 2025-02-24 1947 * to RO above, which prevents reservations from happening but
620768704326c9a Filipe Manana 2025-02-24 1948 * we may have existing reservations for which allocation has
620768704326c9a Filipe Manana 2025-02-24 1949 * not yet been done - btrfs_update_block_group() was not yet
620768704326c9a Filipe Manana 2025-02-24 1950 * called, which is where we will transfer a reserved extent's
620768704326c9a Filipe Manana 2025-02-24 1951 * size from the "reserved" counter to the "used" counter - this
620768704326c9a Filipe Manana 2025-02-24 1952 * happens when running delayed references. When we relocate the
620768704326c9a Filipe Manana 2025-02-24 1953 * chunk below, relocation first flushes dellaloc, waits for
620768704326c9a Filipe Manana 2025-02-24 1954 * ordered extent completion (which is where we create delayed
620768704326c9a Filipe Manana 2025-02-24 1955 * references for data extents) and commits the current
620768704326c9a Filipe Manana 2025-02-24 1956 * transaction (which runs delayed references), and only after
620768704326c9a Filipe Manana 2025-02-24 1957 * it does the actual work to move extents out of the block
620768704326c9a Filipe Manana 2025-02-24 1958 * group. So the reported amount of reclaimed bytes is
620768704326c9a Filipe Manana 2025-02-24 1959 * effectively the sum of the 'used' and 'reserved' counters.
ba5d06440cae63e Filipe Manana 2025-02-24 1960 */
ba5d06440cae63e Filipe Manana 2025-02-24 1961 spin_lock(&bg->lock);
ba5d06440cae63e Filipe Manana 2025-02-24 1962 used = bg->used;
620768704326c9a Filipe Manana 2025-02-24 1963 reserved = bg->reserved;
ba5d06440cae63e Filipe Manana 2025-02-24 1964 spin_unlock(&bg->lock);
ba5d06440cae63e Filipe Manana 2025-02-24 1965
3ba0572b72b1363 Johannes Thumshirn 2025-06-27 1966 btrfs_debug(fs_info,
620768704326c9a Filipe Manana 2025-02-24 1967 "reclaiming chunk %llu with %llu%% used %llu%% reserved %llu%% unusable",
95cd356ca23c380 Johannes Thumshirn 2023-02-21 1968 bg->start,
ba5d06440cae63e Filipe Manana 2025-02-24 1969 div64_u64(used * 100, bg->length),
620768704326c9a Filipe Manana 2025-02-24 1970 div64_u64(reserved * 100, bg->length),
5f93e776c6734ce Johannes Thumshirn 2021-06-29 1971 div64_u64(zone_unusable * 100, bg->length));
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1972 trace_btrfs_reclaim_block_group(bg);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1973 ret = btrfs_relocate_chunk(fs_info, bg->start);
74944c873602a3e Josef Bacik 2022-07-25 1974 if (ret) {
74944c873602a3e Josef Bacik 2022-07-25 1975 btrfs_dec_block_group_ro(bg);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1976 btrfs_err(fs_info, "error relocating chunk %llu",
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1977 bg->start);
ba5d06440cae63e Filipe Manana 2025-02-24 1978 used = 0;
620768704326c9a Filipe Manana 2025-02-24 1979 reserved = 0;
243192b6764990e Boris Burkov 2024-01-25 1980 spin_lock(&space_info->lock);
243192b6764990e Boris Burkov 2024-01-25 1981 space_info->reclaim_errors++;
813d4c642251649 Boris Burkov 2024-02-14 1982 if (READ_ONCE(space_info->periodic_reclaim))
813d4c642251649 Boris Burkov 2024-02-14 1983 space_info->periodic_reclaim_ready = false;
243192b6764990e Boris Burkov 2024-01-25 1984 spin_unlock(&space_info->lock);
74944c873602a3e Josef Bacik 2022-07-25 1985 }
243192b6764990e Boris Burkov 2024-01-25 1986 spin_lock(&space_info->lock);
243192b6764990e Boris Burkov 2024-01-25 1987 space_info->reclaim_count++;
ba5d06440cae63e Filipe Manana 2025-02-24 1988 space_info->reclaim_bytes += used;
620768704326c9a Filipe Manana 2025-02-24 1989 space_info->reclaim_bytes += reserved;
243192b6764990e Boris Burkov 2024-01-25 1990 spin_unlock(&space_info->lock);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1991
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 1992 next:
0497dfba98c00ed Boris Burkov 2025-03-05 1993 if (ret && !READ_ONCE(space_info->periodic_reclaim))
0497dfba98c00ed Boris Burkov 2025-03-05 1994 btrfs_link_bg_list(bg, &retry_list);
1cea5cf0e664290 Filipe Manana 2021-06-21 1995 btrfs_put_block_group(bg);
3ed01616bad6c7e Naohiro Aota 2023-06-06 1996
3ed01616bad6c7e Naohiro Aota 2023-06-06 1997 mutex_unlock(&fs_info->reclaim_bgs_lock);
3ed01616bad6c7e Naohiro Aota 2023-06-06 1998 /*
3ed01616bad6c7e Naohiro Aota 2023-06-06 1999 * Reclaiming all the block groups in the list can take really
3ed01616bad6c7e Naohiro Aota 2023-06-06 2000 * long. Prioritize cleaning up unused block groups.
3ed01616bad6c7e Naohiro Aota 2023-06-06 2001 */
3ed01616bad6c7e Naohiro Aota 2023-06-06 2002 btrfs_delete_unused_bgs(fs_info);
3ed01616bad6c7e Naohiro Aota 2023-06-06 2003 /*
3ed01616bad6c7e Naohiro Aota 2023-06-06 2004 * If we are interrupted by a balance, we can just bail out. The
3ed01616bad6c7e Naohiro Aota 2023-06-06 2005 * cleaner thread restart again if necessary.
3ed01616bad6c7e Naohiro Aota 2023-06-06 2006 */
3ed01616bad6c7e Naohiro Aota 2023-06-06 2007 if (!mutex_trylock(&fs_info->reclaim_bgs_lock))
3ed01616bad6c7e Naohiro Aota 2023-06-06 2008 goto end;
d96b34248c2f4ea Filipe Manana 2021-11-22 2009 spin_lock(&fs_info->unused_bgs_lock);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 2010 }
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 2011 spin_unlock(&fs_info->unused_bgs_lock);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 2012 mutex_unlock(&fs_info->reclaim_bgs_lock);
3ed01616bad6c7e Naohiro Aota 2023-06-06 2013 end:
4eb4e85c4f81849 Boris Burkov 2024-06-07 2014 spin_lock(&fs_info->unused_bgs_lock);
4eb4e85c4f81849 Boris Burkov 2024-06-07 2015 list_splice_tail(&retry_list, &fs_info->reclaim_bgs);
4eb4e85c4f81849 Boris Burkov 2024-06-07 2016 spin_unlock(&fs_info->unused_bgs_lock);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 2017 btrfs_exclop_finish(fs_info);
ca5e4ea0beaec8b Naohiro Aota 2022-02-18 2018 sb_end_write(fs_info->sb);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 2019 }
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19 2020
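In the listing above, zone_unusable is assigned at line 1921 and read only as
an argument of the btrfs_debug() call at lines 1966-1971. The report does not
show how btrfs_debug() expands in this randconfig; assuming it discards its
arguments there, the assignment is left without a reader, which would explain
the warning. One common way to avoid this class of warning is the pattern used
by the kernel's no_printk(): keep the call behind if (0), so the arguments
still count as used and are type-checked, but are never evaluated. The sketch
below is a userspace illustration of that pattern only; sketch_debug(),
percent() and reclaim_sketch() are hypothetical names, not btrfs code.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical macro, not the real btrfs_debug(): the call sits behind
 * if (0), so the compiler still sees the arguments as used and checks the
 * format string, but nothing is evaluated or printed at run time.
 */
#define sketch_debug(...)			\
	do {					\
		if (0)				\
			printf(__VA_ARGS__);	\
	} while (0)

/* Counts how often the logging argument is actually computed. */
static int evaluations;

static unsigned long long percent(unsigned long long part,
				  unsigned long long whole)
{
	evaluations++;
	return part * 100 / whole;
}

/* Mirrors the shape of the flagged code: zone_unusable feeds only the log. */
static int reclaim_sketch(unsigned long long unusable,
			  unsigned long long length)
{
	unsigned long long zone_unusable;

	zone_unusable = unusable;
	sketch_debug("reclaiming chunk with %llu%% unusable\n",
		     percent(zone_unusable, length));
	return evaluations;
}
```

Calling reclaim_sketch(89, 100) returns 0, since percent() is never invoked,
and zone_unusable still has a reader as far as the compiler is concerned.
Whether the actual fix belongs in the macro or in the caller is a decision for
the patch author.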
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* [PATCH RFC 8/9] btrfs: lower log level of relocation messages
2025-06-27 9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
` (6 preceding siblings ...)
2025-06-27 9:19 ` [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level Johannes Thumshirn
@ 2025-06-27 9:19 ` Johannes Thumshirn
2025-06-27 11:36 ` Christoph Hellwig
` (2 more replies)
2025-06-27 9:19 ` [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure Johannes Thumshirn
8 siblings, 3 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-27 9:19 UTC (permalink / raw)
To: linux-btrfs
Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
Boris Burkov, Filipe Manana, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
When running a system with automatic reclaim/balancing enabled, there are
lots of info level messages like the following in the kernel log:
BTRFS info (device nvme2n1): relocating block group 629212708864 flags data
BTRFS info (device nvme2n1): found 510 extents, stage: move data extents
Lower the log level to debug for these messages.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/relocation.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index d7ec1d72821c..46b9236708ed 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3892,7 +3892,7 @@ static void describe_relocation(struct btrfs_block_group *block_group)
btrfs_describe_block_groups(block_group->flags, buf, sizeof(buf));
- btrfs_info(block_group->fs_info, "relocating block group %llu flags %s",
+ btrfs_debug(block_group->fs_info, "relocating block group %llu flags %s",
block_group->start, buf);
}
@@ -4044,7 +4044,7 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
if (rc->extents_found == 0)
break;
- btrfs_info(fs_info, "found %llu extents, stage: %s",
+ btrfs_debug(fs_info, "found %llu extents, stage: %s",
rc->extents_found, stage_to_string(finishes_stage));
}
--
2.49.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 8/9] btrfs: lower log level of relocation messages
2025-06-27 9:19 ` [PATCH RFC 8/9] btrfs: lower log level of relocation messages Johannes Thumshirn
@ 2025-06-27 11:36 ` Christoph Hellwig
2025-06-27 23:44 ` kernel test robot
2025-06-30 17:12 ` David Sterba
2 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2025-06-27 11:36 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn
On Fri, Jun 27, 2025 at 11:19:13AM +0200, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> When running a system with automatic reclaim/balancing enabled, there are
> lots of info level messages like the following in the kernel log:
>
> BTRFS info (device nvme2n1): relocating block group 629212708864 flags data
> BTRFS info (device nvme2n1): found 510 extents, stage: move data extents
>
> Lower the log level to debug for these messages.
Same here. This is useful debug information, but I'd expect something
like this to be a trace event that is enabled as needed, and not
something going into the kernel log.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 8/9] btrfs: lower log level of relocation messages
2025-06-27 9:19 ` [PATCH RFC 8/9] btrfs: lower log level of relocation messages Johannes Thumshirn
2025-06-27 11:36 ` Christoph Hellwig
@ 2025-06-27 23:44 ` kernel test robot
2025-06-30 17:12 ` David Sterba
2 siblings, 0 replies; 25+ messages in thread
From: kernel test robot @ 2025-06-27 23:44 UTC (permalink / raw)
To: Johannes Thumshirn; +Cc: oe-kbuild-all
Hi Johannes,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:
[auto build test WARNING on next-20250626]
[cannot apply to kdave/for-next v6.16-rc3 v6.16-rc2 v6.16-rc1 linus/master v6.16-rc3]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Johannes-Thumshirn/btrfs-zoned-do-not-select-metadata-BG-as-finish-target/20250627-172551
base: next-20250626
patch link: https://lore.kernel.org/r/20250627091914.100715-9-jth%40kernel.org
patch subject: [PATCH RFC 8/9] btrfs: lower log level of relocation messages
config: x86_64-buildonly-randconfig-003-20250628 (https://download.01.org/0day-ci/archive/20250628/202506280720.2zYUJWXx-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250628/202506280720.2zYUJWXx-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506280720.2zYUJWXx-lkp@intel.com/
All warnings (new ones prefixed by >>):
fs/btrfs/relocation.c: In function 'btrfs_relocate_block_group':
fs/btrfs/relocation.c:4013:34: warning: variable 'finishes_stage' set but not used [-Wunused-but-set-variable]
4013 | enum reloc_stage finishes_stage;
| ^~~~~~~~~~~~~~
fs/btrfs/relocation.c: At top level:
>> fs/btrfs/relocation.c:3899:20: warning: 'stage_to_string' defined but not used [-Wunused-function]
3899 | static const char *stage_to_string(enum reloc_stage stage)
| ^~~~~~~~~~~~~~~
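The two warnings above are what one would expect if the debug print discards its arguments entirely in this configuration; the report does not show the macro, so treat that as an assumption. The following userspace sketch uses a hypothetical `toy_debug()` macro (not a btrfs or kernel API) to show the common `if (0)` idiom, which keeps the arguments type-checked and referenced while emitting nothing, so neither the helper nor the local variable ends up unused.

```c
#include <assert.h>
#include <stdio.h>

enum reloc_stage { MOVE_DATA_EXTENTS, UPDATE_DATA_PTRS };

/* Same shape as the helper quoted in the report. */
static const char *stage_to_string(enum reloc_stage stage)
{
	if (stage == MOVE_DATA_EXTENTS)
		return "move data extents";
	if (stage == UPDATE_DATA_PTRS)
		return "update data pointers";
	return "unknown";
}

/*
 * Hypothetical stand-in for a debug print. Without TOY_DEBUG the call is
 * dead code, but the compiler still sees every argument, so helpers that
 * are only used for debug output do not trigger -Wunused-function.
 */
#ifdef TOY_DEBUG
#define toy_debug(...) printf(__VA_ARGS__)
#else
#define toy_debug(...) do { if (0) printf(__VA_ARGS__); } while (0)
#endif

static void report_stage(unsigned long long extents_found,
			 enum reloc_stage stage)
{
	enum reloc_stage finishes_stage = stage;

	/* The caller only reports passes that found extents. */
	assert(extents_found > 0);
	toy_debug("found %llu extents, stage: %s\n",
		  extents_found, stage_to_string(finishes_stage));
}
```

Calling `report_stage(510, MOVE_DATA_EXTENTS)` prints nothing unless the file is built with `-DTOY_DEBUG`, and builds cleanly either way.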
vim +/stage_to_string +3899 fs/btrfs/relocation.c
ebce0e01b930bf Adam Borowski 2016-11-14 3898
8daf07cf2b7919 David Sterba 2023-09-22 @3899 static const char *stage_to_string(enum reloc_stage stage)
430640e31649be Qu Wenruo 2019-11-29 3900 {
430640e31649be Qu Wenruo 2019-11-29 3901 if (stage == MOVE_DATA_EXTENTS)
430640e31649be Qu Wenruo 2019-11-29 3902 return "move data extents";
430640e31649be Qu Wenruo 2019-11-29 3903 if (stage == UPDATE_DATA_PTRS)
430640e31649be Qu Wenruo 2019-11-29 3904 return "update data pointers";
430640e31649be Qu Wenruo 2019-11-29 3905 return "unknown";
430640e31649be Qu Wenruo 2019-11-29 3906 }
430640e31649be Qu Wenruo 2019-11-29 3907
5d4f98a28c7d33 Yan Zheng 2009-06-10 3908 /*
5d4f98a28c7d33 Yan Zheng 2009-06-10 3909 * function to relocate all extents in a block group.
5d4f98a28c7d33 Yan Zheng 2009-06-10 3910 */
6bccf3ab1e1f09 Jeff Mahoney 2016-06-21 3911 int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
5d4f98a28c7d33 Yan Zheng 2009-06-10 3912 {
32da5386d9a4fd David Sterba 2019-10-29 3913 struct btrfs_block_group *bg;
29cbcf401793f4 Josef Bacik 2021-11-05 3914 struct btrfs_root *extent_root = btrfs_extent_root(fs_info, group_start);
5d4f98a28c7d33 Yan Zheng 2009-06-10 3915 struct reloc_control *rc;
0af3d00bad38d3 Josef Bacik 2010-06-21 3916 struct inode *inode;
0af3d00bad38d3 Josef Bacik 2010-06-21 3917 struct btrfs_path *path;
5d4f98a28c7d33 Yan Zheng 2009-06-10 3918 int ret;
f0486c68e4bd9a Yan, Zheng 2010-05-16 3919 int rw = 0;
5d4f98a28c7d33 Yan Zheng 2009-06-10 3920 int err = 0;
5d4f98a28c7d33 Yan Zheng 2009-06-10 3921
b4be6aefa73c9a Josef Bacik 2022-02-18 3922 /*
b4be6aefa73c9a Josef Bacik 2022-02-18 3923 * This only gets set if we had a half-deleted snapshot on mount. We
b4be6aefa73c9a Josef Bacik 2022-02-18 3924 * cannot allow relocation to start while we're still trying to clean up
b4be6aefa73c9a Josef Bacik 2022-02-18 3925 * these pending deletions.
b4be6aefa73c9a Josef Bacik 2022-02-18 3926 */
b4be6aefa73c9a Josef Bacik 2022-02-18 3927 ret = wait_on_bit(&fs_info->flags, BTRFS_FS_UNFINISHED_DROPS, TASK_INTERRUPTIBLE);
b4be6aefa73c9a Josef Bacik 2022-02-18 3928 if (ret)
b4be6aefa73c9a Josef Bacik 2022-02-18 3929 return ret;
b4be6aefa73c9a Josef Bacik 2022-02-18 3930
b4be6aefa73c9a Josef Bacik 2022-02-18 3931 /* We may have been woken up by close_ctree, so bail if we're closing. */
b4be6aefa73c9a Josef Bacik 2022-02-18 3932 if (btrfs_fs_closing(fs_info))
b4be6aefa73c9a Josef Bacik 2022-02-18 3933 return -EINTR;
b4be6aefa73c9a Josef Bacik 2022-02-18 3934
eede2bf34f4fa8 Omar Sandoval 2016-11-03 3935 bg = btrfs_lookup_block_group(fs_info, group_start);
eede2bf34f4fa8 Omar Sandoval 2016-11-03 3936 if (!bg)
eede2bf34f4fa8 Omar Sandoval 2016-11-03 3937 return -ENOENT;
eede2bf34f4fa8 Omar Sandoval 2016-11-03 3938
0320b3538b2b81 Naohiro Aota 2022-03-29 3939 /*
0320b3538b2b81 Naohiro Aota 2022-03-29 3940 * Relocation of a data block group creates ordered extents. Without
0320b3538b2b81 Naohiro Aota 2022-03-29 3941 * sb_start_write(), we can freeze the filesystem while unfinished
0320b3538b2b81 Naohiro Aota 2022-03-29 3942 * ordered extents are left. Such ordered extents can cause a deadlock
0320b3538b2b81 Naohiro Aota 2022-03-29 3943 * e.g. when syncfs() is waiting for their completion but they can't
0320b3538b2b81 Naohiro Aota 2022-03-29 3944 * finish because they block when joining a transaction, due to the
0320b3538b2b81 Naohiro Aota 2022-03-29 3945 * fact that the freeze locks are being held in write mode.
0320b3538b2b81 Naohiro Aota 2022-03-29 3946 */
0320b3538b2b81 Naohiro Aota 2022-03-29 3947 if (bg->flags & BTRFS_BLOCK_GROUP_DATA)
0320b3538b2b81 Naohiro Aota 2022-03-29 3948 ASSERT(sb_write_started(fs_info->sb));
0320b3538b2b81 Naohiro Aota 2022-03-29 3949
eede2bf34f4fa8 Omar Sandoval 2016-11-03 3950 if (btrfs_pinned_by_swapfile(fs_info, bg)) {
eede2bf34f4fa8 Omar Sandoval 2016-11-03 3951 btrfs_put_block_group(bg);
eede2bf34f4fa8 Omar Sandoval 2016-11-03 3952 return -ETXTBSY;
eede2bf34f4fa8 Omar Sandoval 2016-11-03 3953 }
eede2bf34f4fa8 Omar Sandoval 2016-11-03 3954
c258d6e36442eb Qu Wenruo 2019-03-01 3955 rc = alloc_reloc_control(fs_info);
eede2bf34f4fa8 Omar Sandoval 2016-11-03 3956 if (!rc) {
eede2bf34f4fa8 Omar Sandoval 2016-11-03 3957 btrfs_put_block_group(bg);
5d4f98a28c7d33 Yan Zheng 2009-06-10 3958 return -ENOMEM;
eede2bf34f4fa8 Omar Sandoval 2016-11-03 3959 }
5d4f98a28c7d33 Yan Zheng 2009-06-10 3960
907d2710d72754 David Sterba 2021-05-18 3961 ret = reloc_chunk_start(fs_info);
907d2710d72754 David Sterba 2021-05-18 3962 if (ret < 0) {
907d2710d72754 David Sterba 2021-05-18 3963 err = ret;
907d2710d72754 David Sterba 2021-05-18 3964 goto out_put_bg;
907d2710d72754 David Sterba 2021-05-18 3965 }
907d2710d72754 David Sterba 2021-05-18 3966
f0486c68e4bd9a Yan, Zheng 2010-05-16 3967 rc->extent_root = extent_root;
eede2bf34f4fa8 Omar Sandoval 2016-11-03 3968 rc->block_group = bg;
5d4f98a28c7d33 Yan Zheng 2009-06-10 3969
b12de52896c0e8 Qu Wenruo 2019-11-15 3970 ret = btrfs_inc_block_group_ro(rc->block_group, true);
f0486c68e4bd9a Yan, Zheng 2010-05-16 3971 if (ret) {
f0486c68e4bd9a Yan, Zheng 2010-05-16 3972 err = ret;
f0486c68e4bd9a Yan, Zheng 2010-05-16 3973 goto out;
f0486c68e4bd9a Yan, Zheng 2010-05-16 3974 }
f0486c68e4bd9a Yan, Zheng 2010-05-16 3975 rw = 1;
f0486c68e4bd9a Yan, Zheng 2010-05-16 3976
0af3d00bad38d3 Josef Bacik 2010-06-21 3977 path = btrfs_alloc_path();
0af3d00bad38d3 Josef Bacik 2010-06-21 3978 if (!path) {
0af3d00bad38d3 Josef Bacik 2010-06-21 3979 err = -ENOMEM;
0af3d00bad38d3 Josef Bacik 2010-06-21 3980 goto out;
0af3d00bad38d3 Josef Bacik 2010-06-21 3981 }
0af3d00bad38d3 Josef Bacik 2010-06-21 3982
7949f3392ed65d David Sterba 2019-03-20 3983 inode = lookup_free_space_inode(rc->block_group, path);
0af3d00bad38d3 Josef Bacik 2010-06-21 3984 btrfs_free_path(path);
0af3d00bad38d3 Josef Bacik 2010-06-21 3985
0af3d00bad38d3 Josef Bacik 2010-06-21 3986 if (!IS_ERR(inode))
20faaab2c32f37 Filipe Manana 2025-03-07 3987 ret = delete_block_group_cache(rc->block_group, inode, 0);
0af3d00bad38d3 Josef Bacik 2010-06-21 3988 else
0af3d00bad38d3 Josef Bacik 2010-06-21 3989 ret = PTR_ERR(inode);
0af3d00bad38d3 Josef Bacik 2010-06-21 3990
0af3d00bad38d3 Josef Bacik 2010-06-21 3991 if (ret && ret != -ENOENT) {
0af3d00bad38d3 Josef Bacik 2010-06-21 3992 err = ret;
0af3d00bad38d3 Josef Bacik 2010-06-21 3993 goto out;
0af3d00bad38d3 Josef Bacik 2010-06-21 3994 }
0af3d00bad38d3 Josef Bacik 2010-06-21 3995
f75a043737ecf1 Filipe Manana 2025-03-07 3996 rc->data_inode = create_reloc_inode(rc->block_group);
5d4f98a28c7d33 Yan Zheng 2009-06-10 3997 if (IS_ERR(rc->data_inode)) {
5d4f98a28c7d33 Yan Zheng 2009-06-10 3998 err = PTR_ERR(rc->data_inode);
5d4f98a28c7d33 Yan Zheng 2009-06-10 3999 rc->data_inode = NULL;
5d4f98a28c7d33 Yan Zheng 2009-06-10 4000 goto out;
5d4f98a28c7d33 Yan Zheng 2009-06-10 4001 }
5d4f98a28c7d33 Yan Zheng 2009-06-10 4002
17a21d79149b24 Johannes Thumshirn 2024-06-05 4003 describe_relocation(rc->block_group);
5d4f98a28c7d33 Yan Zheng 2009-06-10 4004
9cfa3e34e20e67 Filipe Manana 2016-04-26 4005 btrfs_wait_block_group_reservations(rc->block_group);
f78c436c3931e7 Filipe Manana 2016-05-09 4006 btrfs_wait_nocow_writers(rc->block_group);
42317ab440c110 David Sterba 2024-05-14 4007 btrfs_wait_ordered_roots(fs_info, U64_MAX, rc->block_group);
5d4f98a28c7d33 Yan Zheng 2009-06-10 4008
7ae9bd18032e81 Naohiro Aota 2021-08-19 4009 ret = btrfs_zone_finish(rc->block_group);
7ae9bd18032e81 Naohiro Aota 2021-08-19 4010 WARN_ON(ret && ret != -EAGAIN);
7ae9bd18032e81 Naohiro Aota 2021-08-19 4011
5d4f98a28c7d33 Yan Zheng 2009-06-10 4012 while (1) {
8daf07cf2b7919 David Sterba 2023-09-22 @4013 enum reloc_stage finishes_stage;
430640e31649be Qu Wenruo 2019-11-29 4014
76dda93c6ae2c1 Yan, Zheng 2009-09-21 4015 mutex_lock(&fs_info->cleaner_mutex);
5d4f98a28c7d33 Yan Zheng 2009-06-10 4016 ret = relocate_block_group(rc);
76dda93c6ae2c1 Yan, Zheng 2009-09-21 4017 mutex_unlock(&fs_info->cleaner_mutex);
ff612ba7849964 Josef Bacik 2019-02-25 4018 if (ret < 0)
5d4f98a28c7d33 Yan Zheng 2009-06-10 4019 err = ret;
5d4f98a28c7d33 Yan Zheng 2009-06-10 4020
430640e31649be Qu Wenruo 2019-11-29 4021 finishes_stage = rc->stage;
ff612ba7849964 Josef Bacik 2019-02-25 4022 /*
ff612ba7849964 Josef Bacik 2019-02-25 4023 * We may have gotten ENOSPC after we already dirtied some
ff612ba7849964 Josef Bacik 2019-02-25 4024 * extents. If writeout happens while we're relocating a
ff612ba7849964 Josef Bacik 2019-02-25 4025 * different block group we could end up hitting the
ff612ba7849964 Josef Bacik 2019-02-25 4026 * BUG_ON(rc->stage == UPDATE_DATA_PTRS) in
ff612ba7849964 Josef Bacik 2019-02-25 4027 * btrfs_reloc_cow_block. Make sure we write everything out
ff612ba7849964 Josef Bacik 2019-02-25 4028 * properly so we don't trip over this problem, and then break
ff612ba7849964 Josef Bacik 2019-02-25 4029 * out of the loop if we hit an error.
ff612ba7849964 Josef Bacik 2019-02-25 4030 */
5d4f98a28c7d33 Yan Zheng 2009-06-10 4031 if (rc->stage == MOVE_DATA_EXTENTS && rc->found_file_extent) {
e641e323abb3ce Filipe Manana 2024-05-18 4032 ret = btrfs_wait_ordered_range(BTRFS_I(rc->data_inode), 0,
0ef8b726075aa6 Josef Bacik 2013-10-25 4033 (u64)-1);
ff612ba7849964 Josef Bacik 2019-02-25 4034 if (ret)
0ef8b726075aa6 Josef Bacik 2013-10-25 4035 err = ret;
5d4f98a28c7d33 Yan Zheng 2009-06-10 4036 invalidate_mapping_pages(rc->data_inode->i_mapping,
5d4f98a28c7d33 Yan Zheng 2009-06-10 4037 0, -1);
5d4f98a28c7d33 Yan Zheng 2009-06-10 4038 rc->stage = UPDATE_DATA_PTRS;
5d4f98a28c7d33 Yan Zheng 2009-06-10 4039 }
ff612ba7849964 Josef Bacik 2019-02-25 4040
ff612ba7849964 Josef Bacik 2019-02-25 4041 if (err < 0)
ff612ba7849964 Josef Bacik 2019-02-25 4042 goto out;
ff612ba7849964 Josef Bacik 2019-02-25 4043
ff612ba7849964 Josef Bacik 2019-02-25 4044 if (rc->extents_found == 0)
ff612ba7849964 Josef Bacik 2019-02-25 4045 break;
ff612ba7849964 Josef Bacik 2019-02-25 4046
29dfe9961e3037 Johannes Thumshirn 2025-06-27 4047 btrfs_debug(fs_info, "found %llu extents, stage: %s",
430640e31649be Qu Wenruo 2019-11-29 4048 rc->extents_found, stage_to_string(finishes_stage));
5d4f98a28c7d33 Yan Zheng 2009-06-10 4049 }
5d4f98a28c7d33 Yan Zheng 2009-06-10 4050
5d4f98a28c7d33 Yan Zheng 2009-06-10 4051 WARN_ON(rc->block_group->pinned > 0);
5d4f98a28c7d33 Yan Zheng 2009-06-10 4052 WARN_ON(rc->block_group->reserved > 0);
bf38be65f3703d David Sterba 2019-10-23 4053 WARN_ON(rc->block_group->used > 0);
5d4f98a28c7d33 Yan Zheng 2009-06-10 4054 out:
f0486c68e4bd9a Yan, Zheng 2010-05-16 4055 if (err && rw)
2ff7e61e0d30ff Jeff Mahoney 2016-06-22 4056 btrfs_dec_block_group_ro(rc->block_group);
5d4f98a28c7d33 Yan Zheng 2009-06-10 4057 iput(rc->data_inode);
907d2710d72754 David Sterba 2021-05-18 4058 out_put_bg:
907d2710d72754 David Sterba 2021-05-18 4059 btrfs_put_block_group(bg);
907d2710d72754 David Sterba 2021-05-18 4060 reloc_chunk_end(fs_info);
1a0afa0ecfc4db Josef Bacik 2020-03-04 4061 free_reloc_control(rc);
5d4f98a28c7d33 Yan Zheng 2009-06-10 4062 return err;
5d4f98a28c7d33 Yan Zheng 2009-06-10 4063 }
5d4f98a28c7d33 Yan Zheng 2009-06-10 4064
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 8/9] btrfs: lower log level of relocation messages
2025-06-27 9:19 ` [PATCH RFC 8/9] btrfs: lower log level of relocation messages Johannes Thumshirn
2025-06-27 11:36 ` Christoph Hellwig
2025-06-27 23:44 ` kernel test robot
@ 2025-06-30 17:12 ` David Sterba
2025-07-01 5:09 ` Johannes Thumshirn
2 siblings, 1 reply; 25+ messages in thread
From: David Sterba @ 2025-06-30 17:12 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn
On Fri, Jun 27, 2025 at 11:19:13AM +0200, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> When running a system with automatic reclaim/balancing enabled, there are
> lots of info level messages like the following in the kernel log:
>
> BTRFS info (device nvme2n1): relocating block group 629212708864 flags data
> BTRFS info (device nvme2n1): found 510 extents, stage: move data extents
>
> Lower the log level to debug for these messages.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
I kind of like that the message is in the system log on the info level,
it's a high level operation and tracks the progress. Also it's been
there forever and I don't think I'm the only one used to seeing it
there. We have many info messages and vague guidelines when to use it,
but I think "once per block group" is still within the intentions.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 8/9] btrfs: lower log level of relocation messages
2025-06-30 17:12 ` David Sterba
@ 2025-07-01 5:09 ` Johannes Thumshirn
2025-07-01 14:43 ` David Sterba
0 siblings, 1 reply; 25+ messages in thread
From: Johannes Thumshirn @ 2025-07-01 5:09 UTC (permalink / raw)
To: dsterba@suse.cz, Johannes Thumshirn
Cc: linux-btrfs@vger.kernel.org, Damien Le Moal, Naohiro Aota,
David Sterba, Josef Bacik, Boris Burkov, Filipe Manana
On 30.06.25 19:12, David Sterba wrote:
> On Fri, Jun 27, 2025 at 11:19:13AM +0200, Johannes Thumshirn wrote:
>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> When running a system with automatic reclaim/balancing enabled, there are
>> lots of info level messages like the following in the kernel log:
>>
>> BTRFS info (device nvme2n1): relocating block group 629212708864 flags data
>> BTRFS info (device nvme2n1): found 510 extents, stage: move data extents
>>
>> Lower the log level to debug for these messages.
>>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> I kind of like that the message is in the system log on the info level,
> it's a high level operation and tracks the progress. Also it's been
> there forever and I don't think I'm the only one used to seeing it
> there. We have many info messages and vague guidelines when to use it,
> but I think "once per block group" is still within the intentions.
>
Yes but now that automatic balancing is in place this is spamming all
over dmesg.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 8/9] btrfs: lower log level of relocation messages
2025-07-01 5:09 ` Johannes Thumshirn
@ 2025-07-01 14:43 ` David Sterba
0 siblings, 0 replies; 25+ messages in thread
From: David Sterba @ 2025-07-01 14:43 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: dsterba@suse.cz, Johannes Thumshirn, linux-btrfs@vger.kernel.org,
Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
Boris Burkov, Filipe Manana
On Tue, Jul 01, 2025 at 05:09:06AM +0000, Johannes Thumshirn wrote:
> On 30.06.25 19:12, David Sterba wrote:
> > On Fri, Jun 27, 2025 at 11:19:13AM +0200, Johannes Thumshirn wrote:
> >> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> >>
> >> When running a system with automatic reclaim/balancing enabled, there are
> >> lots of info level messages like the following in the kernel log:
> >>
> >> BTRFS info (device nvme2n1): relocating block group 629212708864 flags data
> >> BTRFS info (device nvme2n1): found 510 extents, stage: move data extents
> >>
> >> Lower the log level to debug for these messages.
> >>
> >> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> >
> > I kind of like that the message is in the system log on the info level,
> > it's a high level operation and tracks the progress. Also it's been
> > there forever and I don't think I'm the only one used to seeing it
> > there. We have many info messages and vague guidelines when to use it,
> > but I think "once per block group" is still within the intentions.
> >
>
> Yes but now that automatic balancing is in place this is spamming all
> over dmesg.
We could distinguish the reason for the relocation, so that one started
manually prints what we have now and the automatic one prints only two
messages: "starting automatic bg cleanup" and, once it finishes, some
kind of summary like "bg cleanup removed 123 block groups". If you want
additional debugging just for the automatic reclaim then it's OK.
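
A minimal userspace sketch of the suggestion above, assuming a hypothetical `reloc_reason` enum and helper names that do not exist in btrfs: manual relocation keeps the per-block-group message, automatic reclaim stays silent per block group and only produces a summary line at the end.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical: who started the relocation. Not an existing btrfs type. */
enum reloc_reason { RELOC_MANUAL, RELOC_AUTO };

/* Hypothetical counter kept for one automatic cleanup run. */
struct reclaim_summary {
	unsigned int removed;
};

/*
 * Format the per-block-group message. Returns the number of characters
 * written, or 0 when the message is suppressed for automatic reclaim.
 */
static int log_relocation(char *buf, size_t len, enum reloc_reason reason,
			  unsigned long long start, const char *flags)
{
	assert(buf && len > 0);
	if (reason == RELOC_AUTO) {
		buf[0] = '\0';
		return 0;
	}
	return snprintf(buf, len, "relocating block group %llu flags %s",
			start, flags);
}

/* Count block groups handled by automatic reclaim for the summary. */
static void account_relocation(struct reclaim_summary *s,
			       enum reloc_reason reason)
{
	if (reason == RELOC_AUTO)
		s->removed++;
}

/* Format the single summary line emitted when automatic cleanup ends. */
static int log_reclaim_summary(char *buf, size_t len,
			       const struct reclaim_summary *s)
{
	assert(buf && len > 0);
	return snprintf(buf, len, "bg cleanup removed %u block groups",
			s->removed);
}
```

The formatting helpers write into a caller-supplied buffer only to keep the sketch testable; in the kernel the strings would go through the existing btrfs message helpers instead.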
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure
2025-06-27 9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
` (7 preceding siblings ...)
2025-06-27 9:19 ` [PATCH RFC 8/9] btrfs: lower log level of relocation messages Johannes Thumshirn
@ 2025-06-27 9:19 ` Johannes Thumshirn
2025-06-27 11:38 ` Christoph Hellwig
2025-06-27 12:14 ` Filipe Manana
8 siblings, 2 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-27 9:19 UTC (permalink / raw)
To: linux-btrfs
Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
Boris Burkov, Filipe Manana, Johannes Thumshirn
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
In case find_free_extent() returns ENOSPC, check if there are block-groups
in the filesystem which have been marked as 'unused' and if so, reclaim the
space occupied by these block-groups.
Restart the search for free space to place the extent afterwards.
In case the allocation is targeted for the data relocation root, skip this
step, as it can cause deadlocks between block group deletion and relocation.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/block-group.h | 11 +++++++++++
fs/btrfs/extent-tree.c | 5 +++++
2 files changed, 16 insertions(+)
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index a8bb8429c966..d5c91db88456 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -396,4 +396,15 @@ int btrfs_use_block_group_size_class(struct btrfs_block_group *bg,
bool force_wrong_size_class);
bool btrfs_block_group_should_use_size_class(const struct btrfs_block_group *bg);
+static inline bool btrfs_has_unused_block_groups(struct btrfs_fs_info *fs_info)
+{
+ bool unused_bgs;
+
+ spin_lock(&fs_info->unused_bgs_lock);
+ unused_bgs = !list_empty(&fs_info->unused_bgs);
+ spin_unlock(&fs_info->unused_bgs_lock);
+
+ return unused_bgs;
+}
+
#endif /* BTRFS_BLOCK_GROUP_H */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index da731f6d4dad..34d21713c6ab 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4683,6 +4683,11 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
if (!ret && !is_data) {
btrfs_dec_block_group_reservations(fs_info, ins->objectid);
} else if (ret == -ENOSPC) {
+ if (!btrfs_is_data_reloc_root(root) &&
+ btrfs_has_unused_block_groups(fs_info)) {
+ btrfs_delete_unused_bgs(fs_info);
+ goto again;
+ }
if (!final_tried && ins->offset) {
num_bytes = min(num_bytes >> 1, ins->offset);
num_bytes = round_down(num_bytes,
--
2.49.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure
2025-06-27 9:19 ` [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure Johannes Thumshirn
@ 2025-06-27 11:38 ` Christoph Hellwig
2025-06-30 11:45 ` Johannes Thumshirn
2025-06-27 12:14 ` Filipe Manana
1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2025-06-27 11:38 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn
On Fri, Jun 27, 2025 at 11:19:14AM +0200, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> In case find_free_extent() returns ENOSPC, check if there are block-groups
> in the filesystem which have been marked as 'unused' and if so, reclaim the
> space occupied by these block-groups.
>
> Restart the search for free space to place the extent afterwards.
>
> In case the allocation is targeted for the data relocation root, skip this
> step, as it can cause deadlocks between block group deletion and relocation.
Assuming an unused BG is one without space in it that just needs a zone
reset or discard (a quick look at the code seems to confirm that, but
with some extra caveats): why don't you reclaim it ASAP once it becomes
unused, at least modulo those space reservation caveats (which I don't
understand from that quick look).
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure
2025-06-27 11:38 ` Christoph Hellwig
@ 2025-06-30 11:45 ` Johannes Thumshirn
2025-06-30 12:05 ` Filipe Manana
0 siblings, 1 reply; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-30 11:45 UTC (permalink / raw)
To: hch@infradead.org, Johannes Thumshirn
Cc: linux-btrfs@vger.kernel.org, Damien Le Moal, Naohiro Aota,
David Sterba, Josef Bacik, Boris Burkov, Filipe Manana
On 27.06.25 13:39, Christoph Hellwig wrote:
> On Fri, Jun 27, 2025 at 11:19:14AM +0200, Johannes Thumshirn wrote:
>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> In case find_free_extent() returns ENOSPC, check if there are block-groups
>> in the filesystem which have been marked as 'unused' and if so, reclaim the
>> space occupied by these block-groups.
>>
>> Restart the search for free space to place the extent afterwards.
>>
>> In case the allocation is targeted for the data relocation root, skip this
>> step, as it can cause deadlocks between block group deletion and relocation.
>
> Assuming an unused BG is one without space in it that just needs a zone
> reset or discard (a quick look at the code seems to confirm that, but
> with some extra caveats): why don't you reclaim it ASAP once it becomes
> unused, at least modulo those space reservation caveats (which I don't
> understand from that quick look).
>
>
I've looked into it and it looks promising. Threw it into fstests and (up to
now) nothing broke. So I'll run Damien's scripts on a ZNS drive and
we'll see if it helps.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure
2025-06-30 11:45 ` Johannes Thumshirn
@ 2025-06-30 12:05 ` Filipe Manana
0 siblings, 0 replies; 25+ messages in thread
From: Filipe Manana @ 2025-06-30 12:05 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: hch@infradead.org, Johannes Thumshirn,
linux-btrfs@vger.kernel.org, Damien Le Moal, Naohiro Aota,
David Sterba, Josef Bacik, Boris Burkov, Filipe Manana
On Mon, Jun 30, 2025 at 12:46 PM Johannes Thumshirn
<Johannes.Thumshirn@wdc.com> wrote:
>
> On 27.06.25 13:39, Christoph Hellwig wrote:
> > On Fri, Jun 27, 2025 at 11:19:14AM +0200, Johannes Thumshirn wrote:
> >> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> >>
> >> In case find_free_extent() returns ENOSPC, check if there are block-groups
> >> in the filesystem which have been marked as 'unused' and if so, reclaim the
> >> space occupied by these block-groups.
> >>
> >> Restart the search for free space to place the extent afterwards.
> >>
> >> In case the allocation is targeted for the data relocation root, skip this
> >> step, as it can cause deadlocks between block group deletion and relocation.
> >
> > Assuming an unused BG is one without space in it that just needs a zone
> > reset or discard (a quick look at the code seems to confirm that, but
> > with some extra caveats): why don't you reclaim it ASAP once it becomes
> > unused, at least modulo those space reservation caveats (which I don't
> > understand from that quick look).
> >
> >
>
> I've looked into it and it looks promising. Threw it into fstests and (up to
> now) nothing broke. So I'll run Damien's scripts on a ZNS drive and
> we'll see if it helps.
That brings a new problem.
For example a data block group becomes empty and you delete it immediately.
If a data allocation happens before the transaction used to delete the
block group is committed and there are no other data block groups with
enough space and there's no more unallocated device space, we will fail
with -ENOSPC, whereas before we wouldn't.
Remember that a deleted block group's space can only be allocated again
after the transaction used to delete it is committed, to respect COW
semantics.
That's why you see the allocator using the commit root (at
find_free_dev_extent()).
And you can't commit the transaction as soon as the bg becomes unused,
as we're holding a transaction handle open and would deadlock.
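
The constraint described above can be illustrated with a toy userspace model. This is a sketch under stated assumptions, not btrfs code: `toy_fs` and the `toy_*` functions are hypothetical, and the model reduces the allocator to two counters, space visible in the committed view and space freed by block group deletions in the still-running transaction.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy model only: none of these names exist in btrfs. */
struct toy_fs {
	uint64_t free_committed; /* space the allocator may hand out */
	uint64_t pending_free;   /* freed in the running transaction */
};

/* Deleting an unused block group frees space only "pending commit". */
static void toy_delete_unused_bg(struct toy_fs *fs, uint64_t bg_size)
{
	assert(fs);
	fs->pending_free += bg_size;
}

/* The allocator only looks at the committed view, as with COW. */
static int toy_reserve_extent(struct toy_fs *fs, uint64_t bytes)
{
	if (bytes > fs->free_committed)
		return -ENOSPC;
	fs->free_committed -= bytes;
	return 0;
}

/* Committing the transaction makes the deleted space allocatable. */
static void toy_commit_transaction(struct toy_fs *fs)
{
	fs->free_committed += fs->pending_free;
	fs->pending_free = 0;
}
```

In this model, deleting a block group and retrying the reservation immediately still returns -ENOSPC; the retry only succeeds after `toy_commit_transaction()`, which is the step the allocation path cannot perform while it holds a transaction handle.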
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure
2025-06-27 9:19 ` [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure Johannes Thumshirn
2025-06-27 11:38 ` Christoph Hellwig
@ 2025-06-27 12:14 ` Filipe Manana
1 sibling, 0 replies; 25+ messages in thread
From: Filipe Manana @ 2025-06-27 12:14 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn
On Fri, Jun 27, 2025 at 10:36 AM Johannes Thumshirn <jth@kernel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> In case find_free_extent() returns ENOSPC, check if there are block-groups
> in the filesystem which have been marked as 'unused' and if so, reclaim the
> space occupied by these block-groups.
>
> Restart the search for free space to place the extent afterwards.
>
> In case the allocation is targeted for the data relocation root, skip this
> step, as it can cause deadlocks between block group deletion and relocation.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/block-group.h | 11 +++++++++++
> fs/btrfs/extent-tree.c | 5 +++++
> 2 files changed, 16 insertions(+)
>
> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> index a8bb8429c966..d5c91db88456 100644
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -396,4 +396,15 @@ int btrfs_use_block_group_size_class(struct btrfs_block_group *bg,
> bool force_wrong_size_class);
> bool btrfs_block_group_should_use_size_class(const struct btrfs_block_group *bg);
>
> +static inline bool btrfs_has_unused_block_groups(struct btrfs_fs_info *fs_info)
> +{
> + bool unused_bgs;
> +
> + spin_lock(&fs_info->unused_bgs_lock);
> + unused_bgs = !list_empty(&fs_info->unused_bgs);
> + spin_unlock(&fs_info->unused_bgs_lock);
> +
> + return unused_bgs;
> +}
> +
> #endif /* BTRFS_BLOCK_GROUP_H */
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index da731f6d4dad..34d21713c6ab 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4683,6 +4683,11 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
> if (!ret && !is_data) {
> btrfs_dec_block_group_reservations(fs_info, ins->objectid);
> } else if (ret == -ENOSPC) {
> + if (!btrfs_is_data_reloc_root(root) &&
> + btrfs_has_unused_block_groups(fs_info)) {
> + btrfs_delete_unused_bgs(fs_info);
> + goto again;
> + }
Unfortunately this won't solve the -ENOSPC.
A deleted block group can't be reused in the same transaction, we have
to commit the transaction used to delete it.
This is to respect COW semantics and crash proof consistency.
And we can't commit the transaction here since that would deadlock for
any path that holds a transaction handle open, such as when modifying
any tree for example.
So unless some other task happens to commit the transaction used by
btrfs_delete_unused_bgs() after we call btrfs_delete_unused_bgs() and
before we retry the extent reservation when jumping to 'again',
-ENOSPC will happen again.
> if (!final_tried && ins->offset) {
> num_bytes = min(num_bytes >> 1, ins->offset);
> num_bytes = round_down(num_bytes,
> --
> 2.49.0
>
>
^ permalink raw reply [flat|nested] 25+ messages in thread