[PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure

public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure
@ 2025-06-27  9:19 Johannes Thumshirn
  2025-06-27  9:19 ` [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target Johannes Thumshirn
                   ` (8 more replies)
  0 siblings, 9 replies; 23+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

A few weeks ago Damien started benchmarking zoned BTRFS and zoned XFS
garbage collection on SMR and ZNS drives. 

One of these benchmarks was filling the FS to 75% and then subsequientially
overwriting the files again.

One thing that came out of these trials, is that BTRFS on ZNS devices did
not finish the overwrite phase because it prematurely ran out of space.

Once I tried reclaiming earlier (patch 9/9) I've seen haning task because
of extremely high lock contention on serveral locks. The patches 2 to 6
either remove these locks, or in case of the space-info lock remove
unneeded lock contentiom.

Patches 7 and 8 lower the log level of reclaim messages, because with the
automatic reclaim BTRFS has these days there is a lot of log spamming.

Pathch 8 triggers removal of unused block groups on allocation failure, I
opted for making it non zoned specific as I think even regular filesystems
can benefit from this.. The first patch from Naohiro is also needed so we
don't prematurely finish metadata block groups on zoned devices.

Reasons for RFC, there are a lot of locking changes involved in this
series. As much as I'm comfortable with running fio workloads and fstests,
I really prefere more eyes. Especially Filipe always is a good reviewer of
locking related changes and Boris for reclaiming. The series is probably
not the final solution either but only forward progress, but I don't want
to be running into the wrong direction.

This series also includes a previously sent series of me titled "btrfs:
zoned: reduce lock contention in zoned extent allocator" and can be found
here:
https://lore.kernel.org/linux-btrfs/20250624093710.18685-1-jth@kernel.org/

Johannes Thumshirn (8):
  btrfs: zoned: get rid of relocation_bg_lock
  btrfs: zoned: get rid of treelog_bg_lock
  btrfs: zoned: don't hold space_info lock on zoned allocation
  btrfs: remove delalloc_root_mutex
  btrfs: remove btrfs_root's delalloc_mutex
  btrfs: lower auto-reclaim message log level
  btrfs: lower log level of relocation messages
  btrfs: remove unused bgs on allocation failure

Naohiro Aota (1):
  btrfs: zoned: do not select metadata BG as finish target

 fs/btrfs/block-group.c |  2 +-
 fs/btrfs/block-group.h | 11 ++++++
 fs/btrfs/ctree.h       |  1 -
 fs/btrfs/disk-io.c     |  4 ---
 fs/btrfs/extent-tree.c | 81 ++++++++++++++----------------------------
 fs/btrfs/fs.h          |  8 +----
 fs/btrfs/inode.c       |  4 ---
 fs/btrfs/relocation.c  |  4 +--
 fs/btrfs/zoned.c       | 19 +++++-----
 fs/btrfs/zoned.h       |  7 ++--
 10 files changed, 55 insertions(+), 86 deletions(-)

-- 
2.49.0

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target
  2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
@ 2025-06-27  9:19 ` Johannes Thumshirn
  2025-06-27 11:34   ` Christoph Hellwig
  2025-06-27  9:19 ` [PATCH RFC 2/9] btrfs: zoned: get rid of relocation_bg_lock Johannes Thumshirn
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana

From: Naohiro Aota <naohiro.aota@wdc.com>

We call btrfs_zone_finish_one_bg() to zone finish one block group and make
a room to activate another block group. Currently, we can choose a metadata
block group as a target. But, as we reserve an active metadata block group,
we no longer want to select a metadata block group. So, skip it in the
loop.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/zoned.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index bd987c90a05c..0d5d6db72b62 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2651,8 +2651,10 @@ int btrfs_zone_finish_one_bg(struct btrfs_fs_info *fs_info)
 
 		spin_lock(&block_group->lock);
 		if (block_group->reserved || block_group->alloc_offset == 0 ||
-		    (block_group->flags & BTRFS_BLOCK_GROUP_SYSTEM) ||
-		    test_bit(BLOCK_GROUP_FLAG_ZONED_DATA_RELOC, &block_group->runtime_flags)) {
+		    (block_group->flags &
+		     (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_SYSTEM)) ||
+		    test_bit(BLOCK_GROUP_FLAG_ZONED_DATA_RELOC,
+			     &block_group->runtime_flags)) {
 			spin_unlock(&block_group->lock);
 			continue;
 		}
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC 2/9] btrfs: zoned: get rid of relocation_bg_lock
  2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
  2025-06-27  9:19 ` [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target Johannes Thumshirn
@ 2025-06-27  9:19 ` Johannes Thumshirn
  2025-06-27  9:19 ` [PATCH RFC 3/9] btrfs: zoned: get rid of treelog_bg_lock Johannes Thumshirn
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Lockstat analysis of benchmark workloads shows a very high contention of
the relocation_bg_lock. But the relocation_bg_lock only protects a single
field in 'struct btrfs_fs_info', namely 'u64 data_reloc_bg'.

Use READ_ONCE()/WRITE_ONCE() to access 'btrfs_fs_info::data_reloc_bg'.

This is safe in the allocator path, as relocation I/O is only going to
block groups in the relocation sub-space_info and at the moment, there is
only one relocation block group in this space info.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/disk-io.c     |  1 -
 fs/btrfs/extent-tree.c | 28 +++++++++++-----------------
 fs/btrfs/fs.h          |  6 +-----
 fs/btrfs/zoned.c       | 11 +++++------
 4 files changed, 17 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6ac5be02dce7..9a13f5b1ed43 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2791,7 +2791,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	spin_lock_init(&fs_info->unused_bgs_lock);
 	spin_lock_init(&fs_info->treelog_bg_lock);
 	spin_lock_init(&fs_info->zone_active_bgs_lock);
-	spin_lock_init(&fs_info->relocation_bg_lock);
 	rwlock_init(&fs_info->tree_mod_log_lock);
 	rwlock_init(&fs_info->global_root_lock);
 	mutex_init(&fs_info->unused_bg_unpin_mutex);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 10f50c725313..a9bda68a1883 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3865,14 +3865,10 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	 * Do not allow non-relocation blocks in the dedicated relocation block
 	 * group, and vice versa.
 	 */
-	spin_lock(&fs_info->relocation_bg_lock);
-	data_reloc_bytenr = fs_info->data_reloc_bg;
+	data_reloc_bytenr = READ_ONCE(fs_info->data_reloc_bg);
 	if (data_reloc_bytenr &&
 	    ((ffe_ctl->for_data_reloc && bytenr != data_reloc_bytenr) ||
 	     (!ffe_ctl->for_data_reloc && bytenr == data_reloc_bytenr)))
-		skip = true;
-	spin_unlock(&fs_info->relocation_bg_lock);
-	if (skip)
 		return 1;
 
 	/* Check RO and no space case before trying to activate it */
@@ -3899,7 +3895,6 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	spin_lock(&space_info->lock);
 	spin_lock(&block_group->lock);
 	spin_lock(&fs_info->treelog_bg_lock);
-	spin_lock(&fs_info->relocation_bg_lock);
 
 	if (ret)
 		goto out;
@@ -3908,8 +3903,8 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	       block_group->start == fs_info->treelog_bg ||
 	       fs_info->treelog_bg == 0);
 	ASSERT(!ffe_ctl->for_data_reloc ||
-	       block_group->start == fs_info->data_reloc_bg ||
-	       fs_info->data_reloc_bg == 0);
+	       block_group->start == data_reloc_bytenr ||
+	       data_reloc_bytenr == 0);
 
 	if (block_group->ro ||
 	    (!ffe_ctl->for_data_reloc &&
@@ -3932,7 +3927,7 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	 * Do not allow currently used block group to be the data relocation
 	 * dedicated block group.
 	 */
-	if (ffe_ctl->for_data_reloc && !fs_info->data_reloc_bg &&
+	if (ffe_ctl->for_data_reloc && data_reloc_bytenr == 0 &&
 	    (block_group->used || block_group->reserved)) {
 		ret = 1;
 		goto out;
@@ -3957,8 +3952,8 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 		fs_info->treelog_bg = block_group->start;
 
 	if (ffe_ctl->for_data_reloc) {
-		if (!fs_info->data_reloc_bg)
-			fs_info->data_reloc_bg = block_group->start;
+		if (READ_ONCE(fs_info->data_reloc_bg) == 0)
+			WRITE_ONCE(fs_info->data_reloc_bg, block_group->start);
 		/*
 		 * Do not allow allocations from this block group, unless it is
 		 * for data relocation. Compared to increasing the ->ro, setting
@@ -3994,8 +3989,7 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	if (ret && ffe_ctl->for_treelog)
 		fs_info->treelog_bg = 0;
 	if (ret && ffe_ctl->for_data_reloc)
-		fs_info->data_reloc_bg = 0;
-	spin_unlock(&fs_info->relocation_bg_lock);
+		WRITE_ONCE(fs_info->data_reloc_bg, 0);
 	spin_unlock(&fs_info->treelog_bg_lock);
 	spin_unlock(&block_group->lock);
 	spin_unlock(&space_info->lock);
@@ -4304,10 +4298,10 @@ static int prepare_allocation_zoned(struct btrfs_fs_info *fs_info,
 			ffe_ctl->hint_byte = fs_info->treelog_bg;
 		spin_unlock(&fs_info->treelog_bg_lock);
 	} else if (ffe_ctl->for_data_reloc) {
-		spin_lock(&fs_info->relocation_bg_lock);
-		if (fs_info->data_reloc_bg)
-			ffe_ctl->hint_byte = fs_info->data_reloc_bg;
-		spin_unlock(&fs_info->relocation_bg_lock);
+		u64 data_reloc_bg = READ_ONCE(fs_info->data_reloc_bg);
+
+		if (data_reloc_bg)
+			ffe_ctl->hint_byte = data_reloc_bg;
 	} else if (ffe_ctl->flags & BTRFS_BLOCK_GROUP_DATA) {
 		struct btrfs_block_group *block_group;
 
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index b239e4b8421c..570f4b85096c 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -849,11 +849,7 @@ struct btrfs_fs_info {
 	spinlock_t treelog_bg_lock;
 	u64 treelog_bg;
 
-	/*
-	 * Start of the dedicated data relocation block group, protected by
-	 * relocation_bg_lock.
-	 */
-	spinlock_t relocation_bg_lock;
+	/* Start of the dedicated data relocation block group */
 	u64 data_reloc_bg;
 	struct mutex zoned_data_reloc_io_lock;
 
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 0d5d6db72b62..388c277a84d3 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2495,11 +2495,10 @@ void btrfs_schedule_zone_finish_bg(struct btrfs_block_group *bg,
 void btrfs_clear_data_reloc_bg(struct btrfs_block_group *bg)
 {
 	struct btrfs_fs_info *fs_info = bg->fs_info;
+	u64 data_reloc_bg = READ_ONCE(fs_info->data_reloc_bg);
 
-	spin_lock(&fs_info->relocation_bg_lock);
-	if (fs_info->data_reloc_bg == bg->start)
-		fs_info->data_reloc_bg = 0;
-	spin_unlock(&fs_info->relocation_bg_lock);
+	if (data_reloc_bg == bg->start)
+		WRITE_ONCE(fs_info->data_reloc_bg, 0);
 }
 
 void btrfs_zoned_reserve_data_reloc_bg(struct btrfs_fs_info *fs_info)
@@ -2518,7 +2517,7 @@ void btrfs_zoned_reserve_data_reloc_bg(struct btrfs_fs_info *fs_info)
 	if (!btrfs_is_zoned(fs_info))
 		return;
 
-	if (fs_info->data_reloc_bg)
+	if (READ_ONCE(fs_info->data_reloc_bg))
 		return;
 
 	if (sb_rdonly(fs_info->sb))
@@ -2539,7 +2538,7 @@ void btrfs_zoned_reserve_data_reloc_bg(struct btrfs_fs_info *fs_info)
 			continue;
 		}
 
-		fs_info->data_reloc_bg = bg->start;
+		WRITE_ONCE(fs_info->data_reloc_bg, bg->start);
 		set_bit(BLOCK_GROUP_FLAG_ZONED_DATA_RELOC, &bg->runtime_flags);
 		btrfs_zone_activate(bg);
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC 3/9] btrfs: zoned: get rid of treelog_bg_lock
  2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
  2025-06-27  9:19 ` [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target Johannes Thumshirn
  2025-06-27  9:19 ` [PATCH RFC 2/9] btrfs: zoned: get rid of relocation_bg_lock Johannes Thumshirn
@ 2025-06-27  9:19 ` Johannes Thumshirn
  2025-06-27  9:19 ` [PATCH RFC 4/9] btrfs: zoned: don't hold space_info lock on zoned allocation Johannes Thumshirn
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Lockstat analysis of benchmark workloads shows a very high contention of
the treelog_bg_lock. But the treelog_bg_lock only protects a single
field in 'struct btrfs_fs_info', namely 'u64 treelog_bg'.

Use READ_ONCE()/WRITE_ONCE() to access 'btrfs_fs_info::treelog_bg'.

This is safe in the allocator path, as treelog I/O is only going to block
groups in the treelog sub-space_info and at the moment, there is only one
treelog block group in this space info.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/disk-io.c     |  1 -
 fs/btrfs/extent-tree.c | 45 +++++++++++-------------------------------
 fs/btrfs/fs.h          |  1 -
 fs/btrfs/zoned.c       |  2 +-
 fs/btrfs/zoned.h       |  7 +++----
 5 files changed, 15 insertions(+), 41 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9a13f5b1ed43..35cd38de7727 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2789,7 +2789,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	spin_lock_init(&fs_info->defrag_inodes_lock);
 	spin_lock_init(&fs_info->super_lock);
 	spin_lock_init(&fs_info->unused_bgs_lock);
-	spin_lock_init(&fs_info->treelog_bg_lock);
 	spin_lock_init(&fs_info->zone_active_bgs_lock);
 	rwlock_init(&fs_info->tree_mod_log_lock);
 	rwlock_init(&fs_info->global_root_lock);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a9bda68a1883..46358a555f78 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3809,22 +3809,6 @@ static int do_allocation_clustered(struct btrfs_block_group *block_group,
 	return find_free_extent_unclustered(block_group, ffe_ctl);
 }
 
-/*
- * Tree-log block group locking
- * ============================
- *
- * fs_info::treelog_bg_lock protects the fs_info::treelog_bg which
- * indicates the starting address of a block group, which is reserved only
- * for tree-log metadata.
- *
- * Lock nesting
- * ============
- *
- * space_info::lock
- *   block_group::lock
- *     fs_info::treelog_bg_lock
- */
-
 /*
  * Simple allocator for sequential-only block group. It only allows sequential
  * allocation. No need to play with trees. This function also reserves the
@@ -3844,7 +3828,6 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	u64 log_bytenr;
 	u64 data_reloc_bytenr;
 	int ret = 0;
-	bool skip = false;
 
 	ASSERT(btrfs_is_zoned(block_group->fs_info));
 
@@ -3852,13 +3835,9 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	 * Do not allow non-tree-log blocks in the dedicated tree-log block
 	 * group, and vice versa.
 	 */
-	spin_lock(&fs_info->treelog_bg_lock);
-	log_bytenr = fs_info->treelog_bg;
+	log_bytenr = READ_ONCE(fs_info->treelog_bg);
 	if (log_bytenr && ((ffe_ctl->for_treelog && bytenr != log_bytenr) ||
 			   (!ffe_ctl->for_treelog && bytenr == log_bytenr)))
-		skip = true;
-	spin_unlock(&fs_info->treelog_bg_lock);
-	if (skip)
 		return 1;
 
 	/*
@@ -3894,14 +3873,13 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 
 	spin_lock(&space_info->lock);
 	spin_lock(&block_group->lock);
-	spin_lock(&fs_info->treelog_bg_lock);
 
 	if (ret)
 		goto out;
 
 	ASSERT(!ffe_ctl->for_treelog ||
-	       block_group->start == fs_info->treelog_bg ||
-	       fs_info->treelog_bg == 0);
+	       block_group->start == log_bytenr ||
+	       log_bytenr == 0);
 	ASSERT(!ffe_ctl->for_data_reloc ||
 	       block_group->start == data_reloc_bytenr ||
 	       data_reloc_bytenr == 0);
@@ -3917,7 +3895,7 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	 * Do not allow currently using block group to be tree-log dedicated
 	 * block group.
 	 */
-	if (ffe_ctl->for_treelog && !fs_info->treelog_bg &&
+	if (ffe_ctl->for_treelog && log_bytenr == 0 &&
 	    (block_group->used || block_group->reserved)) {
 		ret = 1;
 		goto out;
@@ -3948,8 +3926,8 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 		goto out;
 	}
 
-	if (ffe_ctl->for_treelog && !fs_info->treelog_bg)
-		fs_info->treelog_bg = block_group->start;
+	if (ffe_ctl->for_treelog && READ_ONCE(fs_info->treelog_bg) == 0)
+		WRITE_ONCE(fs_info->treelog_bg, block_group->start);
 
 	if (ffe_ctl->for_data_reloc) {
 		if (READ_ONCE(fs_info->data_reloc_bg) == 0)
@@ -3987,10 +3965,9 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 
 out:
 	if (ret && ffe_ctl->for_treelog)
-		fs_info->treelog_bg = 0;
+		WRITE_ONCE(fs_info->treelog_bg, 0);
 	if (ret && ffe_ctl->for_data_reloc)
 		WRITE_ONCE(fs_info->data_reloc_bg, 0);
-	spin_unlock(&fs_info->treelog_bg_lock);
 	spin_unlock(&block_group->lock);
 	spin_unlock(&space_info->lock);
 	return ret;
@@ -4293,10 +4270,10 @@ static int prepare_allocation_zoned(struct btrfs_fs_info *fs_info,
 				    struct find_free_extent_ctl *ffe_ctl)
 {
 	if (ffe_ctl->for_treelog) {
-		spin_lock(&fs_info->treelog_bg_lock);
-		if (fs_info->treelog_bg)
-			ffe_ctl->hint_byte = fs_info->treelog_bg;
-		spin_unlock(&fs_info->treelog_bg_lock);
+		u64 treelog_bg = READ_ONCE(fs_info->treelog_bg);
+
+		if (treelog_bg)
+			ffe_ctl->hint_byte = treelog_bg;
 	} else if (ffe_ctl->for_data_reloc) {
 		u64 data_reloc_bg = READ_ONCE(fs_info->data_reloc_bg);
 
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 570f4b85096c..a388af40a251 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -846,7 +846,6 @@ struct btrfs_fs_info {
 	u64 max_zone_append_size;
 
 	struct mutex zoned_meta_io_lock;
-	spinlock_t treelog_bg_lock;
 	u64 treelog_bg;
 
 	/* Start of the dedicated data relocation block group */
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 388c277a84d3..c89f846af6dd 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1948,7 +1948,7 @@ static bool check_bg_is_active(struct btrfs_eb_write_context *ctx,
 	if (test_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &block_group->runtime_flags))
 		return true;
 
-	if (fs_info->treelog_bg == block_group->start) {
+	if (READ_ONCE(fs_info->treelog_bg) == block_group->start) {
 		if (!btrfs_zone_activate(block_group)) {
 			int ret_fin = btrfs_zone_finish_one_bg(fs_info);
 
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 6e11533b8e14..c1b3a5c3a799 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -383,14 +383,13 @@ static inline void btrfs_zoned_meta_io_unlock(struct btrfs_fs_info *fs_info)
 static inline void btrfs_clear_treelog_bg(struct btrfs_block_group *bg)
 {
 	struct btrfs_fs_info *fs_info = bg->fs_info;
+	u64 treelog_bg = READ_ONCE(fs_info->treelog_bg);
 
 	if (!btrfs_is_zoned(fs_info))
 		return;
 
-	spin_lock(&fs_info->treelog_bg_lock);
-	if (fs_info->treelog_bg == bg->start)
-		fs_info->treelog_bg = 0;
-	spin_unlock(&fs_info->treelog_bg_lock);
+	if (treelog_bg == bg->start)
+		WRITE_ONCE(fs_info->treelog_bg, 0);
 }
 
 static inline void btrfs_zoned_data_reloc_lock(struct btrfs_inode *inode)
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC 4/9] btrfs: zoned: don't hold space_info lock on zoned allocation
  2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
                   ` (2 preceding siblings ...)
  2025-06-27  9:19 ` [PATCH RFC 3/9] btrfs: zoned: get rid of treelog_bg_lock Johannes Thumshirn
@ 2025-06-27  9:19 ` Johannes Thumshirn
  2025-06-27  9:19 ` [PATCH RFC 5/9] btrfs: remove delalloc_root_mutex Johannes Thumshirn
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

The zoned extent allocator holds 'struct btrfs_space_info::lock' nearly
over the entirety of the allocation process, but nothing in
do_allocation_zoned() is actually accessing fields of 'struct
btrfs_space_info'.

Furthermore taking lock_stat snapshots in performance testing, always shows
the space_info::lock as the most contented lock in the entire system.

Remove locking the space_info lock during do_allocation_zoned() to reduce
lock contention.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/extent-tree.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 46358a555f78..da731f6d4dad 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3819,7 +3819,6 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 			       struct btrfs_block_group **bg_ret)
 {
 	struct btrfs_fs_info *fs_info = block_group->fs_info;
-	struct btrfs_space_info *space_info = block_group->space_info;
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	u64 start = block_group->start;
 	u64 num_bytes = ffe_ctl->num_bytes;
@@ -3871,7 +3870,6 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 		 */
 	}
 
-	spin_lock(&space_info->lock);
 	spin_lock(&block_group->lock);
 
 	if (ret)
@@ -3969,7 +3967,6 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	if (ret && ffe_ctl->for_data_reloc)
 		WRITE_ONCE(fs_info->data_reloc_bg, 0);
 	spin_unlock(&block_group->lock);
-	spin_unlock(&space_info->lock);
 	return ret;
 }
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC 5/9] btrfs: remove delalloc_root_mutex
  2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
                   ` (3 preceding siblings ...)
  2025-06-27  9:19 ` [PATCH RFC 4/9] btrfs: zoned: don't hold space_info lock on zoned allocation Johannes Thumshirn
@ 2025-06-27  9:19 ` Johannes Thumshirn
  2025-06-27 12:42   ` Filipe Manana
  2025-06-27  9:19 ` [PATCH RFC 6/9] btrfs: remove btrfs_root's delalloc_mutex Johannes Thumshirn
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

When benchmarking garbage collection on zoned BTRFS filesystems on ZNS
drives, we regularly observe hung_task messages like the following:

INFO: task kworker/u132:2:297 blocked for more than 122 seconds.
       Not tainted 6.16.0-rc1+ #1225
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u132:2  state:D stack:0     pid:297   tgid:297   ppid:2      task_flags:0x4208060 flags:0x00004000
 Workqueue: events_unbound btrfs_preempt_reclaim_metadata_space
 Call Trace:
  <TASK>
  __schedule+0x2f9/0x7b0
  schedule+0x27/0x80
  schedule_preempt_disabled+0x15/0x30
  __mutex_lock.constprop.0+0x4af/0x890
  ? srso_return_thunk+0x5/0x5f
  btrfs_start_delalloc_roots+0x8a/0x290
  ? timerqueue_del+0x2e/0x60
  shrink_delalloc+0x10c/0x2d0
  ? srso_return_thunk+0x5/0x5f
  ? psi_group_change+0x19e/0x460
  ? srso_return_thunk+0x5/0x5f
  ? btrfs_reduce_alloc_profile+0x9a/0x1d0
  flush_space+0x202/0x280
  ? srso_return_thunk+0x5/0x5f
  ? need_preemptive_reclaim+0xaa/0x190
  btrfs_preempt_reclaim_metadata_space+0xe7/0x340
  process_one_work+0x192/0x350
  worker_thread+0x25a/0x3a0
  ? __pfx_worker_thread+0x10/0x10
  kthread+0xfc/0x240
  ? __pfx_kthread+0x10/0x10
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x152/0x180
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 INFO: task kworker/u132:2:297 is blocked on a mutex likely owned by task kworker/u129:0:2359.
 task:kworker/u129:0  state:R  running task     stack:0     pid:2359  tgid:2359  ppid:2

The affected tasks are blocked on 'struct btrfs_fs_info::delalloc_root_mutex',
a global lock that serializes entry into btrfs_start_delalloc_roots().
This lock was introduced in commit 573bfb72f760 ("Btrfs: fix possible
empty list access when flushing the delalloc inodes") but without a
clear justification for its necessity.

However, the condition it was meant to protect against—a possibly empty
list access—is already safely handled by 'list_splice_init()', which
does nothing when the source list is empty.

There are no known concurrency issues in btrfs_start_delalloc_roots()
that require serialization via this mutex. All critical regions are
either covered by per-root locking or operate on safely isolated lists.

Removing the lock eliminates the observed hangs and improves metadata
GC throughput, particularly on systems with high concurrency like
ZNS-based deployments.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/disk-io.c | 1 -
 fs/btrfs/fs.h      | 1 -
 fs/btrfs/inode.c   | 2 --
 3 files changed, 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 35cd38de7727..929f39886b0e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2795,7 +2795,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	mutex_init(&fs_info->unused_bg_unpin_mutex);
 	mutex_init(&fs_info->reclaim_bgs_lock);
 	mutex_init(&fs_info->reloc_mutex);
-	mutex_init(&fs_info->delalloc_root_mutex);
 	mutex_init(&fs_info->zoned_meta_io_lock);
 	mutex_init(&fs_info->zoned_data_reloc_io_lock);
 	seqlock_init(&fs_info->profiles_lock);
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index a388af40a251..04ebc976f841 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -606,7 +606,6 @@ struct btrfs_fs_info {
 	 */
 	struct list_head ordered_roots;
 
-	struct mutex delalloc_root_mutex;
 	spinlock_t delalloc_root_lock;
 	/* All fs/file tree roots that have delalloc inodes. */
 	struct list_head delalloc_roots;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 80c72c594b19..d68f4ef61c43 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8766,7 +8766,6 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
 	if (BTRFS_FS_ERROR(fs_info))
 		return -EROFS;
 
-	mutex_lock(&fs_info->delalloc_root_mutex);
 	spin_lock(&fs_info->delalloc_root_lock);
 	list_splice_init(&fs_info->delalloc_roots, &splice);
 	while (!list_empty(&splice)) {
@@ -8800,7 +8799,6 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
 		list_splice_tail(&splice, &fs_info->delalloc_roots);
 		spin_unlock(&fs_info->delalloc_root_lock);
 	}
-	mutex_unlock(&fs_info->delalloc_root_mutex);
 	return ret;
 }
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC 6/9] btrfs: remove btrfs_root's delalloc_mutex
  2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
                   ` (4 preceding siblings ...)
  2025-06-27  9:19 ` [PATCH RFC 5/9] btrfs: remove delalloc_root_mutex Johannes Thumshirn
@ 2025-06-27  9:19 ` Johannes Thumshirn
  2025-06-27 12:30   ` Filipe Manana
  2025-06-27  9:19 ` [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level Johannes Thumshirn
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

When running metadata space reclaim under high I/O concurrency, we observe
hung tasks caused by lock contention on `btrfs_root::delalloc_mutex`. For
example:

  INFO: task kworker/u132:1:2177 blocked for more than 122 seconds.
        Not tainted 6.16.0-rc3+ #1246
  Workqueue: events_unbound btrfs_preempt_reclaim_metadata_space
  Call Trace:
    __schedule+0x2f9/0x7b0
    schedule+0x27/0x80
    __mutex_lock.constprop.0+0x4af/0x890
    start_delalloc_inodes+0x6e/0x400
    btrfs_start_delalloc_roots+0x162/0x270
    shrink_delalloc+0x10c/0x2d0
    flush_space+0x202/0x280
    btrfs_preempt_reclaim_metadata_space+0xe7/0x340

The `delalloc_mutex` serializes delalloc flushing per root but is no
longer necessary. All critical paths (inode flushing, extent writing,
metadata updates) are already synchronized using finer-grained locking at
the inode, page, and tree levels. In particular, concurrent flushers
coordinate via inode locking, and no shared state requires global
serialization across the root.

Removing this mutex avoids unnecessary blocking in reclaim paths and
improves responsiveness under pressure, especially on systems with many
flushers or multi-queue SSDs/ZNS devices.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/ctree.h   | 1 -
 fs/btrfs/disk-io.c | 1 -
 fs/btrfs/inode.c   | 2 --
 3 files changed, 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8a54a0b6e502..06c7742a5de0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -238,7 +238,6 @@ struct btrfs_root {
 	spinlock_t root_item_lock;
 	refcount_t refs;
 
-	struct mutex delalloc_mutex;
 	spinlock_t delalloc_lock;
 	/*
 	 * all of the inodes that have delalloc bytes.  It is possible for
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 929f39886b0e..e39f5e893312 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -678,7 +678,6 @@ static struct btrfs_root *btrfs_alloc_root(struct btrfs_fs_info *fs_info,
 	mutex_init(&root->objectid_mutex);
 	mutex_init(&root->log_mutex);
 	mutex_init(&root->ordered_extent_mutex);
-	mutex_init(&root->delalloc_mutex);
 	init_waitqueue_head(&root->qgroup_flush_wait);
 	init_waitqueue_head(&root->log_writer_wait);
 	init_waitqueue_head(&root->log_commit_wait[0]);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d68f4ef61c43..b9c52b9ea912 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8673,7 +8673,6 @@ static int start_delalloc_inodes(struct btrfs_root *root,
 	int ret = 0;
 	bool full_flush = wbc->nr_to_write == LONG_MAX;
 
-	mutex_lock(&root->delalloc_mutex);
 	spin_lock(&root->delalloc_lock);
 	list_splice_init(&root->delalloc_inodes, &splice);
 	while (!list_empty(&splice)) {
@@ -8730,7 +8729,6 @@ static int start_delalloc_inodes(struct btrfs_root *root,
 		list_splice_tail(&splice, &root->delalloc_inodes);
 		spin_unlock(&root->delalloc_lock);
 	}
-	mutex_unlock(&root->delalloc_mutex);
 	return ret;
 }
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level
  2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
                   ` (5 preceding siblings ...)
  2025-06-27  9:19 ` [PATCH RFC 6/9] btrfs: remove btrfs_root's delalloc_mutex Johannes Thumshirn
@ 2025-06-27  9:19 ` Johannes Thumshirn
  2025-06-27 11:35   ` Christoph Hellwig
  2025-06-27  9:19 ` [PATCH RFC 8/9] btrfs: lower log level of relocation messages Johannes Thumshirn
  2025-06-27  9:19 ` [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure Johannes Thumshirn
  8 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

When running a system with automatic reclaim/balancing enabled, there are
lots of info level messages like the following in the kernel log:

 BTRFS info (device nvme2n1): reclaiming chunk 1138166333440 with 10% used 0% reserved 89% unusable

Lower the log level to debug for these messages.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/block-group.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 00e567a4cd16..5e6aead653c4 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1963,7 +1963,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
 		reserved = bg->reserved;
 		spin_unlock(&bg->lock);
 
-		btrfs_info(fs_info,
+		btrfs_debug(fs_info,
 	"reclaiming chunk %llu with %llu%% used %llu%% reserved %llu%% unusable",
 				bg->start,
 				div64_u64(used * 100, bg->length),
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC 8/9] btrfs: lower log level of relocation messages
  2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
                   ` (6 preceding siblings ...)
  2025-06-27  9:19 ` [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level Johannes Thumshirn
@ 2025-06-27  9:19 ` Johannes Thumshirn
  2025-06-27 11:36   ` Christoph Hellwig
  2025-06-30 17:12   ` David Sterba
  2025-06-27  9:19 ` [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure Johannes Thumshirn
  8 siblings, 2 replies; 23+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

When running a system with automatic reclaim/balancing enabled, there are
lots of info level messages like the following in the kernel log:

 BTRFS info (device nvme2n1): relocating block group 629212708864 flags data
 BTRFS info (device nvme2n1): found 510 extents, stage: move data extents

Lower the log level to debug for these messages.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/relocation.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index d7ec1d72821c..46b9236708ed 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3892,7 +3892,7 @@ static void describe_relocation(struct btrfs_block_group *block_group)
 
 	btrfs_describe_block_groups(block_group->flags, buf, sizeof(buf));
 
-	btrfs_info(block_group->fs_info, "relocating block group %llu flags %s",
+	btrfs_debug(block_group->fs_info, "relocating block group %llu flags %s",
 		   block_group->start, buf);
 }
 
@@ -4044,7 +4044,7 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
 		if (rc->extents_found == 0)
 			break;
 
-		btrfs_info(fs_info, "found %llu extents, stage: %s",
+		btrfs_debug(fs_info, "found %llu extents, stage: %s",
 			   rc->extents_found, stage_to_string(finishes_stage));
 	}
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure
  2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
                   ` (7 preceding siblings ...)
  2025-06-27  9:19 ` [PATCH RFC 8/9] btrfs: lower log level of relocation messages Johannes Thumshirn
@ 2025-06-27  9:19 ` Johannes Thumshirn
  2025-06-27 11:38   ` Christoph Hellwig
  2025-06-27 12:14   ` Filipe Manana
  8 siblings, 2 replies; 23+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

In case find_free_extent() return ENOSPC, check if there are block-groups
in the filsystem which have been marked as 'unused' and if so, reclaim the
space occupied by these block-groups.

Restart the search for free space to place the extent afterwards.

In case the allocation is targeted for the data relocation root, skip this
step, as it can cause deadlocks between block group deletion and relocation.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/block-group.h | 11 +++++++++++
 fs/btrfs/extent-tree.c |  5 +++++
 2 files changed, 16 insertions(+)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index a8bb8429c966..d5c91db88456 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -396,4 +396,15 @@ int btrfs_use_block_group_size_class(struct btrfs_block_group *bg,
 				     bool force_wrong_size_class);
 bool btrfs_block_group_should_use_size_class(const struct btrfs_block_group *bg);
 
+static inline bool btrfs_has_unused_block_groups(struct btrfs_fs_info *fs_info)
+{
+	bool unused_bgs;
+
+	spin_lock(&fs_info->unused_bgs_lock);
+	unused_bgs = !list_empty(&fs_info->unused_bgs);
+	spin_unlock(&fs_info->unused_bgs_lock);
+
+	return unused_bgs;
+}
+
 #endif /* BTRFS_BLOCK_GROUP_H */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index da731f6d4dad..34d21713c6ab 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4683,6 +4683,11 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 	if (!ret && !is_data) {
 		btrfs_dec_block_group_reservations(fs_info, ins->objectid);
 	} else if (ret == -ENOSPC) {
+		if (!btrfs_is_data_reloc_root(root) &&
+		    btrfs_has_unused_block_groups(fs_info)) {
+			btrfs_delete_unused_bgs(fs_info);
+			goto again;
+		}
 		if (!final_tried && ins->offset) {
 			num_bytes = min(num_bytes >> 1, ins->offset);
 			num_bytes = round_down(num_bytes,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target
  2025-06-27  9:19 ` [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target Johannes Thumshirn
@ 2025-06-27 11:34   ` Christoph Hellwig
  2025-07-02 15:34     ` Naohiro Aota
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2025-06-27 11:34 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
	Josef Bacik, Boris Burkov, Filipe Manana

On Fri, Jun 27, 2025 at 11:19:06AM +0200, Johannes Thumshirn wrote:
> From: Naohiro Aota <naohiro.aota@wdc.com>
> 
> We call btrfs_zone_finish_one_bg() to zone finish one block group and make
> a room to activate another block group. Currently, we can choose a metadata
> block group as a target. But, as we reserve an active metadata block group,
> we no longer want to select a metadata block group. So, skip it in the
> loop.

Q: why do you finish a currently open zone to start with?  If you add
an extra zones worth of over provisioning, you have enough slack to
always be able to fill to the advertized capacity, and never need to
finish an open zone before it is fully filled.  Which simplifies the
implementation and reduces P/E cycles.

> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

You'll also need to add your signoff here when sending the patch on.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level
  2025-06-27  9:19 ` [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level Johannes Thumshirn
@ 2025-06-27 11:35   ` Christoph Hellwig
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2025-06-27 11:35 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
	Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn

On Fri, Jun 27, 2025 at 11:19:12AM +0200, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> When running a system with automatic reclaim/balancing enabled, there are
> lots of info level messages like the following in the kernel log:
> 
>  BTRFS info (device nvme2n1): reclaiming chunk 1138166333440 with 10% used 0% reserved 89% unusable
> 
> Lower the log level to debug for these messages.

Wouldn't a trace even be an even better choice for this?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 8/9] btrfs: lower log level of relocation messages
  2025-06-27  9:19 ` [PATCH RFC 8/9] btrfs: lower log level of relocation messages Johannes Thumshirn
@ 2025-06-27 11:36   ` Christoph Hellwig
  2025-06-30 17:12   ` David Sterba
  1 sibling, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2025-06-27 11:36 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
	Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn

On Fri, Jun 27, 2025 at 11:19:13AM +0200, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> When running a system with automatic reclaim/balancing enabled, there are
> lots of info level messages like the following in the kernel log:
> 
>  BTRFS info (device nvme2n1): relocating block group 629212708864 flags data
>  BTRFS info (device nvme2n1): found 510 extents, stage: move data extents
> 
> Lower the log level to debug for these messages.

Same here.  This is useful debug information, but I'd expect something
like this to be a trace event that is enabled as needed, and not
something going into the kernel log.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure
  2025-06-27  9:19 ` [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure Johannes Thumshirn
@ 2025-06-27 11:38   ` Christoph Hellwig
  2025-06-30 11:45     ` Johannes Thumshirn
  2025-06-27 12:14   ` Filipe Manana
  1 sibling, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2025-06-27 11:38 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
	Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn

On Fri, Jun 27, 2025 at 11:19:14AM +0200, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> In case find_free_extent() return ENOSPC, check if there are block-groups
> in the filsystem which have been marked as 'unused' and if so, reclaim the
> space occupied by these block-groups.
> 
> Restart the search for free space to place the extent afterwards.
> 
> In case the allocation is targeted for the data relocation root, skip this
> step, as it can cause deadlocks between block group deletion and relocation.

Assuming an unused BG is one without space in it that just needs a zone
reset or discard (a quick look at the code seems to confirm that, but
with some extra caveats):  why don't you reclaim it ASAP once it becomes
unused, at least modulo those space reservation caveats (which I don't
understand from that quick look).


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure
  2025-06-27  9:19 ` [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure Johannes Thumshirn
  2025-06-27 11:38   ` Christoph Hellwig
@ 2025-06-27 12:14   ` Filipe Manana
  1 sibling, 0 replies; 23+ messages in thread
From: Filipe Manana @ 2025-06-27 12:14 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
	Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn

On Fri, Jun 27, 2025 at 10:36 AM Johannes Thumshirn <jth@kernel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> In case find_free_extent() return ENOSPC, check if there are block-groups
> in the filsystem which have been marked as 'unused' and if so, reclaim the
> space occupied by these block-groups.
>
> Restart the search for free space to place the extent afterwards.
>
> In case the allocation is targeted for the data relocation root, skip this
> step, as it can cause deadlocks between block group deletion and relocation.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/block-group.h | 11 +++++++++++
>  fs/btrfs/extent-tree.c |  5 +++++
>  2 files changed, 16 insertions(+)
>
> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> index a8bb8429c966..d5c91db88456 100644
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -396,4 +396,15 @@ int btrfs_use_block_group_size_class(struct btrfs_block_group *bg,
>                                      bool force_wrong_size_class);
>  bool btrfs_block_group_should_use_size_class(const struct btrfs_block_group *bg);
>
> +static inline bool btrfs_has_unused_block_groups(struct btrfs_fs_info *fs_info)
> +{
> +       bool unused_bgs;
> +
> +       spin_lock(&fs_info->unused_bgs_lock);
> +       unused_bgs = !list_empty(&fs_info->unused_bgs);
> +       spin_unlock(&fs_info->unused_bgs_lock);
> +
> +       return unused_bgs;
> +}
> +
>  #endif /* BTRFS_BLOCK_GROUP_H */
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index da731f6d4dad..34d21713c6ab 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4683,6 +4683,11 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
>         if (!ret && !is_data) {
>                 btrfs_dec_block_group_reservations(fs_info, ins->objectid);
>         } else if (ret == -ENOSPC) {
> +               if (!btrfs_is_data_reloc_root(root) &&
> +                   btrfs_has_unused_block_groups(fs_info)) {
> +                       btrfs_delete_unused_bgs(fs_info);
> +                       goto again;
> +               }

Unfortunately this won't solve the -ENOSPC.

A deleted block group can't be reused in the same transaction, we have
to commit the transaction used to delete it.
This is to respect COW semantics and crash proof consistency.
And we can't commit here the transaction since that would deadlock for
any path that holds a transaction handle open, such as when modifying
any tree for example.

So unless some other task happens to commit the transaction used by
btrfs_delete_unused_bgs() after we call btrfs_delete_unused_bgs() and
before we retry the extent reservation when jumping to 'again',
-ENOSPC will happen again.


>                 if (!final_tried && ins->offset) {
>                         num_bytes = min(num_bytes >> 1, ins->offset);
>                         num_bytes = round_down(num_bytes,
> --
> 2.49.0
>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 6/9] btrfs: remove btrfs_root's delalloc_mutex
  2025-06-27  9:19 ` [PATCH RFC 6/9] btrfs: remove btrfs_root's delalloc_mutex Johannes Thumshirn
@ 2025-06-27 12:30   ` Filipe Manana
  0 siblings, 0 replies; 23+ messages in thread
From: Filipe Manana @ 2025-06-27 12:30 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
	Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn

On Fri, Jun 27, 2025 at 10:23 AM Johannes Thumshirn <jth@kernel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> When running metadata space reclaim under high I/O concurrency, we observe
> hung tasks caused by lock contention on `btrfs_root::delalloc_mutex`. For
> example:
>
>   INFO: task kworker/u132:1:2177 blocked for more than 122 seconds.
>         Not tainted 6.16.0-rc3+ #1246
>   Workqueue: events_unbound btrfs_preempt_reclaim_metadata_space
>   Call Trace:
>     __schedule+0x2f9/0x7b0
>     schedule+0x27/0x80
>     __mutex_lock.constprop.0+0x4af/0x890
>     start_delalloc_inodes+0x6e/0x400
>     btrfs_start_delalloc_roots+0x162/0x270
>     shrink_delalloc+0x10c/0x2d0
>     flush_space+0x202/0x280
>     btrfs_preempt_reclaim_metadata_space+0xe7/0x340
>
> The `delalloc_mutex` serializes delalloc flushing per root but is no
> longer necessary. All critical paths (inode flushing, extent writing,
> metadata updates) are already synchronized using finer-grained locking at
> the inode, page, and tree levels. In particular, concurrent flushers
> coordinate via inode locking, and no shared state requires global
> serialization across the root.

Well that's not enough...
The mutex is there to ensure that if two (or more) tasks call
start_delalloc_inodes(), they only return after all IO is flushed and
ordered extents completed.
Without the mutex we break the semantics.

For example, after removing the mutex as you propose here, we have:

1) Task A enters start_delalloc_inodes() and takes the spinlock
root->delalloc_lock;

2) Task B also enters start_delalloc_inodes() and spins on
root->delalloc_lock because it's currently held by task A;

3) Task A extracts inode X from the local split list, grabs a
reference for it and unlocks root->delalloc_lock;

4) Task B takes the root->delalloc_lock and sees an empty
root->delalloc_inodes list, so it will never wait for all the
writeback and ordered extents from inode X to complete, and return
before they complete.

You see, inode locks, page locks, etc, don't even enter the equation
here and are irrelevant.


>
> Removing this mutex avoids unnecessary blocking in reclaim paths and
> improves responsiveness under pressure, especially on systems with many
> flushers or multi-queue SSDs/ZNS devices.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Duplicated SoB tags.

Thanks.

> ---
>  fs/btrfs/ctree.h   | 1 -
>  fs/btrfs/disk-io.c | 1 -
>  fs/btrfs/inode.c   | 2 --
>  3 files changed, 4 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 8a54a0b6e502..06c7742a5de0 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -238,7 +238,6 @@ struct btrfs_root {
>         spinlock_t root_item_lock;
>         refcount_t refs;
>
> -       struct mutex delalloc_mutex;
>         spinlock_t delalloc_lock;
>         /*
>          * all of the inodes that have delalloc bytes.  It is possible for
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 929f39886b0e..e39f5e893312 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -678,7 +678,6 @@ static struct btrfs_root *btrfs_alloc_root(struct btrfs_fs_info *fs_info,
>         mutex_init(&root->objectid_mutex);
>         mutex_init(&root->log_mutex);
>         mutex_init(&root->ordered_extent_mutex);
> -       mutex_init(&root->delalloc_mutex);
>         init_waitqueue_head(&root->qgroup_flush_wait);
>         init_waitqueue_head(&root->log_writer_wait);
>         init_waitqueue_head(&root->log_commit_wait[0]);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d68f4ef61c43..b9c52b9ea912 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8673,7 +8673,6 @@ static int start_delalloc_inodes(struct btrfs_root *root,
>         int ret = 0;
>         bool full_flush = wbc->nr_to_write == LONG_MAX;
>
> -       mutex_lock(&root->delalloc_mutex);
>         spin_lock(&root->delalloc_lock);
>         list_splice_init(&root->delalloc_inodes, &splice);
>         while (!list_empty(&splice)) {
> @@ -8730,7 +8729,6 @@ static int start_delalloc_inodes(struct btrfs_root *root,
>                 list_splice_tail(&splice, &root->delalloc_inodes);
>                 spin_unlock(&root->delalloc_lock);
>         }
> -       mutex_unlock(&root->delalloc_mutex);
>         return ret;
>  }
>
> --
> 2.49.0
>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 5/9] btrfs: remove delalloc_root_mutex
  2025-06-27  9:19 ` [PATCH RFC 5/9] btrfs: remove delalloc_root_mutex Johannes Thumshirn
@ 2025-06-27 12:42   ` Filipe Manana
  0 siblings, 0 replies; 23+ messages in thread
From: Filipe Manana @ 2025-06-27 12:42 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
	Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn

On Fri, Jun 27, 2025 at 10:23 AM Johannes Thumshirn <jth@kernel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> When benchmarking garbage collection on zoned BTRFS filesystems on ZNS
> drives, we regularly observe hung_task messages like the following:
>
> INFO: task kworker/u132:2:297 blocked for more than 122 seconds.
>        Not tainted 6.16.0-rc1+ #1225
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  task:kworker/u132:2  state:D stack:0     pid:297   tgid:297   ppid:2      task_flags:0x4208060 flags:0x00004000
>  Workqueue: events_unbound btrfs_preempt_reclaim_metadata_space
>  Call Trace:
>   <TASK>
>   __schedule+0x2f9/0x7b0
>   schedule+0x27/0x80
>   schedule_preempt_disabled+0x15/0x30
>   __mutex_lock.constprop.0+0x4af/0x890
>   ? srso_return_thunk+0x5/0x5f
>   btrfs_start_delalloc_roots+0x8a/0x290
>   ? timerqueue_del+0x2e/0x60
>   shrink_delalloc+0x10c/0x2d0
>   ? srso_return_thunk+0x5/0x5f
>   ? psi_group_change+0x19e/0x460
>   ? srso_return_thunk+0x5/0x5f
>   ? btrfs_reduce_alloc_profile+0x9a/0x1d0
>   flush_space+0x202/0x280
>   ? srso_return_thunk+0x5/0x5f
>   ? need_preemptive_reclaim+0xaa/0x190
>   btrfs_preempt_reclaim_metadata_space+0xe7/0x340
>   process_one_work+0x192/0x350
>   worker_thread+0x25a/0x3a0
>   ? __pfx_worker_thread+0x10/0x10
>   kthread+0xfc/0x240
>   ? __pfx_kthread+0x10/0x10
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x152/0x180
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  INFO: task kworker/u132:2:297 is blocked on a mutex likely owned by task kworker/u129:0:2359.
>  task:kworker/u129:0  state:R  running task     stack:0     pid:2359  tgid:2359  ppid:2
>
> The affected tasks are blocked on 'struct btrfs_fs_info::delalloc_root_mutex',
> a global lock that serializes entry into btrfs_start_delalloc_roots().
> This lock was introduced in commit 573bfb72f760 ("Btrfs: fix possible
> empty list access when flushing the delalloc inodes") but without a
> clear justification for its necessity.
>
> However, the condition it was meant to protect against—a possibly empty
> list access—is already safely handled by 'list_splice_init()', which
> does nothing when the source list is empty.
>
> There are no known concurrency issues in btrfs_start_delalloc_roots()
> that require serialization via this mutex. All critical regions are
> either covered by per-root locking or operate on safely isolated lists.

Nop... see comments further below.

>
> Removing the lock eliminates the observed hangs and improves metadata
> GC throughput, particularly on systems with high concurrency like
> ZNS-based deployments.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/disk-io.c | 1 -
>  fs/btrfs/fs.h      | 1 -
>  fs/btrfs/inode.c   | 2 --
>  3 files changed, 4 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 35cd38de7727..929f39886b0e 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2795,7 +2795,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>         mutex_init(&fs_info->unused_bg_unpin_mutex);
>         mutex_init(&fs_info->reclaim_bgs_lock);
>         mutex_init(&fs_info->reloc_mutex);
> -       mutex_init(&fs_info->delalloc_root_mutex);
>         mutex_init(&fs_info->zoned_meta_io_lock);
>         mutex_init(&fs_info->zoned_data_reloc_io_lock);
>         seqlock_init(&fs_info->profiles_lock);
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index a388af40a251..04ebc976f841 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -606,7 +606,6 @@ struct btrfs_fs_info {
>          */
>         struct list_head ordered_roots;
>
> -       struct mutex delalloc_root_mutex;
>         spinlock_t delalloc_root_lock;
>         /* All fs/file tree roots that have delalloc inodes. */
>         struct list_head delalloc_roots;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 80c72c594b19..d68f4ef61c43 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8766,7 +8766,6 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
>         if (BTRFS_FS_ERROR(fs_info))
>                 return -EROFS;
>
> -       mutex_lock(&fs_info->delalloc_root_mutex);
>         spin_lock(&fs_info->delalloc_root_lock);
>         list_splice_init(&fs_info->delalloc_roots, &splice);
>         while (!list_empty(&splice)) {
> @@ -8800,7 +8799,6 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
>                 list_splice_tail(&splice, &fs_info->delalloc_roots);
>                 spin_unlock(&fs_info->delalloc_root_lock);
>         }
> -       mutex_unlock(&fs_info->delalloc_root_mutex);

The lock is useful and exists to make sure two tasks calling this
function wait for all dealloc to be flushed and ordered extents to
complete for all inodes from all roots.

The problem is similar to the one I pointed out for the next patch,
but perhaps a bit more subtle.

So after applying this patch:

1) Task A enters btrfs_start_delalloc_roots() and takes the spinlock
fs_info->delalloc_root_lock;

2) Task A splices the fs_info->delalloc_roots list into the local
splice list. The list has two roots, root X and root Y;

3) Task  A enters the first iteration of the while loop, extracts root
X from the split list, grabs a reference for it, adds it back to the
fs_info->delalloc_roots list and unlocks fs_info->delalloc_root_lock;

4) Task B enters btrfs_start_delalloc_roots(), takes the
fs_info->delalloc_root_lock lock;

5) Task B splices the fs_info->delalloc_roots list into the local
splice list - the list only contains root X -> root Y is held only the
splice list from task A.

As a consequence task B will never wait for writeback and ordered
extents from inodes from root Y to complete.
Therefore breaking the expected semantics of btrfs_start_delalloc_roots().

Thanks.



>         return ret;
>  }

>
> --
> 2.49.0
>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure
  2025-06-27 11:38   ` Christoph Hellwig
@ 2025-06-30 11:45     ` Johannes Thumshirn
  2025-06-30 12:05       ` Filipe Manana
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2025-06-30 11:45 UTC (permalink / raw)
  To: hch@infradead.org, Johannes Thumshirn
  Cc: linux-btrfs@vger.kernel.org, Damien Le Moal, Naohiro Aota,
	David Sterba, Josef Bacik, Boris Burkov, Filipe Manana

On 27.06.25 13:39, Christoph Hellwig wrote:
> On Fri, Jun 27, 2025 at 11:19:14AM +0200, Johannes Thumshirn wrote:
>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> In case find_free_extent() return ENOSPC, check if there are block-groups
>> in the filsystem which have been marked as 'unused' and if so, reclaim the
>> space occupied by these block-groups.
>>
>> Restart the search for free space to place the extent afterwards.
>>
>> In case the allocation is targeted for the data relocation root, skip this
>> step, as it can cause deadlocks between block group deletion and relocation.
> 
> Assuming an unused BG is one without space in it that just needs a zone
> reset or discard (a quick look at the code seems to confirm that, but
> with some extra caveats):  why don't you reclaim it ASAP once it becomes
> unused, at least modulo those space reservation caveats (which I don't
> understand from that quick look).
> 
> 

I've looked into it looks promising. Threw it into fstests and (up to 
now) nothing broke. So I'll run Damien's scripts on a ZNS drive and 
we'll see if it helps.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure
  2025-06-30 11:45     ` Johannes Thumshirn
@ 2025-06-30 12:05       ` Filipe Manana
  0 siblings, 0 replies; 23+ messages in thread
From: Filipe Manana @ 2025-06-30 12:05 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: hch@infradead.org, Johannes Thumshirn,
	linux-btrfs@vger.kernel.org, Damien Le Moal, Naohiro Aota,
	David Sterba, Josef Bacik, Boris Burkov, Filipe Manana

On Mon, Jun 30, 2025 at 12:46 PM Johannes Thumshirn
<Johannes.Thumshirn@wdc.com> wrote:
>
> On 27.06.25 13:39, Christoph Hellwig wrote:
> > On Fri, Jun 27, 2025 at 11:19:14AM +0200, Johannes Thumshirn wrote:
> >> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> >>
> >> In case find_free_extent() return ENOSPC, check if there are block-groups
> >> in the filsystem which have been marked as 'unused' and if so, reclaim the
> >> space occupied by these block-groups.
> >>
> >> Restart the search for free space to place the extent afterwards.
> >>
> >> In case the allocation is targeted for the data relocation root, skip this
> >> step, as it can cause deadlocks between block group deletion and relocation.
> >
> > Assuming an unused BG is one without space in it that just needs a zone
> > reset or discard (a quick look at the code seems to confirm that, but
> > with some extra caveats):  why don't you reclaim it ASAP once it becomes
> > unused, at least modulo those space reservation caveats (which I don't
> > understand from that quick look).
> >
> >
>
> I've looked into it looks promising. Threw it into fstests and (up to
> now) nothing broke. So I'll run Damien's scripts on a ZNS drive and
> we'll see if it helps.

That brings a new problem.

For example a data block group becomes empty and you delete it immediately.
If a data allocation happens before the transaction used to delete the
block group is committed and there are no other data block groups with
enough space and there's no more unallocated device space, we will
-ENOSPC, whereas before we wouldn't.

Remember that a delete block group's space can only be allocated again
after the transaction used to delete it is committed, to respect COW
semantics.
That's why you see the allocator using the commit root (at
find_free_dev_extent()).
And you can't commit the transaction as soon as the bg becomes used,
as we're holding a transaction handle open and would deadlock.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 8/9] btrfs: lower log level of relocation messages
  2025-06-27  9:19 ` [PATCH RFC 8/9] btrfs: lower log level of relocation messages Johannes Thumshirn
  2025-06-27 11:36   ` Christoph Hellwig
@ 2025-06-30 17:12   ` David Sterba
  2025-07-01  5:09     ` Johannes Thumshirn
  1 sibling, 1 reply; 23+ messages in thread
From: David Sterba @ 2025-06-30 17:12 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
	Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn

On Fri, Jun 27, 2025 at 11:19:13AM +0200, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> When running a system with automatic reclaim/balancing enabled, there are
> lots of info level messages like the following in the kernel log:
> 
>  BTRFS info (device nvme2n1): relocating block group 629212708864 flags data
>  BTRFS info (device nvme2n1): found 510 extents, stage: move data extents
> 
> Lower the log level to debug for these messages.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

I kind of like that the message is in the system log on the info level,
it's a high level operation and tracks the progress. Also it's been
there forever and I don't think I'm the only one used to seeing it
there. We have many info messages and vague guidelines when to use it,
but I think "once per block group" is still within the intentions.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 8/9] btrfs: lower log level of relocation messages
  2025-06-30 17:12   ` David Sterba
@ 2025-07-01  5:09     ` Johannes Thumshirn
  2025-07-01 14:43       ` David Sterba
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2025-07-01  5:09 UTC (permalink / raw)
  To: dsterba@suse.cz, Johannes Thumshirn
  Cc: linux-btrfs@vger.kernel.org, Damien Le Moal, Naohiro Aota,
	David Sterba, Josef Bacik, Boris Burkov, Filipe Manana

On 30.06.25 19:12, David Sterba wrote:
> On Fri, Jun 27, 2025 at 11:19:13AM +0200, Johannes Thumshirn wrote:
>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> When running a system with automatic reclaim/balancing enabled, there are
>> lots of info level messages like the following in the kernel log:
>>
>>   BTRFS info (device nvme2n1): relocating block group 629212708864 flags data
>>   BTRFS info (device nvme2n1): found 510 extents, stage: move data extents
>>
>> Lower the log level to debug for these messages.
>>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> I kind of like that the message is in the system log on the info level,
> it's a high level operation and tracks the progress. Also it's been
> there forever and I don't think I'm the only one used to seeing it
> there. We have many info messages and vague guidelines when to use it,
> but I think "once per block group" is still within the intentions.
> 

Yes but now that automatic balancing is in place this is spamming all 
over dmesg.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 8/9] btrfs: lower log level of relocation messages
  2025-07-01  5:09     ` Johannes Thumshirn
@ 2025-07-01 14:43       ` David Sterba
  0 siblings, 0 replies; 23+ messages in thread
From: David Sterba @ 2025-07-01 14:43 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: dsterba@suse.cz, Johannes Thumshirn, linux-btrfs@vger.kernel.org,
	Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana

On Tue, Jul 01, 2025 at 05:09:06AM +0000, Johannes Thumshirn wrote:
> On 30.06.25 19:12, David Sterba wrote:
> > On Fri, Jun 27, 2025 at 11:19:13AM +0200, Johannes Thumshirn wrote:
> >> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> >>
> >> When running a system with automatic reclaim/balancing enabled, there are
> >> lots of info level messages like the following in the kernel log:
> >>
> >>   BTRFS info (device nvme2n1): relocating block group 629212708864 flags data
> >>   BTRFS info (device nvme2n1): found 510 extents, stage: move data extents
> >>
> >> Lower the log level to debug for these messages.
> >>
> >> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> > 
> > I kind of like that the message is in the system log on the info level,
> > it's a high level operation and tracks the progress. Also it's been
> > there forever and I don't think I'm the only one used to seeing it
> > there. We have many info messages and vague guidelines when to use it,
> > but I think "once per block group" is still within the intentions.
> > 
> 
> Yes but now that automatic balancing is in place this is spamming all 
> over dmesg.

We could distinguish the reason of relocation so the one started
manually will print what we have now and the automatic only say two
messages like "starting automatic bg cleanup" and once it finishes some
kind of summary "bg cleanup removed 123 block groups". If you want
additional debugging just for the automatic reclaim then it's OK.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target
  2025-06-27 11:34   ` Christoph Hellwig
@ 2025-07-02 15:34     ` Naohiro Aota
  0 siblings, 0 replies; 23+ messages in thread
From: Naohiro Aota @ 2025-07-02 15:34 UTC (permalink / raw)
  To: hch@infradead.org, Johannes Thumshirn
  Cc: linux-btrfs@vger.kernel.org, Damien Le Moal, Naohiro Aota,
	David Sterba, Josef Bacik, Boris Burkov, Filipe Manana

On Fri Jun 27, 2025 at 8:34 PM JST, Christoph Hellwig wrote:
> On Fri, Jun 27, 2025 at 11:19:06AM +0200, Johannes Thumshirn wrote:
>> From: Naohiro Aota <naohiro.aota@wdc.com>
>> 
>> We call btrfs_zone_finish_one_bg() to zone finish one block group and make
>> a room to activate another block group. Currently, we can choose a metadata
>> block group as a target. But, as we reserve an active metadata block group,
>> we no longer want to select a metadata block group. So, skip it in the
>> loop.
>
> Q: why do you finish a currently open zone to start with?  If you add
> an extra zones worth of over provisioning, you have enough slack to
> always be able to fill to the advertized capacity, and never need to
> finish an open zone before it is fully filled.  Which simplifies the
> implementation and reduces P/E cycles.

Basically, this is called when data extent allocation cannot activate a
new zone, so the number of active zones == max active zones. In this
case, it first call btrfs_zone_finish_one_bg() to try to finish a zone
with minimum free space. If it succeeds, we can allocate new block group
and allocate an extent from there. Or, it retries the allocation with a
smaller size. So, it just prefers zone finishing than filling with a
fragmented allocation.

Another usage is when it writes to metadata. While we reserve zones for
metadata, this can be an escape hatch to finish some zones and make a
room for the new writing.

>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>
> You'll also need to add your signoff here when sending the patch on.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-07-02 15:36 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
2025-06-27  9:19 ` [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target Johannes Thumshirn
2025-06-27 11:34   ` Christoph Hellwig
2025-07-02 15:34     ` Naohiro Aota
2025-06-27  9:19 ` [PATCH RFC 2/9] btrfs: zoned: get rid of relocation_bg_lock Johannes Thumshirn
2025-06-27  9:19 ` [PATCH RFC 3/9] btrfs: zoned: get rid of treelog_bg_lock Johannes Thumshirn
2025-06-27  9:19 ` [PATCH RFC 4/9] btrfs: zoned: don't hold space_info lock on zoned allocation Johannes Thumshirn
2025-06-27  9:19 ` [PATCH RFC 5/9] btrfs: remove delalloc_root_mutex Johannes Thumshirn
2025-06-27 12:42   ` Filipe Manana
2025-06-27  9:19 ` [PATCH RFC 6/9] btrfs: remove btrfs_root's delalloc_mutex Johannes Thumshirn
2025-06-27 12:30   ` Filipe Manana
2025-06-27  9:19 ` [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level Johannes Thumshirn
2025-06-27 11:35   ` Christoph Hellwig
2025-06-27  9:19 ` [PATCH RFC 8/9] btrfs: lower log level of relocation messages Johannes Thumshirn
2025-06-27 11:36   ` Christoph Hellwig
2025-06-30 17:12   ` David Sterba
2025-07-01  5:09     ` Johannes Thumshirn
2025-07-01 14:43       ` David Sterba
2025-06-27  9:19 ` [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure Johannes Thumshirn
2025-06-27 11:38   ` Christoph Hellwig
2025-06-30 11:45     ` Johannes Thumshirn
2025-06-30 12:05       ` Filipe Manana
2025-06-27 12:14   ` Filipe Manana

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox