[PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure
@ 2025-06-27  9:19 Johannes Thumshirn
  2025-06-27  9:19 ` [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target Johannes Thumshirn
                   ` (8 more replies)
  0 siblings, 9 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

A few weeks ago Damien started benchmarking zoned BTRFS and zoned XFS
garbage collection on SMR and ZNS drives. 

One of these benchmarks was filling the FS to 75% and then subsequientially
overwriting the files again.

One thing that came out of these trials, is that BTRFS on ZNS devices did
not finish the overwrite phase because it prematurely ran out of space.

Once I tried reclaiming earlier (patch 9/9) I've seen haning task because
of extremely high lock contention on serveral locks. The patches 2 to 6
either remove these locks, or in case of the space-info lock remove
unneeded lock contentiom.

Patches 7 and 8 lower the log level of reclaim messages, because with the
automatic reclaim BTRFS has these days there is a lot of log spamming.

Pathch 8 triggers removal of unused block groups on allocation failure, I
opted for making it non zoned specific as I think even regular filesystems
can benefit from this.. The first patch from Naohiro is also needed so we
don't prematurely finish metadata block groups on zoned devices.

Reasons for RFC, there are a lot of locking changes involved in this
series. As much as I'm comfortable with running fio workloads and fstests,
I really prefere more eyes. Especially Filipe always is a good reviewer of
locking related changes and Boris for reclaiming. The series is probably
not the final solution either but only forward progress, but I don't want
to be running into the wrong direction.

This series also includes a previously sent series of me titled "btrfs:
zoned: reduce lock contention in zoned extent allocator" and can be found
here:
https://lore.kernel.org/linux-btrfs/20250624093710.18685-1-jth@kernel.org/

Johannes Thumshirn (8):
  btrfs: zoned: get rid of relocation_bg_lock
  btrfs: zoned: get rid of treelog_bg_lock
  btrfs: zoned: don't hold space_info lock on zoned allocation
  btrfs: remove delalloc_root_mutex
  btrfs: remove btrfs_root's delalloc_mutex
  btrfs: lower auto-reclaim message log level
  btrfs: lower log level of relocation messages
  btrfs: remove unused bgs on allocation failure

Naohiro Aota (1):
  btrfs: zoned: do not select metadata BG as finish target

 fs/btrfs/block-group.c |  2 +-
 fs/btrfs/block-group.h | 11 ++++++
 fs/btrfs/ctree.h       |  1 -
 fs/btrfs/disk-io.c     |  4 ---
 fs/btrfs/extent-tree.c | 81 ++++++++++++++----------------------------
 fs/btrfs/fs.h          |  8 +----
 fs/btrfs/inode.c       |  4 ---
 fs/btrfs/relocation.c  |  4 +--
 fs/btrfs/zoned.c       | 19 +++++-----
 fs/btrfs/zoned.h       |  7 ++--
 10 files changed, 55 insertions(+), 86 deletions(-)

-- 
2.49.0

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target
  2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
@ 2025-06-27  9:19 ` Johannes Thumshirn
  2025-06-27 11:34   ` Christoph Hellwig
  2025-06-27  9:19 ` [PATCH RFC 2/9] btrfs: zoned: get rid of relocation_bg_lock Johannes Thumshirn
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana

From: Naohiro Aota <naohiro.aota@wdc.com>

We call btrfs_zone_finish_one_bg() to zone finish one block group and make
a room to activate another block group. Currently, we can choose a metadata
block group as a target. But, as we reserve an active metadata block group,
we no longer want to select a metadata block group. So, skip it in the
loop.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/zoned.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index bd987c90a05c..0d5d6db72b62 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2651,8 +2651,10 @@ int btrfs_zone_finish_one_bg(struct btrfs_fs_info *fs_info)
 
 		spin_lock(&block_group->lock);
 		if (block_group->reserved || block_group->alloc_offset == 0 ||
-		    (block_group->flags & BTRFS_BLOCK_GROUP_SYSTEM) ||
-		    test_bit(BLOCK_GROUP_FLAG_ZONED_DATA_RELOC, &block_group->runtime_flags)) {
+		    (block_group->flags &
+		     (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_SYSTEM)) ||
+		    test_bit(BLOCK_GROUP_FLAG_ZONED_DATA_RELOC,
+			     &block_group->runtime_flags)) {
 			spin_unlock(&block_group->lock);
 			continue;
 		}
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target
  2025-06-27  9:19 ` [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target Johannes Thumshirn
@ 2025-06-27 11:34   ` Christoph Hellwig
  2025-07-02 15:34     ` Naohiro Aota
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2025-06-27 11:34 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
	Josef Bacik, Boris Burkov, Filipe Manana

On Fri, Jun 27, 2025 at 11:19:06AM +0200, Johannes Thumshirn wrote:
> From: Naohiro Aota <naohiro.aota@wdc.com>
> 
> We call btrfs_zone_finish_one_bg() to zone finish one block group and make
> a room to activate another block group. Currently, we can choose a metadata
> block group as a target. But, as we reserve an active metadata block group,
> we no longer want to select a metadata block group. So, skip it in the
> loop.

Q: why do you finish a currently open zone to start with?  If you add
an extra zones worth of over provisioning, you have enough slack to
always be able to fill to the advertized capacity, and never need to
finish an open zone before it is fully filled.  Which simplifies the
implementation and reduces P/E cycles.

> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

You'll also need to add your signoff here when sending the patch on.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target
  2025-06-27 11:34   ` Christoph Hellwig
@ 2025-07-02 15:34     ` Naohiro Aota
  0 siblings, 0 replies; 25+ messages in thread
From: Naohiro Aota @ 2025-07-02 15:34 UTC (permalink / raw)
  To: hch@infradead.org, Johannes Thumshirn
  Cc: linux-btrfs@vger.kernel.org, Damien Le Moal, Naohiro Aota,
	David Sterba, Josef Bacik, Boris Burkov, Filipe Manana

On Fri Jun 27, 2025 at 8:34 PM JST, Christoph Hellwig wrote:
> On Fri, Jun 27, 2025 at 11:19:06AM +0200, Johannes Thumshirn wrote:
>> From: Naohiro Aota <naohiro.aota@wdc.com>
>> 
>> We call btrfs_zone_finish_one_bg() to zone finish one block group and make
>> a room to activate another block group. Currently, we can choose a metadata
>> block group as a target. But, as we reserve an active metadata block group,
>> we no longer want to select a metadata block group. So, skip it in the
>> loop.
>
> Q: why do you finish a currently open zone to start with?  If you add
> an extra zones worth of over provisioning, you have enough slack to
> always be able to fill to the advertized capacity, and never need to
> finish an open zone before it is fully filled.  Which simplifies the
> implementation and reduces P/E cycles.

Basically, this is called when data extent allocation cannot activate a
new zone, so the number of active zones == max active zones. In this
case, it first call btrfs_zone_finish_one_bg() to try to finish a zone
with minimum free space. If it succeeds, we can allocate new block group
and allocate an extent from there. Or, it retries the allocation with a
smaller size. So, it just prefers zone finishing than filling with a
fragmented allocation.

Another usage is when it writes to metadata. While we reserve zones for
metadata, this can be an escape hatch to finish some zones and make a
room for the new writing.

>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>
> You'll also need to add your signoff here when sending the patch on.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH RFC 2/9] btrfs: zoned: get rid of relocation_bg_lock
  2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
  2025-06-27  9:19 ` [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target Johannes Thumshirn
@ 2025-06-27  9:19 ` Johannes Thumshirn
  2025-06-27  9:19 ` [PATCH RFC 3/9] btrfs: zoned: get rid of treelog_bg_lock Johannes Thumshirn
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Lockstat analysis of benchmark workloads shows a very high contention of
the relocation_bg_lock. But the relocation_bg_lock only protects a single
field in 'struct btrfs_fs_info', namely 'u64 data_reloc_bg'.

Use READ_ONCE()/WRITE_ONCE() to access 'btrfs_fs_info::data_reloc_bg'.

This is safe in the allocator path, as relocation I/O is only going to
block groups in the relocation sub-space_info and at the moment, there is
only one relocation block group in this space info.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/disk-io.c     |  1 -
 fs/btrfs/extent-tree.c | 28 +++++++++++-----------------
 fs/btrfs/fs.h          |  6 +-----
 fs/btrfs/zoned.c       | 11 +++++------
 4 files changed, 17 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6ac5be02dce7..9a13f5b1ed43 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2791,7 +2791,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	spin_lock_init(&fs_info->unused_bgs_lock);
 	spin_lock_init(&fs_info->treelog_bg_lock);
 	spin_lock_init(&fs_info->zone_active_bgs_lock);
-	spin_lock_init(&fs_info->relocation_bg_lock);
 	rwlock_init(&fs_info->tree_mod_log_lock);
 	rwlock_init(&fs_info->global_root_lock);
 	mutex_init(&fs_info->unused_bg_unpin_mutex);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 10f50c725313..a9bda68a1883 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3865,14 +3865,10 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	 * Do not allow non-relocation blocks in the dedicated relocation block
 	 * group, and vice versa.
 	 */
-	spin_lock(&fs_info->relocation_bg_lock);
-	data_reloc_bytenr = fs_info->data_reloc_bg;
+	data_reloc_bytenr = READ_ONCE(fs_info->data_reloc_bg);
 	if (data_reloc_bytenr &&
 	    ((ffe_ctl->for_data_reloc && bytenr != data_reloc_bytenr) ||
 	     (!ffe_ctl->for_data_reloc && bytenr == data_reloc_bytenr)))
-		skip = true;
-	spin_unlock(&fs_info->relocation_bg_lock);
-	if (skip)
 		return 1;
 
 	/* Check RO and no space case before trying to activate it */
@@ -3899,7 +3895,6 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	spin_lock(&space_info->lock);
 	spin_lock(&block_group->lock);
 	spin_lock(&fs_info->treelog_bg_lock);
-	spin_lock(&fs_info->relocation_bg_lock);
 
 	if (ret)
 		goto out;
@@ -3908,8 +3903,8 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	       block_group->start == fs_info->treelog_bg ||
 	       fs_info->treelog_bg == 0);
 	ASSERT(!ffe_ctl->for_data_reloc ||
-	       block_group->start == fs_info->data_reloc_bg ||
-	       fs_info->data_reloc_bg == 0);
+	       block_group->start == data_reloc_bytenr ||
+	       data_reloc_bytenr == 0);
 
 	if (block_group->ro ||
 	    (!ffe_ctl->for_data_reloc &&
@@ -3932,7 +3927,7 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	 * Do not allow currently used block group to be the data relocation
 	 * dedicated block group.
 	 */
-	if (ffe_ctl->for_data_reloc && !fs_info->data_reloc_bg &&
+	if (ffe_ctl->for_data_reloc && data_reloc_bytenr == 0 &&
 	    (block_group->used || block_group->reserved)) {
 		ret = 1;
 		goto out;
@@ -3957,8 +3952,8 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 		fs_info->treelog_bg = block_group->start;
 
 	if (ffe_ctl->for_data_reloc) {
-		if (!fs_info->data_reloc_bg)
-			fs_info->data_reloc_bg = block_group->start;
+		if (READ_ONCE(fs_info->data_reloc_bg) == 0)
+			WRITE_ONCE(fs_info->data_reloc_bg, block_group->start);
 		/*
 		 * Do not allow allocations from this block group, unless it is
 		 * for data relocation. Compared to increasing the ->ro, setting
@@ -3994,8 +3989,7 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	if (ret && ffe_ctl->for_treelog)
 		fs_info->treelog_bg = 0;
 	if (ret && ffe_ctl->for_data_reloc)
-		fs_info->data_reloc_bg = 0;
-	spin_unlock(&fs_info->relocation_bg_lock);
+		WRITE_ONCE(fs_info->data_reloc_bg, 0);
 	spin_unlock(&fs_info->treelog_bg_lock);
 	spin_unlock(&block_group->lock);
 	spin_unlock(&space_info->lock);
@@ -4304,10 +4298,10 @@ static int prepare_allocation_zoned(struct btrfs_fs_info *fs_info,
 			ffe_ctl->hint_byte = fs_info->treelog_bg;
 		spin_unlock(&fs_info->treelog_bg_lock);
 	} else if (ffe_ctl->for_data_reloc) {
-		spin_lock(&fs_info->relocation_bg_lock);
-		if (fs_info->data_reloc_bg)
-			ffe_ctl->hint_byte = fs_info->data_reloc_bg;
-		spin_unlock(&fs_info->relocation_bg_lock);
+		u64 data_reloc_bg = READ_ONCE(fs_info->data_reloc_bg);
+
+		if (data_reloc_bg)
+			ffe_ctl->hint_byte = data_reloc_bg;
 	} else if (ffe_ctl->flags & BTRFS_BLOCK_GROUP_DATA) {
 		struct btrfs_block_group *block_group;
 
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index b239e4b8421c..570f4b85096c 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -849,11 +849,7 @@ struct btrfs_fs_info {
 	spinlock_t treelog_bg_lock;
 	u64 treelog_bg;
 
-	/*
-	 * Start of the dedicated data relocation block group, protected by
-	 * relocation_bg_lock.
-	 */
-	spinlock_t relocation_bg_lock;
+	/* Start of the dedicated data relocation block group */
 	u64 data_reloc_bg;
 	struct mutex zoned_data_reloc_io_lock;
 
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 0d5d6db72b62..388c277a84d3 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2495,11 +2495,10 @@ void btrfs_schedule_zone_finish_bg(struct btrfs_block_group *bg,
 void btrfs_clear_data_reloc_bg(struct btrfs_block_group *bg)
 {
 	struct btrfs_fs_info *fs_info = bg->fs_info;
+	u64 data_reloc_bg = READ_ONCE(fs_info->data_reloc_bg);
 
-	spin_lock(&fs_info->relocation_bg_lock);
-	if (fs_info->data_reloc_bg == bg->start)
-		fs_info->data_reloc_bg = 0;
-	spin_unlock(&fs_info->relocation_bg_lock);
+	if (data_reloc_bg == bg->start)
+		WRITE_ONCE(fs_info->data_reloc_bg, 0);
 }
 
 void btrfs_zoned_reserve_data_reloc_bg(struct btrfs_fs_info *fs_info)
@@ -2518,7 +2517,7 @@ void btrfs_zoned_reserve_data_reloc_bg(struct btrfs_fs_info *fs_info)
 	if (!btrfs_is_zoned(fs_info))
 		return;
 
-	if (fs_info->data_reloc_bg)
+	if (READ_ONCE(fs_info->data_reloc_bg))
 		return;
 
 	if (sb_rdonly(fs_info->sb))
@@ -2539,7 +2538,7 @@ void btrfs_zoned_reserve_data_reloc_bg(struct btrfs_fs_info *fs_info)
 			continue;
 		}
 
-		fs_info->data_reloc_bg = bg->start;
+		WRITE_ONCE(fs_info->data_reloc_bg, bg->start);
 		set_bit(BLOCK_GROUP_FLAG_ZONED_DATA_RELOC, &bg->runtime_flags);
 		btrfs_zone_activate(bg);
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 3/9] btrfs: zoned: get rid of treelog_bg_lock
  2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
  2025-06-27  9:19 ` [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target Johannes Thumshirn
  2025-06-27  9:19 ` [PATCH RFC 2/9] btrfs: zoned: get rid of relocation_bg_lock Johannes Thumshirn
@ 2025-06-27  9:19 ` Johannes Thumshirn
  2025-06-27  9:19 ` [PATCH RFC 4/9] btrfs: zoned: don't hold space_info lock on zoned allocation Johannes Thumshirn
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Lockstat analysis of benchmark workloads shows a very high contention of
the treelog_bg_lock. But the treelog_bg_lock only protects a single
field in 'struct btrfs_fs_info', namely 'u64 treelog_bg'.

Use READ_ONCE()/WRITE_ONCE() to access 'btrfs_fs_info::treelog_bg'.

This is safe in the allocator path, as treelog I/O is only going to block
groups in the treelog sub-space_info and at the moment, there is only one
treelog block group in this space info.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/disk-io.c     |  1 -
 fs/btrfs/extent-tree.c | 45 +++++++++++-------------------------------
 fs/btrfs/fs.h          |  1 -
 fs/btrfs/zoned.c       |  2 +-
 fs/btrfs/zoned.h       |  7 +++----
 5 files changed, 15 insertions(+), 41 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9a13f5b1ed43..35cd38de7727 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2789,7 +2789,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	spin_lock_init(&fs_info->defrag_inodes_lock);
 	spin_lock_init(&fs_info->super_lock);
 	spin_lock_init(&fs_info->unused_bgs_lock);
-	spin_lock_init(&fs_info->treelog_bg_lock);
 	spin_lock_init(&fs_info->zone_active_bgs_lock);
 	rwlock_init(&fs_info->tree_mod_log_lock);
 	rwlock_init(&fs_info->global_root_lock);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a9bda68a1883..46358a555f78 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3809,22 +3809,6 @@ static int do_allocation_clustered(struct btrfs_block_group *block_group,
 	return find_free_extent_unclustered(block_group, ffe_ctl);
 }
 
-/*
- * Tree-log block group locking
- * ============================
- *
- * fs_info::treelog_bg_lock protects the fs_info::treelog_bg which
- * indicates the starting address of a block group, which is reserved only
- * for tree-log metadata.
- *
- * Lock nesting
- * ============
- *
- * space_info::lock
- *   block_group::lock
- *     fs_info::treelog_bg_lock
- */
-
 /*
  * Simple allocator for sequential-only block group. It only allows sequential
  * allocation. No need to play with trees. This function also reserves the
@@ -3844,7 +3828,6 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	u64 log_bytenr;
 	u64 data_reloc_bytenr;
 	int ret = 0;
-	bool skip = false;
 
 	ASSERT(btrfs_is_zoned(block_group->fs_info));
 
@@ -3852,13 +3835,9 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	 * Do not allow non-tree-log blocks in the dedicated tree-log block
 	 * group, and vice versa.
 	 */
-	spin_lock(&fs_info->treelog_bg_lock);
-	log_bytenr = fs_info->treelog_bg;
+	log_bytenr = READ_ONCE(fs_info->treelog_bg);
 	if (log_bytenr && ((ffe_ctl->for_treelog && bytenr != log_bytenr) ||
 			   (!ffe_ctl->for_treelog && bytenr == log_bytenr)))
-		skip = true;
-	spin_unlock(&fs_info->treelog_bg_lock);
-	if (skip)
 		return 1;
 
 	/*
@@ -3894,14 +3873,13 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 
 	spin_lock(&space_info->lock);
 	spin_lock(&block_group->lock);
-	spin_lock(&fs_info->treelog_bg_lock);
 
 	if (ret)
 		goto out;
 
 	ASSERT(!ffe_ctl->for_treelog ||
-	       block_group->start == fs_info->treelog_bg ||
-	       fs_info->treelog_bg == 0);
+	       block_group->start == log_bytenr ||
+	       log_bytenr == 0);
 	ASSERT(!ffe_ctl->for_data_reloc ||
 	       block_group->start == data_reloc_bytenr ||
 	       data_reloc_bytenr == 0);
@@ -3917,7 +3895,7 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	 * Do not allow currently using block group to be tree-log dedicated
 	 * block group.
 	 */
-	if (ffe_ctl->for_treelog && !fs_info->treelog_bg &&
+	if (ffe_ctl->for_treelog && log_bytenr == 0 &&
 	    (block_group->used || block_group->reserved)) {
 		ret = 1;
 		goto out;
@@ -3948,8 +3926,8 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 		goto out;
 	}
 
-	if (ffe_ctl->for_treelog && !fs_info->treelog_bg)
-		fs_info->treelog_bg = block_group->start;
+	if (ffe_ctl->for_treelog && READ_ONCE(fs_info->treelog_bg) == 0)
+		WRITE_ONCE(fs_info->treelog_bg, block_group->start);
 
 	if (ffe_ctl->for_data_reloc) {
 		if (READ_ONCE(fs_info->data_reloc_bg) == 0)
@@ -3987,10 +3965,9 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 
 out:
 	if (ret && ffe_ctl->for_treelog)
-		fs_info->treelog_bg = 0;
+		WRITE_ONCE(fs_info->treelog_bg, 0);
 	if (ret && ffe_ctl->for_data_reloc)
 		WRITE_ONCE(fs_info->data_reloc_bg, 0);
-	spin_unlock(&fs_info->treelog_bg_lock);
 	spin_unlock(&block_group->lock);
 	spin_unlock(&space_info->lock);
 	return ret;
@@ -4293,10 +4270,10 @@ static int prepare_allocation_zoned(struct btrfs_fs_info *fs_info,
 				    struct find_free_extent_ctl *ffe_ctl)
 {
 	if (ffe_ctl->for_treelog) {
-		spin_lock(&fs_info->treelog_bg_lock);
-		if (fs_info->treelog_bg)
-			ffe_ctl->hint_byte = fs_info->treelog_bg;
-		spin_unlock(&fs_info->treelog_bg_lock);
+		u64 treelog_bg = READ_ONCE(fs_info->treelog_bg);
+
+		if (treelog_bg)
+			ffe_ctl->hint_byte = treelog_bg;
 	} else if (ffe_ctl->for_data_reloc) {
 		u64 data_reloc_bg = READ_ONCE(fs_info->data_reloc_bg);
 
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 570f4b85096c..a388af40a251 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -846,7 +846,6 @@ struct btrfs_fs_info {
 	u64 max_zone_append_size;
 
 	struct mutex zoned_meta_io_lock;
-	spinlock_t treelog_bg_lock;
 	u64 treelog_bg;
 
 	/* Start of the dedicated data relocation block group */
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 388c277a84d3..c89f846af6dd 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1948,7 +1948,7 @@ static bool check_bg_is_active(struct btrfs_eb_write_context *ctx,
 	if (test_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &block_group->runtime_flags))
 		return true;
 
-	if (fs_info->treelog_bg == block_group->start) {
+	if (READ_ONCE(fs_info->treelog_bg) == block_group->start) {
 		if (!btrfs_zone_activate(block_group)) {
 			int ret_fin = btrfs_zone_finish_one_bg(fs_info);
 
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 6e11533b8e14..c1b3a5c3a799 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -383,14 +383,13 @@ static inline void btrfs_zoned_meta_io_unlock(struct btrfs_fs_info *fs_info)
 static inline void btrfs_clear_treelog_bg(struct btrfs_block_group *bg)
 {
 	struct btrfs_fs_info *fs_info = bg->fs_info;
+	u64 treelog_bg = READ_ONCE(fs_info->treelog_bg);
 
 	if (!btrfs_is_zoned(fs_info))
 		return;
 
-	spin_lock(&fs_info->treelog_bg_lock);
-	if (fs_info->treelog_bg == bg->start)
-		fs_info->treelog_bg = 0;
-	spin_unlock(&fs_info->treelog_bg_lock);
+	if (treelog_bg == bg->start)
+		WRITE_ONCE(fs_info->treelog_bg, 0);
 }
 
 static inline void btrfs_zoned_data_reloc_lock(struct btrfs_inode *inode)
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 4/9] btrfs: zoned: don't hold space_info lock on zoned allocation
  2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
                   ` (2 preceding siblings ...)
  2025-06-27  9:19 ` [PATCH RFC 3/9] btrfs: zoned: get rid of treelog_bg_lock Johannes Thumshirn
@ 2025-06-27  9:19 ` Johannes Thumshirn
  2025-06-27  9:19 ` [PATCH RFC 5/9] btrfs: remove delalloc_root_mutex Johannes Thumshirn
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

The zoned extent allocator holds 'struct btrfs_space_info::lock' nearly
over the entirety of the allocation process, but nothing in
do_allocation_zoned() is actually accessing fields of 'struct
btrfs_space_info'.

Furthermore taking lock_stat snapshots in performance testing, always shows
the space_info::lock as the most contented lock in the entire system.

Remove locking the space_info lock during do_allocation_zoned() to reduce
lock contention.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/extent-tree.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 46358a555f78..da731f6d4dad 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3819,7 +3819,6 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 			       struct btrfs_block_group **bg_ret)
 {
 	struct btrfs_fs_info *fs_info = block_group->fs_info;
-	struct btrfs_space_info *space_info = block_group->space_info;
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	u64 start = block_group->start;
 	u64 num_bytes = ffe_ctl->num_bytes;
@@ -3871,7 +3870,6 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 		 */
 	}
 
-	spin_lock(&space_info->lock);
 	spin_lock(&block_group->lock);
 
 	if (ret)
@@ -3969,7 +3967,6 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	if (ret && ffe_ctl->for_data_reloc)
 		WRITE_ONCE(fs_info->data_reloc_bg, 0);
 	spin_unlock(&block_group->lock);
-	spin_unlock(&space_info->lock);
 	return ret;
 }
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 5/9] btrfs: remove delalloc_root_mutex
  2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
                   ` (3 preceding siblings ...)
  2025-06-27  9:19 ` [PATCH RFC 4/9] btrfs: zoned: don't hold space_info lock on zoned allocation Johannes Thumshirn
@ 2025-06-27  9:19 ` Johannes Thumshirn
  2025-06-27 12:42   ` Filipe Manana
  2025-06-27  9:19 ` [PATCH RFC 6/9] btrfs: remove btrfs_root's delalloc_mutex Johannes Thumshirn
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

When benchmarking garbage collection on zoned BTRFS filesystems on ZNS
drives, we regularly observe hung_task messages like the following:

INFO: task kworker/u132:2:297 blocked for more than 122 seconds.
       Not tainted 6.16.0-rc1+ #1225
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u132:2  state:D stack:0     pid:297   tgid:297   ppid:2      task_flags:0x4208060 flags:0x00004000
 Workqueue: events_unbound btrfs_preempt_reclaim_metadata_space
 Call Trace:
  <TASK>
  __schedule+0x2f9/0x7b0
  schedule+0x27/0x80
  schedule_preempt_disabled+0x15/0x30
  __mutex_lock.constprop.0+0x4af/0x890
  ? srso_return_thunk+0x5/0x5f
  btrfs_start_delalloc_roots+0x8a/0x290
  ? timerqueue_del+0x2e/0x60
  shrink_delalloc+0x10c/0x2d0
  ? srso_return_thunk+0x5/0x5f
  ? psi_group_change+0x19e/0x460
  ? srso_return_thunk+0x5/0x5f
  ? btrfs_reduce_alloc_profile+0x9a/0x1d0
  flush_space+0x202/0x280
  ? srso_return_thunk+0x5/0x5f
  ? need_preemptive_reclaim+0xaa/0x190
  btrfs_preempt_reclaim_metadata_space+0xe7/0x340
  process_one_work+0x192/0x350
  worker_thread+0x25a/0x3a0
  ? __pfx_worker_thread+0x10/0x10
  kthread+0xfc/0x240
  ? __pfx_kthread+0x10/0x10
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x152/0x180
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 INFO: task kworker/u132:2:297 is blocked on a mutex likely owned by task kworker/u129:0:2359.
 task:kworker/u129:0  state:R  running task     stack:0     pid:2359  tgid:2359  ppid:2

The affected tasks are blocked on 'struct btrfs_fs_info::delalloc_root_mutex',
a global lock that serializes entry into btrfs_start_delalloc_roots().
This lock was introduced in commit 573bfb72f760 ("Btrfs: fix possible
empty list access when flushing the delalloc inodes") but without a
clear justification for its necessity.

However, the condition it was meant to protect against—a possibly empty
list access—is already safely handled by 'list_splice_init()', which
does nothing when the source list is empty.

There are no known concurrency issues in btrfs_start_delalloc_roots()
that require serialization via this mutex. All critical regions are
either covered by per-root locking or operate on safely isolated lists.

Removing the lock eliminates the observed hangs and improves metadata
GC throughput, particularly on systems with high concurrency like
ZNS-based deployments.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/disk-io.c | 1 -
 fs/btrfs/fs.h      | 1 -
 fs/btrfs/inode.c   | 2 --
 3 files changed, 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 35cd38de7727..929f39886b0e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2795,7 +2795,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	mutex_init(&fs_info->unused_bg_unpin_mutex);
 	mutex_init(&fs_info->reclaim_bgs_lock);
 	mutex_init(&fs_info->reloc_mutex);
-	mutex_init(&fs_info->delalloc_root_mutex);
 	mutex_init(&fs_info->zoned_meta_io_lock);
 	mutex_init(&fs_info->zoned_data_reloc_io_lock);
 	seqlock_init(&fs_info->profiles_lock);
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index a388af40a251..04ebc976f841 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -606,7 +606,6 @@ struct btrfs_fs_info {
 	 */
 	struct list_head ordered_roots;
 
-	struct mutex delalloc_root_mutex;
 	spinlock_t delalloc_root_lock;
 	/* All fs/file tree roots that have delalloc inodes. */
 	struct list_head delalloc_roots;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 80c72c594b19..d68f4ef61c43 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8766,7 +8766,6 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
 	if (BTRFS_FS_ERROR(fs_info))
 		return -EROFS;
 
-	mutex_lock(&fs_info->delalloc_root_mutex);
 	spin_lock(&fs_info->delalloc_root_lock);
 	list_splice_init(&fs_info->delalloc_roots, &splice);
 	while (!list_empty(&splice)) {
@@ -8800,7 +8799,6 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
 		list_splice_tail(&splice, &fs_info->delalloc_roots);
 		spin_unlock(&fs_info->delalloc_root_lock);
 	}
-	mutex_unlock(&fs_info->delalloc_root_mutex);
 	return ret;
 }
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 5/9] btrfs: remove delalloc_root_mutex
  2025-06-27  9:19 ` [PATCH RFC 5/9] btrfs: remove delalloc_root_mutex Johannes Thumshirn
@ 2025-06-27 12:42   ` Filipe Manana
  0 siblings, 0 replies; 25+ messages in thread
From: Filipe Manana @ 2025-06-27 12:42 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
	Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn

On Fri, Jun 27, 2025 at 10:23 AM Johannes Thumshirn <jth@kernel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> When benchmarking garbage collection on zoned BTRFS filesystems on ZNS
> drives, we regularly observe hung_task messages like the following:
>
> INFO: task kworker/u132:2:297 blocked for more than 122 seconds.
>        Not tainted 6.16.0-rc1+ #1225
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  task:kworker/u132:2  state:D stack:0     pid:297   tgid:297   ppid:2      task_flags:0x4208060 flags:0x00004000
>  Workqueue: events_unbound btrfs_preempt_reclaim_metadata_space
>  Call Trace:
>   <TASK>
>   __schedule+0x2f9/0x7b0
>   schedule+0x27/0x80
>   schedule_preempt_disabled+0x15/0x30
>   __mutex_lock.constprop.0+0x4af/0x890
>   ? srso_return_thunk+0x5/0x5f
>   btrfs_start_delalloc_roots+0x8a/0x290
>   ? timerqueue_del+0x2e/0x60
>   shrink_delalloc+0x10c/0x2d0
>   ? srso_return_thunk+0x5/0x5f
>   ? psi_group_change+0x19e/0x460
>   ? srso_return_thunk+0x5/0x5f
>   ? btrfs_reduce_alloc_profile+0x9a/0x1d0
>   flush_space+0x202/0x280
>   ? srso_return_thunk+0x5/0x5f
>   ? need_preemptive_reclaim+0xaa/0x190
>   btrfs_preempt_reclaim_metadata_space+0xe7/0x340
>   process_one_work+0x192/0x350
>   worker_thread+0x25a/0x3a0
>   ? __pfx_worker_thread+0x10/0x10
>   kthread+0xfc/0x240
>   ? __pfx_kthread+0x10/0x10
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x152/0x180
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  INFO: task kworker/u132:2:297 is blocked on a mutex likely owned by task kworker/u129:0:2359.
>  task:kworker/u129:0  state:R  running task     stack:0     pid:2359  tgid:2359  ppid:2
>
> The affected tasks are blocked on 'struct btrfs_fs_info::delalloc_root_mutex',
> a global lock that serializes entry into btrfs_start_delalloc_roots().
> This lock was introduced in commit 573bfb72f760 ("Btrfs: fix possible
> empty list access when flushing the delalloc inodes") but without a
> clear justification for its necessity.
>
> However, the condition it was meant to protect against—a possibly empty
> list access—is already safely handled by 'list_splice_init()', which
> does nothing when the source list is empty.
>
> There are no known concurrency issues in btrfs_start_delalloc_roots()
> that require serialization via this mutex. All critical regions are
> either covered by per-root locking or operate on safely isolated lists.

Nop... see comments further below.

>
> Removing the lock eliminates the observed hangs and improves metadata
> GC throughput, particularly on systems with high concurrency like
> ZNS-based deployments.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/disk-io.c | 1 -
>  fs/btrfs/fs.h      | 1 -
>  fs/btrfs/inode.c   | 2 --
>  3 files changed, 4 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 35cd38de7727..929f39886b0e 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2795,7 +2795,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>         mutex_init(&fs_info->unused_bg_unpin_mutex);
>         mutex_init(&fs_info->reclaim_bgs_lock);
>         mutex_init(&fs_info->reloc_mutex);
> -       mutex_init(&fs_info->delalloc_root_mutex);
>         mutex_init(&fs_info->zoned_meta_io_lock);
>         mutex_init(&fs_info->zoned_data_reloc_io_lock);
>         seqlock_init(&fs_info->profiles_lock);
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index a388af40a251..04ebc976f841 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -606,7 +606,6 @@ struct btrfs_fs_info {
>          */
>         struct list_head ordered_roots;
>
> -       struct mutex delalloc_root_mutex;
>         spinlock_t delalloc_root_lock;
>         /* All fs/file tree roots that have delalloc inodes. */
>         struct list_head delalloc_roots;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 80c72c594b19..d68f4ef61c43 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8766,7 +8766,6 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
>         if (BTRFS_FS_ERROR(fs_info))
>                 return -EROFS;
>
> -       mutex_lock(&fs_info->delalloc_root_mutex);
>         spin_lock(&fs_info->delalloc_root_lock);
>         list_splice_init(&fs_info->delalloc_roots, &splice);
>         while (!list_empty(&splice)) {
> @@ -8800,7 +8799,6 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
>                 list_splice_tail(&splice, &fs_info->delalloc_roots);
>                 spin_unlock(&fs_info->delalloc_root_lock);
>         }
> -       mutex_unlock(&fs_info->delalloc_root_mutex);

The lock is useful and exists to make sure two tasks calling this
function wait for all dealloc to be flushed and ordered extents to
complete for all inodes from all roots.

The problem is similar to the one I pointed out for the next patch,
but perhaps a bit more subtle.

So after applying this patch:

1) Task A enters btrfs_start_delalloc_roots() and takes the spinlock
fs_info->delalloc_root_lock;

2) Task A splices the fs_info->delalloc_roots list into the local
splice list. The list has two roots, root X and root Y;

3) Task  A enters the first iteration of the while loop, extracts root
X from the split list, grabs a reference for it, adds it back to the
fs_info->delalloc_roots list and unlocks fs_info->delalloc_root_lock;

4) Task B enters btrfs_start_delalloc_roots(), takes the
fs_info->delalloc_root_lock lock;

5) Task B splices the fs_info->delalloc_roots list into the local
splice list - the list only contains root X -> root Y is held only the
splice list from task A.

As a consequence task B will never wait for writeback and ordered
extents from inodes from root Y to complete.
Therefore breaking the expected semantics of btrfs_start_delalloc_roots().

Thanks.



>         return ret;
>  }

>
> --
> 2.49.0
>
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH RFC 6/9] btrfs: remove btrfs_root's delalloc_mutex
  2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
                   ` (4 preceding siblings ...)
  2025-06-27  9:19 ` [PATCH RFC 5/9] btrfs: remove delalloc_root_mutex Johannes Thumshirn
@ 2025-06-27  9:19 ` Johannes Thumshirn
  2025-06-27 12:30   ` Filipe Manana
  2025-06-27  9:19 ` [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level Johannes Thumshirn
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

When running metadata space reclaim under high I/O concurrency, we observe
hung tasks caused by lock contention on `btrfs_root::delalloc_mutex`. For
example:

  INFO: task kworker/u132:1:2177 blocked for more than 122 seconds.
        Not tainted 6.16.0-rc3+ #1246
  Workqueue: events_unbound btrfs_preempt_reclaim_metadata_space
  Call Trace:
    __schedule+0x2f9/0x7b0
    schedule+0x27/0x80
    __mutex_lock.constprop.0+0x4af/0x890
    start_delalloc_inodes+0x6e/0x400
    btrfs_start_delalloc_roots+0x162/0x270
    shrink_delalloc+0x10c/0x2d0
    flush_space+0x202/0x280
    btrfs_preempt_reclaim_metadata_space+0xe7/0x340

The `delalloc_mutex` serializes delalloc flushing per root but is no
longer necessary. All critical paths (inode flushing, extent writing,
metadata updates) are already synchronized using finer-grained locking at
the inode, page, and tree levels. In particular, concurrent flushers
coordinate via inode locking, and no shared state requires global
serialization across the root.

Removing this mutex avoids unnecessary blocking in reclaim paths and
improves responsiveness under pressure, especially on systems with many
flushers or multi-queue SSDs/ZNS devices.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/ctree.h   | 1 -
 fs/btrfs/disk-io.c | 1 -
 fs/btrfs/inode.c   | 2 --
 3 files changed, 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8a54a0b6e502..06c7742a5de0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -238,7 +238,6 @@ struct btrfs_root {
 	spinlock_t root_item_lock;
 	refcount_t refs;
 
-	struct mutex delalloc_mutex;
 	spinlock_t delalloc_lock;
 	/*
 	 * all of the inodes that have delalloc bytes.  It is possible for
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 929f39886b0e..e39f5e893312 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -678,7 +678,6 @@ static struct btrfs_root *btrfs_alloc_root(struct btrfs_fs_info *fs_info,
 	mutex_init(&root->objectid_mutex);
 	mutex_init(&root->log_mutex);
 	mutex_init(&root->ordered_extent_mutex);
-	mutex_init(&root->delalloc_mutex);
 	init_waitqueue_head(&root->qgroup_flush_wait);
 	init_waitqueue_head(&root->log_writer_wait);
 	init_waitqueue_head(&root->log_commit_wait[0]);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d68f4ef61c43..b9c52b9ea912 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8673,7 +8673,6 @@ static int start_delalloc_inodes(struct btrfs_root *root,
 	int ret = 0;
 	bool full_flush = wbc->nr_to_write == LONG_MAX;
 
-	mutex_lock(&root->delalloc_mutex);
 	spin_lock(&root->delalloc_lock);
 	list_splice_init(&root->delalloc_inodes, &splice);
 	while (!list_empty(&splice)) {
@@ -8730,7 +8729,6 @@ static int start_delalloc_inodes(struct btrfs_root *root,
 		list_splice_tail(&splice, &root->delalloc_inodes);
 		spin_unlock(&root->delalloc_lock);
 	}
-	mutex_unlock(&root->delalloc_mutex);
 	return ret;
 }
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 6/9] btrfs: remove btrfs_root's delalloc_mutex
  2025-06-27  9:19 ` [PATCH RFC 6/9] btrfs: remove btrfs_root's delalloc_mutex Johannes Thumshirn
@ 2025-06-27 12:30   ` Filipe Manana
  0 siblings, 0 replies; 25+ messages in thread
From: Filipe Manana @ 2025-06-27 12:30 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
	Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn

On Fri, Jun 27, 2025 at 10:23 AM Johannes Thumshirn <jth@kernel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> When running metadata space reclaim under high I/O concurrency, we observe
> hung tasks caused by lock contention on `btrfs_root::delalloc_mutex`. For
> example:
>
>   INFO: task kworker/u132:1:2177 blocked for more than 122 seconds.
>         Not tainted 6.16.0-rc3+ #1246
>   Workqueue: events_unbound btrfs_preempt_reclaim_metadata_space
>   Call Trace:
>     __schedule+0x2f9/0x7b0
>     schedule+0x27/0x80
>     __mutex_lock.constprop.0+0x4af/0x890
>     start_delalloc_inodes+0x6e/0x400
>     btrfs_start_delalloc_roots+0x162/0x270
>     shrink_delalloc+0x10c/0x2d0
>     flush_space+0x202/0x280
>     btrfs_preempt_reclaim_metadata_space+0xe7/0x340
>
> The `delalloc_mutex` serializes delalloc flushing per root but is no
> longer necessary. All critical paths (inode flushing, extent writing,
> metadata updates) are already synchronized using finer-grained locking at
> the inode, page, and tree levels. In particular, concurrent flushers
> coordinate via inode locking, and no shared state requires global
> serialization across the root.

Well that's not enough...
The mutex is there to ensure that if two (or more) tasks call
start_delalloc_inodes(), they only return after all IO is flushed and
ordered extents completed.
Without the mutex we break the semantics.

For example, after removing the mutex as you propose here, we have:

1) Task A enters start_delalloc_inodes() and takes the spinlock
root->delalloc_lock;

2) Task B also enters start_delalloc_inodes() and spins on
root->delalloc_lock because it's currently held by task A;

3) Task A extracts inode X from the local split list, grabs a
reference for it and unlocks root->delalloc_lock;

4) Task B takes the root->delalloc_lock and sees an empty
root->delalloc_inodes list, so it will never wait for all the
writeback and ordered extents from inode X to complete, and return
before they complete.

You see, inode locks, page locks, etc, don't even enter the equation
here and are irrelevant.


>
> Removing this mutex avoids unnecessary blocking in reclaim paths and
> improves responsiveness under pressure, especially on systems with many
> flushers or multi-queue SSDs/ZNS devices.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Duplicated SoB tags.

Thanks.

> ---
>  fs/btrfs/ctree.h   | 1 -
>  fs/btrfs/disk-io.c | 1 -
>  fs/btrfs/inode.c   | 2 --
>  3 files changed, 4 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 8a54a0b6e502..06c7742a5de0 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -238,7 +238,6 @@ struct btrfs_root {
>         spinlock_t root_item_lock;
>         refcount_t refs;
>
> -       struct mutex delalloc_mutex;
>         spinlock_t delalloc_lock;
>         /*
>          * all of the inodes that have delalloc bytes.  It is possible for
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 929f39886b0e..e39f5e893312 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -678,7 +678,6 @@ static struct btrfs_root *btrfs_alloc_root(struct btrfs_fs_info *fs_info,
>         mutex_init(&root->objectid_mutex);
>         mutex_init(&root->log_mutex);
>         mutex_init(&root->ordered_extent_mutex);
> -       mutex_init(&root->delalloc_mutex);
>         init_waitqueue_head(&root->qgroup_flush_wait);
>         init_waitqueue_head(&root->log_writer_wait);
>         init_waitqueue_head(&root->log_commit_wait[0]);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d68f4ef61c43..b9c52b9ea912 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8673,7 +8673,6 @@ static int start_delalloc_inodes(struct btrfs_root *root,
>         int ret = 0;
>         bool full_flush = wbc->nr_to_write == LONG_MAX;
>
> -       mutex_lock(&root->delalloc_mutex);
>         spin_lock(&root->delalloc_lock);
>         list_splice_init(&root->delalloc_inodes, &splice);
>         while (!list_empty(&splice)) {
> @@ -8730,7 +8729,6 @@ static int start_delalloc_inodes(struct btrfs_root *root,
>                 list_splice_tail(&splice, &root->delalloc_inodes);
>                 spin_unlock(&root->delalloc_lock);
>         }
> -       mutex_unlock(&root->delalloc_mutex);
>         return ret;
>  }
>
> --
> 2.49.0
>
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level
  2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
                   ` (5 preceding siblings ...)
  2025-06-27  9:19 ` [PATCH RFC 6/9] btrfs: remove btrfs_root's delalloc_mutex Johannes Thumshirn
@ 2025-06-27  9:19 ` Johannes Thumshirn
  2025-06-27 11:35   ` Christoph Hellwig
  2025-06-27 23:24   ` kernel test robot
  2025-06-27  9:19 ` [PATCH RFC 8/9] btrfs: lower log level of relocation messages Johannes Thumshirn
  2025-06-27  9:19 ` [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure Johannes Thumshirn
  8 siblings, 2 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

When running a system with automatic reclaim/balancing enabled, there are
lots of info level messages like the following in the kernel log:

 BTRFS info (device nvme2n1): reclaiming chunk 1138166333440 with 10% used 0% reserved 89% unusable

Lower the log level to debug for these messages.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/block-group.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 00e567a4cd16..5e6aead653c4 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1963,7 +1963,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
 		reserved = bg->reserved;
 		spin_unlock(&bg->lock);
 
-		btrfs_info(fs_info,
+		btrfs_debug(fs_info,
 	"reclaiming chunk %llu with %llu%% used %llu%% reserved %llu%% unusable",
 				bg->start,
 				div64_u64(used * 100, bg->length),
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level
  2025-06-27  9:19 ` [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level Johannes Thumshirn
@ 2025-06-27 11:35   ` Christoph Hellwig
  2025-06-27 23:24   ` kernel test robot
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2025-06-27 11:35 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
	Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn

On Fri, Jun 27, 2025 at 11:19:12AM +0200, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> When running a system with automatic reclaim/balancing enabled, there are
> lots of info level messages like the following in the kernel log:
> 
>  BTRFS info (device nvme2n1): reclaiming chunk 1138166333440 with 10% used 0% reserved 89% unusable
> 
> Lower the log level to debug for these messages.

Wouldn't a trace even be an even better choice for this?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level
  2025-06-27  9:19 ` [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level Johannes Thumshirn
  2025-06-27 11:35   ` Christoph Hellwig
@ 2025-06-27 23:24   ` kernel test robot
  1 sibling, 0 replies; 25+ messages in thread
From: kernel test robot @ 2025-06-27 23:24 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: llvm, oe-kbuild-all

Hi Johannes,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on next-20250626]
[cannot apply to kdave/for-next v6.16-rc3 v6.16-rc2 v6.16-rc1 linus/master v6.16-rc3]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Johannes-Thumshirn/btrfs-zoned-do-not-select-metadata-BG-as-finish-target/20250627-172551
base:   next-20250626
patch link:    https://lore.kernel.org/r/20250627091914.100715-8-jth%40kernel.org
patch subject: [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level
config: x86_64-buildonly-randconfig-002-20250628 (https://download.01.org/0day-ci/archive/20250628/202506280733.ut2JoUtS-lkp@intel.com/config)
compiler: clang version 20.1.7 (https://github.com/llvm/llvm-project 6146a88f60492b520a36f8f8f3231e15f3cc6082)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250628/202506280733.ut2JoUtS-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506280733.ut2JoUtS-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> fs/btrfs/block-group.c:1846:7: warning: variable 'zone_unusable' set but not used [-Wunused-but-set-variable]
    1846 |                 u64 zone_unusable;
         |                     ^
   1 warning generated.


vim +/zone_unusable +1846 fs/btrfs/block-group.c

81531225e5bd50c Boris Burkov       2022-10-13  1803  
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1804  void btrfs_reclaim_bgs_work(struct work_struct *work)
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1805  {
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1806  	struct btrfs_fs_info *fs_info =
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1807  		container_of(work, struct btrfs_fs_info, reclaim_bgs_work);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1808  	struct btrfs_block_group *bg;
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1809  	struct btrfs_space_info *space_info;
4eb4e85c4f81849 Boris Burkov       2024-06-07  1810  	LIST_HEAD(retry_list);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1811  
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1812  	if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1813  		return;
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1814  
2f12741f81af638 Josef Bacik        2022-07-15  1815  	if (btrfs_fs_closing(fs_info))
2f12741f81af638 Josef Bacik        2022-07-15  1816  		return;
2f12741f81af638 Josef Bacik        2022-07-15  1817  
3687fcb0752ac9c Johannes Thumshirn 2022-03-29  1818  	if (!btrfs_should_reclaim(fs_info))
3687fcb0752ac9c Johannes Thumshirn 2022-03-29  1819  		return;
3687fcb0752ac9c Johannes Thumshirn 2022-03-29  1820  
ca5e4ea0beaec8b Naohiro Aota       2022-02-18  1821  	sb_start_write(fs_info->sb);
ca5e4ea0beaec8b Naohiro Aota       2022-02-18  1822  
ca5e4ea0beaec8b Naohiro Aota       2022-02-18  1823  	if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE)) {
ca5e4ea0beaec8b Naohiro Aota       2022-02-18  1824  		sb_end_write(fs_info->sb);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1825  		return;
ca5e4ea0beaec8b Naohiro Aota       2022-02-18  1826  	}
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1827  
9cc0b837e14ae91 Johannes Thumshirn 2021-07-06  1828  	/*
9cc0b837e14ae91 Johannes Thumshirn 2021-07-06  1829  	 * Long running balances can keep us blocked here for eternity, so
9cc0b837e14ae91 Johannes Thumshirn 2021-07-06  1830  	 * simply skip reclaim if we're unable to get the mutex.
9cc0b837e14ae91 Johannes Thumshirn 2021-07-06  1831  	 */
9cc0b837e14ae91 Johannes Thumshirn 2021-07-06  1832  	if (!mutex_trylock(&fs_info->reclaim_bgs_lock)) {
9cc0b837e14ae91 Johannes Thumshirn 2021-07-06  1833  		btrfs_exclop_finish(fs_info);
ca5e4ea0beaec8b Naohiro Aota       2022-02-18  1834  		sb_end_write(fs_info->sb);
9cc0b837e14ae91 Johannes Thumshirn 2021-07-06  1835  		return;
9cc0b837e14ae91 Johannes Thumshirn 2021-07-06  1836  	}
9cc0b837e14ae91 Johannes Thumshirn 2021-07-06  1837  
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1838  	spin_lock(&fs_info->unused_bgs_lock);
2ca0ec770c62b32 Johannes Thumshirn 2021-10-14  1839  	/*
2ca0ec770c62b32 Johannes Thumshirn 2021-10-14  1840  	 * Sort happens under lock because we can't simply splice it and sort.
2ca0ec770c62b32 Johannes Thumshirn 2021-10-14  1841  	 * The block groups might still be in use and reachable via bg_list,
2ca0ec770c62b32 Johannes Thumshirn 2021-10-14  1842  	 * and their presence in the reclaim_bgs list must be preserved.
2ca0ec770c62b32 Johannes Thumshirn 2021-10-14  1843  	 */
2ca0ec770c62b32 Johannes Thumshirn 2021-10-14  1844  	list_sort(NULL, &fs_info->reclaim_bgs, reclaim_bgs_cmp);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1845  	while (!list_empty(&fs_info->reclaim_bgs)) {
5f93e776c6734ce Johannes Thumshirn 2021-06-29 @1846  		u64 zone_unusable;
ba5d06440cae63e Filipe Manana      2025-02-24  1847  		u64 used;
620768704326c9a Filipe Manana      2025-02-24  1848  		u64 reserved;
1cea5cf0e664290 Filipe Manana      2021-06-21  1849  		int ret = 0;
1cea5cf0e664290 Filipe Manana      2021-06-21  1850  
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1851  		bg = list_first_entry(&fs_info->reclaim_bgs,
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1852  				      struct btrfs_block_group,
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1853  				      bg_list);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1854  		list_del_init(&bg->bg_list);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1855  
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1856  		space_info = bg->space_info;
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1857  		spin_unlock(&fs_info->unused_bgs_lock);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1858  
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1859  		/* Don't race with allocators so take the groups_sem */
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1860  		down_write(&space_info->groups_sem);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1861  
f5ff64ccf7bb727 Boris Burkov       2024-02-02  1862  		spin_lock(&space_info->lock);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1863  		spin_lock(&bg->lock);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1864  		if (bg->reserved || bg->pinned || bg->ro) {
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1865  			/*
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1866  			 * We want to bail if we made new allocations or have
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1867  			 * outstanding allocations in this block group.  We do
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1868  			 * the ro check in case balance is currently acting on
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1869  			 * this block group.
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1870  			 */
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1871  			spin_unlock(&bg->lock);
f5ff64ccf7bb727 Boris Burkov       2024-02-02  1872  			spin_unlock(&space_info->lock);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1873  			up_write(&space_info->groups_sem);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1874  			goto next;
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1875  		}
cc4804bfd6392bc Boris Burkov       2022-10-13  1876  		if (bg->used == 0) {
cc4804bfd6392bc Boris Burkov       2022-10-13  1877  			/*
cc4804bfd6392bc Boris Burkov       2022-10-13  1878  			 * It is possible that we trigger relocation on a block
cc4804bfd6392bc Boris Burkov       2022-10-13  1879  			 * group as its extents are deleted and it first goes
cc4804bfd6392bc Boris Burkov       2022-10-13  1880  			 * below the threshold, then shortly after goes empty.
cc4804bfd6392bc Boris Burkov       2022-10-13  1881  			 *
cc4804bfd6392bc Boris Burkov       2022-10-13  1882  			 * In this case, relocating it does delete it, but has
cc4804bfd6392bc Boris Burkov       2022-10-13  1883  			 * some overhead in relocation specific metadata, looking
cc4804bfd6392bc Boris Burkov       2022-10-13  1884  			 * for the non-existent extents and running some extra
cc4804bfd6392bc Boris Burkov       2022-10-13  1885  			 * transactions, which we can avoid by using one of the
cc4804bfd6392bc Boris Burkov       2022-10-13  1886  			 * other mechanisms for dealing with empty block groups.
cc4804bfd6392bc Boris Burkov       2022-10-13  1887  			 */
cc4804bfd6392bc Boris Burkov       2022-10-13  1888  			if (!btrfs_test_opt(fs_info, DISCARD_ASYNC))
cc4804bfd6392bc Boris Burkov       2022-10-13  1889  				btrfs_mark_bg_unused(bg);
cc4804bfd6392bc Boris Burkov       2022-10-13  1890  			spin_unlock(&bg->lock);
f5ff64ccf7bb727 Boris Burkov       2024-02-02  1891  			spin_unlock(&space_info->lock);
cc4804bfd6392bc Boris Burkov       2022-10-13  1892  			up_write(&space_info->groups_sem);
cc4804bfd6392bc Boris Burkov       2022-10-13  1893  			goto next;
81531225e5bd50c Boris Burkov       2022-10-13  1894  
81531225e5bd50c Boris Burkov       2022-10-13  1895  		}
81531225e5bd50c Boris Burkov       2022-10-13  1896  		/*
81531225e5bd50c Boris Burkov       2022-10-13  1897  		 * The block group might no longer meet the reclaim condition by
81531225e5bd50c Boris Burkov       2022-10-13  1898  		 * the time we get around to reclaiming it, so to avoid
81531225e5bd50c Boris Burkov       2022-10-13  1899  		 * reclaiming overly full block_groups, skip reclaiming them.
81531225e5bd50c Boris Burkov       2022-10-13  1900  		 *
81531225e5bd50c Boris Burkov       2022-10-13  1901  		 * Since the decision making process also depends on the amount
81531225e5bd50c Boris Burkov       2022-10-13  1902  		 * being freed, pass in a fake giant value to skip that extra
81531225e5bd50c Boris Burkov       2022-10-13  1903  		 * check, which is more meaningful when adding to the list in
81531225e5bd50c Boris Burkov       2022-10-13  1904  		 * the first place.
81531225e5bd50c Boris Burkov       2022-10-13  1905  		 */
81531225e5bd50c Boris Burkov       2022-10-13  1906  		if (!should_reclaim_block_group(bg, bg->length)) {
81531225e5bd50c Boris Burkov       2022-10-13  1907  			spin_unlock(&bg->lock);
f5ff64ccf7bb727 Boris Burkov       2024-02-02  1908  			spin_unlock(&space_info->lock);
81531225e5bd50c Boris Burkov       2022-10-13  1909  			up_write(&space_info->groups_sem);
81531225e5bd50c Boris Burkov       2022-10-13  1910  			goto next;
cc4804bfd6392bc Boris Burkov       2022-10-13  1911  		}
1283b8c125a83bf Filipe Manana      2025-02-21  1912  
1283b8c125a83bf Filipe Manana      2025-02-21  1913  		/*
1283b8c125a83bf Filipe Manana      2025-02-21  1914  		 * Cache the zone_unusable value before turning the block group
1283b8c125a83bf Filipe Manana      2025-02-21  1915  		 * to read only. As soon as the block group is read only it's
1283b8c125a83bf Filipe Manana      2025-02-21  1916  		 * zone_unusable value gets moved to the block group's read-only
1283b8c125a83bf Filipe Manana      2025-02-21  1917  		 * bytes and isn't available for calculations anymore. We also
1283b8c125a83bf Filipe Manana      2025-02-21  1918  		 * cache it before unlocking the block group, to prevent races
1283b8c125a83bf Filipe Manana      2025-02-21  1919  		 * (reports from KCSAN and such tools) with tasks updating it.
1283b8c125a83bf Filipe Manana      2025-02-21  1920  		 */
1283b8c125a83bf Filipe Manana      2025-02-21  1921  		zone_unusable = bg->zone_unusable;
1283b8c125a83bf Filipe Manana      2025-02-21  1922  
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1923  		spin_unlock(&bg->lock);
f5ff64ccf7bb727 Boris Burkov       2024-02-02  1924  		spin_unlock(&space_info->lock);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1925  
93463ff7b54626f Naohiro Aota       2023-06-06  1926  		/*
93463ff7b54626f Naohiro Aota       2023-06-06  1927  		 * Get out fast, in case we're read-only or unmounting the
93463ff7b54626f Naohiro Aota       2023-06-06  1928  		 * filesystem. It is OK to drop block groups from the list even
93463ff7b54626f Naohiro Aota       2023-06-06  1929  		 * for the read-only case. As we did sb_start_write(),
93463ff7b54626f Naohiro Aota       2023-06-06  1930  		 * "mount -o remount,ro" won't happen and read-only filesystem
93463ff7b54626f Naohiro Aota       2023-06-06  1931  		 * means it is forced read-only due to a fatal error. So, it
93463ff7b54626f Naohiro Aota       2023-06-06  1932  		 * never gets back to read-write to let us reclaim again.
93463ff7b54626f Naohiro Aota       2023-06-06  1933  		 */
93463ff7b54626f Naohiro Aota       2023-06-06  1934  		if (btrfs_need_cleaner_sleep(fs_info)) {
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1935  			up_write(&space_info->groups_sem);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1936  			goto next;
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1937  		}
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1938  
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1939  		ret = inc_block_group_ro(bg, 0);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1940  		up_write(&space_info->groups_sem);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1941  		if (ret < 0)
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1942  			goto next;
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1943  
ba5d06440cae63e Filipe Manana      2025-02-24  1944  		/*
620768704326c9a Filipe Manana      2025-02-24  1945  		 * The amount of bytes reclaimed corresponds to the sum of the
620768704326c9a Filipe Manana      2025-02-24  1946  		 * "used" and "reserved" counters. We have set the block group
620768704326c9a Filipe Manana      2025-02-24  1947  		 * to RO above, which prevents reservations from happening but
620768704326c9a Filipe Manana      2025-02-24  1948  		 * we may have existing reservations for which allocation has
620768704326c9a Filipe Manana      2025-02-24  1949  		 * not yet been done - btrfs_update_block_group() was not yet
620768704326c9a Filipe Manana      2025-02-24  1950  		 * called, which is where we will transfer a reserved extent's
620768704326c9a Filipe Manana      2025-02-24  1951  		 * size from the "reserved" counter to the "used" counter - this
620768704326c9a Filipe Manana      2025-02-24  1952  		 * happens when running delayed references. When we relocate the
620768704326c9a Filipe Manana      2025-02-24  1953  		 * chunk below, relocation first flushes dellaloc, waits for
620768704326c9a Filipe Manana      2025-02-24  1954  		 * ordered extent completion (which is where we create delayed
620768704326c9a Filipe Manana      2025-02-24  1955  		 * references for data extents) and commits the current
620768704326c9a Filipe Manana      2025-02-24  1956  		 * transaction (which runs delayed references), and only after
620768704326c9a Filipe Manana      2025-02-24  1957  		 * it does the actual work to move extents out of the block
620768704326c9a Filipe Manana      2025-02-24  1958  		 * group. So the reported amount of reclaimed bytes is
620768704326c9a Filipe Manana      2025-02-24  1959  		 * effectively the sum of the 'used' and 'reserved' counters.
ba5d06440cae63e Filipe Manana      2025-02-24  1960  		 */
ba5d06440cae63e Filipe Manana      2025-02-24  1961  		spin_lock(&bg->lock);
ba5d06440cae63e Filipe Manana      2025-02-24  1962  		used = bg->used;
620768704326c9a Filipe Manana      2025-02-24  1963  		reserved = bg->reserved;
ba5d06440cae63e Filipe Manana      2025-02-24  1964  		spin_unlock(&bg->lock);
ba5d06440cae63e Filipe Manana      2025-02-24  1965  
3ba0572b72b1363 Johannes Thumshirn 2025-06-27  1966  		btrfs_debug(fs_info,
620768704326c9a Filipe Manana      2025-02-24  1967  	"reclaiming chunk %llu with %llu%% used %llu%% reserved %llu%% unusable",
95cd356ca23c380 Johannes Thumshirn 2023-02-21  1968  				bg->start,
ba5d06440cae63e Filipe Manana      2025-02-24  1969  				div64_u64(used * 100, bg->length),
620768704326c9a Filipe Manana      2025-02-24  1970  				div64_u64(reserved * 100, bg->length),
5f93e776c6734ce Johannes Thumshirn 2021-06-29  1971  				div64_u64(zone_unusable * 100, bg->length));
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1972  		trace_btrfs_reclaim_block_group(bg);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1973  		ret = btrfs_relocate_chunk(fs_info, bg->start);
74944c873602a3e Josef Bacik        2022-07-25  1974  		if (ret) {
74944c873602a3e Josef Bacik        2022-07-25  1975  			btrfs_dec_block_group_ro(bg);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1976  			btrfs_err(fs_info, "error relocating chunk %llu",
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1977  				  bg->start);
ba5d06440cae63e Filipe Manana      2025-02-24  1978  			used = 0;
620768704326c9a Filipe Manana      2025-02-24  1979  			reserved = 0;
243192b6764990e Boris Burkov       2024-01-25  1980  			spin_lock(&space_info->lock);
243192b6764990e Boris Burkov       2024-01-25  1981  			space_info->reclaim_errors++;
813d4c642251649 Boris Burkov       2024-02-14  1982  			if (READ_ONCE(space_info->periodic_reclaim))
813d4c642251649 Boris Burkov       2024-02-14  1983  				space_info->periodic_reclaim_ready = false;
243192b6764990e Boris Burkov       2024-01-25  1984  			spin_unlock(&space_info->lock);
74944c873602a3e Josef Bacik        2022-07-25  1985  		}
243192b6764990e Boris Burkov       2024-01-25  1986  		spin_lock(&space_info->lock);
243192b6764990e Boris Burkov       2024-01-25  1987  		space_info->reclaim_count++;
ba5d06440cae63e Filipe Manana      2025-02-24  1988  		space_info->reclaim_bytes += used;
620768704326c9a Filipe Manana      2025-02-24  1989  		space_info->reclaim_bytes += reserved;
243192b6764990e Boris Burkov       2024-01-25  1990  		spin_unlock(&space_info->lock);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1991  
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  1992  next:
0497dfba98c00ed Boris Burkov       2025-03-05  1993  		if (ret && !READ_ONCE(space_info->periodic_reclaim))
0497dfba98c00ed Boris Burkov       2025-03-05  1994  			btrfs_link_bg_list(bg, &retry_list);
1cea5cf0e664290 Filipe Manana      2021-06-21  1995  		btrfs_put_block_group(bg);
3ed01616bad6c7e Naohiro Aota       2023-06-06  1996  
3ed01616bad6c7e Naohiro Aota       2023-06-06  1997  		mutex_unlock(&fs_info->reclaim_bgs_lock);
3ed01616bad6c7e Naohiro Aota       2023-06-06  1998  		/*
3ed01616bad6c7e Naohiro Aota       2023-06-06  1999  		 * Reclaiming all the block groups in the list can take really
3ed01616bad6c7e Naohiro Aota       2023-06-06  2000  		 * long.  Prioritize cleaning up unused block groups.
3ed01616bad6c7e Naohiro Aota       2023-06-06  2001  		 */
3ed01616bad6c7e Naohiro Aota       2023-06-06  2002  		btrfs_delete_unused_bgs(fs_info);
3ed01616bad6c7e Naohiro Aota       2023-06-06  2003  		/*
3ed01616bad6c7e Naohiro Aota       2023-06-06  2004  		 * If we are interrupted by a balance, we can just bail out. The
3ed01616bad6c7e Naohiro Aota       2023-06-06  2005  		 * cleaner thread restart again if necessary.
3ed01616bad6c7e Naohiro Aota       2023-06-06  2006  		 */
3ed01616bad6c7e Naohiro Aota       2023-06-06  2007  		if (!mutex_trylock(&fs_info->reclaim_bgs_lock))
3ed01616bad6c7e Naohiro Aota       2023-06-06  2008  			goto end;
d96b34248c2f4ea Filipe Manana      2021-11-22  2009  		spin_lock(&fs_info->unused_bgs_lock);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  2010  	}
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  2011  	spin_unlock(&fs_info->unused_bgs_lock);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  2012  	mutex_unlock(&fs_info->reclaim_bgs_lock);
3ed01616bad6c7e Naohiro Aota       2023-06-06  2013  end:
4eb4e85c4f81849 Boris Burkov       2024-06-07  2014  	spin_lock(&fs_info->unused_bgs_lock);
4eb4e85c4f81849 Boris Burkov       2024-06-07  2015  	list_splice_tail(&retry_list, &fs_info->reclaim_bgs);
4eb4e85c4f81849 Boris Burkov       2024-06-07  2016  	spin_unlock(&fs_info->unused_bgs_lock);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  2017  	btrfs_exclop_finish(fs_info);
ca5e4ea0beaec8b Naohiro Aota       2022-02-18  2018  	sb_end_write(fs_info->sb);
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  2019  }
18bb8bbf13c1839 Johannes Thumshirn 2021-04-19  2020  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH RFC 8/9] btrfs: lower log level of relocation messages
  2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
                   ` (6 preceding siblings ...)
  2025-06-27  9:19 ` [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level Johannes Thumshirn
@ 2025-06-27  9:19 ` Johannes Thumshirn
  2025-06-27 11:36   ` Christoph Hellwig
                     ` (2 more replies)
  2025-06-27  9:19 ` [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure Johannes Thumshirn
  8 siblings, 3 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

When running a system with automatic reclaim/balancing enabled, there are
lots of info level messages like the following in the kernel log:

 BTRFS info (device nvme2n1): relocating block group 629212708864 flags data
 BTRFS info (device nvme2n1): found 510 extents, stage: move data extents

Lower the log level to debug for these messages.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/relocation.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index d7ec1d72821c..46b9236708ed 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3892,7 +3892,7 @@ static void describe_relocation(struct btrfs_block_group *block_group)
 
 	btrfs_describe_block_groups(block_group->flags, buf, sizeof(buf));
 
-	btrfs_info(block_group->fs_info, "relocating block group %llu flags %s",
+	btrfs_debug(block_group->fs_info, "relocating block group %llu flags %s",
 		   block_group->start, buf);
 }
 
@@ -4044,7 +4044,7 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
 		if (rc->extents_found == 0)
 			break;
 
-		btrfs_info(fs_info, "found %llu extents, stage: %s",
+		btrfs_debug(fs_info, "found %llu extents, stage: %s",
 			   rc->extents_found, stage_to_string(finishes_stage));
 	}
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 8/9] btrfs: lower log level of relocation messages
  2025-06-27  9:19 ` [PATCH RFC 8/9] btrfs: lower log level of relocation messages Johannes Thumshirn
@ 2025-06-27 11:36   ` Christoph Hellwig
  2025-06-27 23:44   ` kernel test robot
  2025-06-30 17:12   ` David Sterba
  2 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2025-06-27 11:36 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
	Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn

On Fri, Jun 27, 2025 at 11:19:13AM +0200, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> When running a system with automatic reclaim/balancing enabled, there are
> lots of info level messages like the following in the kernel log:
> 
>  BTRFS info (device nvme2n1): relocating block group 629212708864 flags data
>  BTRFS info (device nvme2n1): found 510 extents, stage: move data extents
> 
> Lower the log level to debug for these messages.

Same here.  This is useful debug information, but I'd expect something
like this to be a trace event that is enabled as needed, and not
something going into the kernel log.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 8/9] btrfs: lower log level of relocation messages
  2025-06-27  9:19 ` [PATCH RFC 8/9] btrfs: lower log level of relocation messages Johannes Thumshirn
  2025-06-27 11:36   ` Christoph Hellwig
@ 2025-06-27 23:44   ` kernel test robot
  2025-06-30 17:12   ` David Sterba
  2 siblings, 0 replies; 25+ messages in thread
From: kernel test robot @ 2025-06-27 23:44 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: oe-kbuild-all

Hi Johannes,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on next-20250626]
[cannot apply to kdave/for-next v6.16-rc3 v6.16-rc2 v6.16-rc1 linus/master v6.16-rc3]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Johannes-Thumshirn/btrfs-zoned-do-not-select-metadata-BG-as-finish-target/20250627-172551
base:   next-20250626
patch link:    https://lore.kernel.org/r/20250627091914.100715-9-jth%40kernel.org
patch subject: [PATCH RFC 8/9] btrfs: lower log level of relocation messages
config: x86_64-buildonly-randconfig-003-20250628 (https://download.01.org/0day-ci/archive/20250628/202506280720.2zYUJWXx-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250628/202506280720.2zYUJWXx-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506280720.2zYUJWXx-lkp@intel.com/

All warnings (new ones prefixed by >>):

   fs/btrfs/relocation.c: In function 'btrfs_relocate_block_group':
   fs/btrfs/relocation.c:4013:34: warning: variable 'finishes_stage' set but not used [-Wunused-but-set-variable]
    4013 |                 enum reloc_stage finishes_stage;
         |                                  ^~~~~~~~~~~~~~
   fs/btrfs/relocation.c: At top level:
>> fs/btrfs/relocation.c:3899:20: warning: 'stage_to_string' defined but not used [-Wunused-function]
    3899 | static const char *stage_to_string(enum reloc_stage stage)
         |                    ^~~~~~~~~~~~~~~


vim +/stage_to_string +3899 fs/btrfs/relocation.c

ebce0e01b930bf Adam Borowski      2016-11-14  3898  
8daf07cf2b7919 David Sterba       2023-09-22 @3899  static const char *stage_to_string(enum reloc_stage stage)
430640e31649be Qu Wenruo          2019-11-29  3900  {
430640e31649be Qu Wenruo          2019-11-29  3901  	if (stage == MOVE_DATA_EXTENTS)
430640e31649be Qu Wenruo          2019-11-29  3902  		return "move data extents";
430640e31649be Qu Wenruo          2019-11-29  3903  	if (stage == UPDATE_DATA_PTRS)
430640e31649be Qu Wenruo          2019-11-29  3904  		return "update data pointers";
430640e31649be Qu Wenruo          2019-11-29  3905  	return "unknown";
430640e31649be Qu Wenruo          2019-11-29  3906  }
430640e31649be Qu Wenruo          2019-11-29  3907  
5d4f98a28c7d33 Yan Zheng          2009-06-10  3908  /*
5d4f98a28c7d33 Yan Zheng          2009-06-10  3909   * function to relocate all extents in a block group.
5d4f98a28c7d33 Yan Zheng          2009-06-10  3910   */
6bccf3ab1e1f09 Jeff Mahoney       2016-06-21  3911  int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
5d4f98a28c7d33 Yan Zheng          2009-06-10  3912  {
32da5386d9a4fd David Sterba       2019-10-29  3913  	struct btrfs_block_group *bg;
29cbcf401793f4 Josef Bacik        2021-11-05  3914  	struct btrfs_root *extent_root = btrfs_extent_root(fs_info, group_start);
5d4f98a28c7d33 Yan Zheng          2009-06-10  3915  	struct reloc_control *rc;
0af3d00bad38d3 Josef Bacik        2010-06-21  3916  	struct inode *inode;
0af3d00bad38d3 Josef Bacik        2010-06-21  3917  	struct btrfs_path *path;
5d4f98a28c7d33 Yan Zheng          2009-06-10  3918  	int ret;
f0486c68e4bd9a Yan, Zheng         2010-05-16  3919  	int rw = 0;
5d4f98a28c7d33 Yan Zheng          2009-06-10  3920  	int err = 0;
5d4f98a28c7d33 Yan Zheng          2009-06-10  3921  
b4be6aefa73c9a Josef Bacik        2022-02-18  3922  	/*
b4be6aefa73c9a Josef Bacik        2022-02-18  3923  	 * This only gets set if we had a half-deleted snapshot on mount.  We
b4be6aefa73c9a Josef Bacik        2022-02-18  3924  	 * cannot allow relocation to start while we're still trying to clean up
b4be6aefa73c9a Josef Bacik        2022-02-18  3925  	 * these pending deletions.
b4be6aefa73c9a Josef Bacik        2022-02-18  3926  	 */
b4be6aefa73c9a Josef Bacik        2022-02-18  3927  	ret = wait_on_bit(&fs_info->flags, BTRFS_FS_UNFINISHED_DROPS, TASK_INTERRUPTIBLE);
b4be6aefa73c9a Josef Bacik        2022-02-18  3928  	if (ret)
b4be6aefa73c9a Josef Bacik        2022-02-18  3929  		return ret;
b4be6aefa73c9a Josef Bacik        2022-02-18  3930  
b4be6aefa73c9a Josef Bacik        2022-02-18  3931  	/* We may have been woken up by close_ctree, so bail if we're closing. */
b4be6aefa73c9a Josef Bacik        2022-02-18  3932  	if (btrfs_fs_closing(fs_info))
b4be6aefa73c9a Josef Bacik        2022-02-18  3933  		return -EINTR;
b4be6aefa73c9a Josef Bacik        2022-02-18  3934  
eede2bf34f4fa8 Omar Sandoval      2016-11-03  3935  	bg = btrfs_lookup_block_group(fs_info, group_start);
eede2bf34f4fa8 Omar Sandoval      2016-11-03  3936  	if (!bg)
eede2bf34f4fa8 Omar Sandoval      2016-11-03  3937  		return -ENOENT;
eede2bf34f4fa8 Omar Sandoval      2016-11-03  3938  
0320b3538b2b81 Naohiro Aota       2022-03-29  3939  	/*
0320b3538b2b81 Naohiro Aota       2022-03-29  3940  	 * Relocation of a data block group creates ordered extents.  Without
0320b3538b2b81 Naohiro Aota       2022-03-29  3941  	 * sb_start_write(), we can freeze the filesystem while unfinished
0320b3538b2b81 Naohiro Aota       2022-03-29  3942  	 * ordered extents are left. Such ordered extents can cause a deadlock
0320b3538b2b81 Naohiro Aota       2022-03-29  3943  	 * e.g. when syncfs() is waiting for their completion but they can't
0320b3538b2b81 Naohiro Aota       2022-03-29  3944  	 * finish because they block when joining a transaction, due to the
0320b3538b2b81 Naohiro Aota       2022-03-29  3945  	 * fact that the freeze locks are being held in write mode.
0320b3538b2b81 Naohiro Aota       2022-03-29  3946  	 */
0320b3538b2b81 Naohiro Aota       2022-03-29  3947  	if (bg->flags & BTRFS_BLOCK_GROUP_DATA)
0320b3538b2b81 Naohiro Aota       2022-03-29  3948  		ASSERT(sb_write_started(fs_info->sb));
0320b3538b2b81 Naohiro Aota       2022-03-29  3949  
eede2bf34f4fa8 Omar Sandoval      2016-11-03  3950  	if (btrfs_pinned_by_swapfile(fs_info, bg)) {
eede2bf34f4fa8 Omar Sandoval      2016-11-03  3951  		btrfs_put_block_group(bg);
eede2bf34f4fa8 Omar Sandoval      2016-11-03  3952  		return -ETXTBSY;
eede2bf34f4fa8 Omar Sandoval      2016-11-03  3953  	}
eede2bf34f4fa8 Omar Sandoval      2016-11-03  3954  
c258d6e36442eb Qu Wenruo          2019-03-01  3955  	rc = alloc_reloc_control(fs_info);
eede2bf34f4fa8 Omar Sandoval      2016-11-03  3956  	if (!rc) {
eede2bf34f4fa8 Omar Sandoval      2016-11-03  3957  		btrfs_put_block_group(bg);
5d4f98a28c7d33 Yan Zheng          2009-06-10  3958  		return -ENOMEM;
eede2bf34f4fa8 Omar Sandoval      2016-11-03  3959  	}
5d4f98a28c7d33 Yan Zheng          2009-06-10  3960  
907d2710d72754 David Sterba       2021-05-18  3961  	ret = reloc_chunk_start(fs_info);
907d2710d72754 David Sterba       2021-05-18  3962  	if (ret < 0) {
907d2710d72754 David Sterba       2021-05-18  3963  		err = ret;
907d2710d72754 David Sterba       2021-05-18  3964  		goto out_put_bg;
907d2710d72754 David Sterba       2021-05-18  3965  	}
907d2710d72754 David Sterba       2021-05-18  3966  
f0486c68e4bd9a Yan, Zheng         2010-05-16  3967  	rc->extent_root = extent_root;
eede2bf34f4fa8 Omar Sandoval      2016-11-03  3968  	rc->block_group = bg;
5d4f98a28c7d33 Yan Zheng          2009-06-10  3969  
b12de52896c0e8 Qu Wenruo          2019-11-15  3970  	ret = btrfs_inc_block_group_ro(rc->block_group, true);
f0486c68e4bd9a Yan, Zheng         2010-05-16  3971  	if (ret) {
f0486c68e4bd9a Yan, Zheng         2010-05-16  3972  		err = ret;
f0486c68e4bd9a Yan, Zheng         2010-05-16  3973  		goto out;
f0486c68e4bd9a Yan, Zheng         2010-05-16  3974  	}
f0486c68e4bd9a Yan, Zheng         2010-05-16  3975  	rw = 1;
f0486c68e4bd9a Yan, Zheng         2010-05-16  3976  
0af3d00bad38d3 Josef Bacik        2010-06-21  3977  	path = btrfs_alloc_path();
0af3d00bad38d3 Josef Bacik        2010-06-21  3978  	if (!path) {
0af3d00bad38d3 Josef Bacik        2010-06-21  3979  		err = -ENOMEM;
0af3d00bad38d3 Josef Bacik        2010-06-21  3980  		goto out;
0af3d00bad38d3 Josef Bacik        2010-06-21  3981  	}
0af3d00bad38d3 Josef Bacik        2010-06-21  3982  
7949f3392ed65d David Sterba       2019-03-20  3983  	inode = lookup_free_space_inode(rc->block_group, path);
0af3d00bad38d3 Josef Bacik        2010-06-21  3984  	btrfs_free_path(path);
0af3d00bad38d3 Josef Bacik        2010-06-21  3985  
0af3d00bad38d3 Josef Bacik        2010-06-21  3986  	if (!IS_ERR(inode))
20faaab2c32f37 Filipe Manana      2025-03-07  3987  		ret = delete_block_group_cache(rc->block_group, inode, 0);
0af3d00bad38d3 Josef Bacik        2010-06-21  3988  	else
0af3d00bad38d3 Josef Bacik        2010-06-21  3989  		ret = PTR_ERR(inode);
0af3d00bad38d3 Josef Bacik        2010-06-21  3990  
0af3d00bad38d3 Josef Bacik        2010-06-21  3991  	if (ret && ret != -ENOENT) {
0af3d00bad38d3 Josef Bacik        2010-06-21  3992  		err = ret;
0af3d00bad38d3 Josef Bacik        2010-06-21  3993  		goto out;
0af3d00bad38d3 Josef Bacik        2010-06-21  3994  	}
0af3d00bad38d3 Josef Bacik        2010-06-21  3995  
f75a043737ecf1 Filipe Manana      2025-03-07  3996  	rc->data_inode = create_reloc_inode(rc->block_group);
5d4f98a28c7d33 Yan Zheng          2009-06-10  3997  	if (IS_ERR(rc->data_inode)) {
5d4f98a28c7d33 Yan Zheng          2009-06-10  3998  		err = PTR_ERR(rc->data_inode);
5d4f98a28c7d33 Yan Zheng          2009-06-10  3999  		rc->data_inode = NULL;
5d4f98a28c7d33 Yan Zheng          2009-06-10  4000  		goto out;
5d4f98a28c7d33 Yan Zheng          2009-06-10  4001  	}
5d4f98a28c7d33 Yan Zheng          2009-06-10  4002  
17a21d79149b24 Johannes Thumshirn 2024-06-05  4003  	describe_relocation(rc->block_group);
5d4f98a28c7d33 Yan Zheng          2009-06-10  4004  
9cfa3e34e20e67 Filipe Manana      2016-04-26  4005  	btrfs_wait_block_group_reservations(rc->block_group);
f78c436c3931e7 Filipe Manana      2016-05-09  4006  	btrfs_wait_nocow_writers(rc->block_group);
42317ab440c110 David Sterba       2024-05-14  4007  	btrfs_wait_ordered_roots(fs_info, U64_MAX, rc->block_group);
5d4f98a28c7d33 Yan Zheng          2009-06-10  4008  
7ae9bd18032e81 Naohiro Aota       2021-08-19  4009  	ret = btrfs_zone_finish(rc->block_group);
7ae9bd18032e81 Naohiro Aota       2021-08-19  4010  	WARN_ON(ret && ret != -EAGAIN);
7ae9bd18032e81 Naohiro Aota       2021-08-19  4011  
5d4f98a28c7d33 Yan Zheng          2009-06-10  4012  	while (1) {
8daf07cf2b7919 David Sterba       2023-09-22 @4013  		enum reloc_stage finishes_stage;
430640e31649be Qu Wenruo          2019-11-29  4014  
76dda93c6ae2c1 Yan, Zheng         2009-09-21  4015  		mutex_lock(&fs_info->cleaner_mutex);
5d4f98a28c7d33 Yan Zheng          2009-06-10  4016  		ret = relocate_block_group(rc);
76dda93c6ae2c1 Yan, Zheng         2009-09-21  4017  		mutex_unlock(&fs_info->cleaner_mutex);
ff612ba7849964 Josef Bacik        2019-02-25  4018  		if (ret < 0)
5d4f98a28c7d33 Yan Zheng          2009-06-10  4019  			err = ret;
5d4f98a28c7d33 Yan Zheng          2009-06-10  4020  
430640e31649be Qu Wenruo          2019-11-29  4021  		finishes_stage = rc->stage;
ff612ba7849964 Josef Bacik        2019-02-25  4022  		/*
ff612ba7849964 Josef Bacik        2019-02-25  4023  		 * We may have gotten ENOSPC after we already dirtied some
ff612ba7849964 Josef Bacik        2019-02-25  4024  		 * extents.  If writeout happens while we're relocating a
ff612ba7849964 Josef Bacik        2019-02-25  4025  		 * different block group we could end up hitting the
ff612ba7849964 Josef Bacik        2019-02-25  4026  		 * BUG_ON(rc->stage == UPDATE_DATA_PTRS) in
ff612ba7849964 Josef Bacik        2019-02-25  4027  		 * btrfs_reloc_cow_block.  Make sure we write everything out
ff612ba7849964 Josef Bacik        2019-02-25  4028  		 * properly so we don't trip over this problem, and then break
ff612ba7849964 Josef Bacik        2019-02-25  4029  		 * out of the loop if we hit an error.
ff612ba7849964 Josef Bacik        2019-02-25  4030  		 */
5d4f98a28c7d33 Yan Zheng          2009-06-10  4031  		if (rc->stage == MOVE_DATA_EXTENTS && rc->found_file_extent) {
e641e323abb3ce Filipe Manana      2024-05-18  4032  			ret = btrfs_wait_ordered_range(BTRFS_I(rc->data_inode), 0,
0ef8b726075aa6 Josef Bacik        2013-10-25  4033  						       (u64)-1);
ff612ba7849964 Josef Bacik        2019-02-25  4034  			if (ret)
0ef8b726075aa6 Josef Bacik        2013-10-25  4035  				err = ret;
5d4f98a28c7d33 Yan Zheng          2009-06-10  4036  			invalidate_mapping_pages(rc->data_inode->i_mapping,
5d4f98a28c7d33 Yan Zheng          2009-06-10  4037  						 0, -1);
5d4f98a28c7d33 Yan Zheng          2009-06-10  4038  			rc->stage = UPDATE_DATA_PTRS;
5d4f98a28c7d33 Yan Zheng          2009-06-10  4039  		}
ff612ba7849964 Josef Bacik        2019-02-25  4040  
ff612ba7849964 Josef Bacik        2019-02-25  4041  		if (err < 0)
ff612ba7849964 Josef Bacik        2019-02-25  4042  			goto out;
ff612ba7849964 Josef Bacik        2019-02-25  4043  
ff612ba7849964 Josef Bacik        2019-02-25  4044  		if (rc->extents_found == 0)
ff612ba7849964 Josef Bacik        2019-02-25  4045  			break;
ff612ba7849964 Josef Bacik        2019-02-25  4046  
29dfe9961e3037 Johannes Thumshirn 2025-06-27  4047  		btrfs_debug(fs_info, "found %llu extents, stage: %s",
430640e31649be Qu Wenruo          2019-11-29  4048  			   rc->extents_found, stage_to_string(finishes_stage));
5d4f98a28c7d33 Yan Zheng          2009-06-10  4049  	}
5d4f98a28c7d33 Yan Zheng          2009-06-10  4050  
5d4f98a28c7d33 Yan Zheng          2009-06-10  4051  	WARN_ON(rc->block_group->pinned > 0);
5d4f98a28c7d33 Yan Zheng          2009-06-10  4052  	WARN_ON(rc->block_group->reserved > 0);
bf38be65f3703d David Sterba       2019-10-23  4053  	WARN_ON(rc->block_group->used > 0);
5d4f98a28c7d33 Yan Zheng          2009-06-10  4054  out:
f0486c68e4bd9a Yan, Zheng         2010-05-16  4055  	if (err && rw)
2ff7e61e0d30ff Jeff Mahoney       2016-06-22  4056  		btrfs_dec_block_group_ro(rc->block_group);
5d4f98a28c7d33 Yan Zheng          2009-06-10  4057  	iput(rc->data_inode);
907d2710d72754 David Sterba       2021-05-18  4058  out_put_bg:
907d2710d72754 David Sterba       2021-05-18  4059  	btrfs_put_block_group(bg);
907d2710d72754 David Sterba       2021-05-18  4060  	reloc_chunk_end(fs_info);
1a0afa0ecfc4db Josef Bacik        2020-03-04  4061  	free_reloc_control(rc);
5d4f98a28c7d33 Yan Zheng          2009-06-10  4062  	return err;
5d4f98a28c7d33 Yan Zheng          2009-06-10  4063  }
5d4f98a28c7d33 Yan Zheng          2009-06-10  4064  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 8/9] btrfs: lower log level of relocation messages
  2025-06-27  9:19 ` [PATCH RFC 8/9] btrfs: lower log level of relocation messages Johannes Thumshirn
  2025-06-27 11:36   ` Christoph Hellwig
  2025-06-27 23:44   ` kernel test robot
@ 2025-06-30 17:12   ` David Sterba
  2025-07-01  5:09     ` Johannes Thumshirn
  2 siblings, 1 reply; 25+ messages in thread
From: David Sterba @ 2025-06-30 17:12 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
	Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn

On Fri, Jun 27, 2025 at 11:19:13AM +0200, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> When running a system with automatic reclaim/balancing enabled, there are
> lots of info level messages like the following in the kernel log:
> 
>  BTRFS info (device nvme2n1): relocating block group 629212708864 flags data
>  BTRFS info (device nvme2n1): found 510 extents, stage: move data extents
> 
> Lower the log level to debug for these messages.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

I kind of like that the message is in the system log on the info level,
it's a high level operation and tracks the progress. Also it's been
there forever and I don't think I'm the only one used to seeing it
there. We have many info messages and vague guidelines when to use it,
but I think "once per block group" is still within the intentions.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 8/9] btrfs: lower log level of relocation messages
  2025-06-30 17:12   ` David Sterba
@ 2025-07-01  5:09     ` Johannes Thumshirn
  2025-07-01 14:43       ` David Sterba
  0 siblings, 1 reply; 25+ messages in thread
From: Johannes Thumshirn @ 2025-07-01  5:09 UTC (permalink / raw)
  To: dsterba@suse.cz, Johannes Thumshirn
  Cc: linux-btrfs@vger.kernel.org, Damien Le Moal, Naohiro Aota,
	David Sterba, Josef Bacik, Boris Burkov, Filipe Manana

On 30.06.25 19:12, David Sterba wrote:
> On Fri, Jun 27, 2025 at 11:19:13AM +0200, Johannes Thumshirn wrote:
>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> When running a system with automatic reclaim/balancing enabled, there are
>> lots of info level messages like the following in the kernel log:
>>
>>   BTRFS info (device nvme2n1): relocating block group 629212708864 flags data
>>   BTRFS info (device nvme2n1): found 510 extents, stage: move data extents
>>
>> Lower the log level to debug for these messages.
>>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> I kind of like that the message is in the system log on the info level,
> it's a high level operation and tracks the progress. Also it's been
> there forever and I don't think I'm the only one used to seeing it
> there. We have many info messages and vague guidelines when to use it,
> but I think "once per block group" is still within the intentions.
> 

Yes but now that automatic balancing is in place this is spamming all 
over dmesg.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 8/9] btrfs: lower log level of relocation messages
  2025-07-01  5:09     ` Johannes Thumshirn
@ 2025-07-01 14:43       ` David Sterba
  0 siblings, 0 replies; 25+ messages in thread
From: David Sterba @ 2025-07-01 14:43 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: dsterba@suse.cz, Johannes Thumshirn, linux-btrfs@vger.kernel.org,
	Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana

On Tue, Jul 01, 2025 at 05:09:06AM +0000, Johannes Thumshirn wrote:
> On 30.06.25 19:12, David Sterba wrote:
> > On Fri, Jun 27, 2025 at 11:19:13AM +0200, Johannes Thumshirn wrote:
> >> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> >>
> >> When running a system with automatic reclaim/balancing enabled, there are
> >> lots of info level messages like the following in the kernel log:
> >>
> >>   BTRFS info (device nvme2n1): relocating block group 629212708864 flags data
> >>   BTRFS info (device nvme2n1): found 510 extents, stage: move data extents
> >>
> >> Lower the log level to debug for these messages.
> >>
> >> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> > 
> > I kind of like that the message is in the system log on the info level,
> > it's a high level operation and tracks the progress. Also it's been
> > there forever and I don't think I'm the only one used to seeing it
> > there. We have many info messages and vague guidelines when to use it,
> > but I think "once per block group" is still within the intentions.
> > 
> 
> Yes but now that automatic balancing is in place this is spamming all 
> over dmesg.

We could distinguish the reason of relocation so the one started
manually will print what we have now and the automatic only say two
messages like "starting automatic bg cleanup" and once it finishes some
kind of summary "bg cleanup removed 123 block groups". If you want
additional debugging just for the automatic reclaim then it's OK.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure
  2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
                   ` (7 preceding siblings ...)
  2025-06-27  9:19 ` [PATCH RFC 8/9] btrfs: lower log level of relocation messages Johannes Thumshirn
@ 2025-06-27  9:19 ` Johannes Thumshirn
  2025-06-27 11:38   ` Christoph Hellwig
  2025-06-27 12:14   ` Filipe Manana
  8 siblings, 2 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-27  9:19 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Damien Le Moal, Naohiro Aota, David Sterba, Josef Bacik,
	Boris Burkov, Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

In case find_free_extent() return ENOSPC, check if there are block-groups
in the filsystem which have been marked as 'unused' and if so, reclaim the
space occupied by these block-groups.

Restart the search for free space to place the extent afterwards.

In case the allocation is targeted for the data relocation root, skip this
step, as it can cause deadlocks between block group deletion and relocation.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/block-group.h | 11 +++++++++++
 fs/btrfs/extent-tree.c |  5 +++++
 2 files changed, 16 insertions(+)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index a8bb8429c966..d5c91db88456 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -396,4 +396,15 @@ int btrfs_use_block_group_size_class(struct btrfs_block_group *bg,
 				     bool force_wrong_size_class);
 bool btrfs_block_group_should_use_size_class(const struct btrfs_block_group *bg);
 
+static inline bool btrfs_has_unused_block_groups(struct btrfs_fs_info *fs_info)
+{
+	bool unused_bgs;
+
+	spin_lock(&fs_info->unused_bgs_lock);
+	unused_bgs = !list_empty(&fs_info->unused_bgs);
+	spin_unlock(&fs_info->unused_bgs_lock);
+
+	return unused_bgs;
+}
+
 #endif /* BTRFS_BLOCK_GROUP_H */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index da731f6d4dad..34d21713c6ab 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4683,6 +4683,11 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 	if (!ret && !is_data) {
 		btrfs_dec_block_group_reservations(fs_info, ins->objectid);
 	} else if (ret == -ENOSPC) {
+		if (!btrfs_is_data_reloc_root(root) &&
+		    btrfs_has_unused_block_groups(fs_info)) {
+			btrfs_delete_unused_bgs(fs_info);
+			goto again;
+		}
 		if (!final_tried && ins->offset) {
 			num_bytes = min(num_bytes >> 1, ins->offset);
 			num_bytes = round_down(num_bytes,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure
  2025-06-27  9:19 ` [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure Johannes Thumshirn
@ 2025-06-27 11:38   ` Christoph Hellwig
  2025-06-30 11:45     ` Johannes Thumshirn
  2025-06-27 12:14   ` Filipe Manana
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2025-06-27 11:38 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
	Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn

On Fri, Jun 27, 2025 at 11:19:14AM +0200, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> In case find_free_extent() return ENOSPC, check if there are block-groups
> in the filsystem which have been marked as 'unused' and if so, reclaim the
> space occupied by these block-groups.
> 
> Restart the search for free space to place the extent afterwards.
> 
> In case the allocation is targeted for the data relocation root, skip this
> step, as it can cause deadlocks between block group deletion and relocation.

Assuming an unused BG is one without space in it that just needs a zone
reset or discard (a quick look at the code seems to confirm that, but
with some extra caveats):  why don't you reclaim it ASAP once it becomes
unused, at least modulo those space reservation caveats (which I don't
understand from that quick look).


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure
  2025-06-27 11:38   ` Christoph Hellwig
@ 2025-06-30 11:45     ` Johannes Thumshirn
  2025-06-30 12:05       ` Filipe Manana
  0 siblings, 1 reply; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-30 11:45 UTC (permalink / raw)
  To: hch@infradead.org, Johannes Thumshirn
  Cc: linux-btrfs@vger.kernel.org, Damien Le Moal, Naohiro Aota,
	David Sterba, Josef Bacik, Boris Burkov, Filipe Manana

On 27.06.25 13:39, Christoph Hellwig wrote:
> On Fri, Jun 27, 2025 at 11:19:14AM +0200, Johannes Thumshirn wrote:
>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> In case find_free_extent() return ENOSPC, check if there are block-groups
>> in the filsystem which have been marked as 'unused' and if so, reclaim the
>> space occupied by these block-groups.
>>
>> Restart the search for free space to place the extent afterwards.
>>
>> In case the allocation is targeted for the data relocation root, skip this
>> step, as it can cause deadlocks between block group deletion and relocation.
> 
> Assuming an unused BG is one without space in it that just needs a zone
> reset or discard (a quick look at the code seems to confirm that, but
> with some extra caveats):  why don't you reclaim it ASAP once it becomes
> unused, at least modulo those space reservation caveats (which I don't
> understand from that quick look).
> 
> 

I've looked into it looks promising. Threw it into fstests and (up to 
now) nothing broke. So I'll run Damien's scripts on a ZNS drive and 
we'll see if it helps.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure
  2025-06-30 11:45     ` Johannes Thumshirn
@ 2025-06-30 12:05       ` Filipe Manana
  0 siblings, 0 replies; 25+ messages in thread
From: Filipe Manana @ 2025-06-30 12:05 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: hch@infradead.org, Johannes Thumshirn,
	linux-btrfs@vger.kernel.org, Damien Le Moal, Naohiro Aota,
	David Sterba, Josef Bacik, Boris Burkov, Filipe Manana

On Mon, Jun 30, 2025 at 12:46 PM Johannes Thumshirn
<Johannes.Thumshirn@wdc.com> wrote:
>
> On 27.06.25 13:39, Christoph Hellwig wrote:
> > On Fri, Jun 27, 2025 at 11:19:14AM +0200, Johannes Thumshirn wrote:
> >> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> >>
> >> In case find_free_extent() return ENOSPC, check if there are block-groups
> >> in the filsystem which have been marked as 'unused' and if so, reclaim the
> >> space occupied by these block-groups.
> >>
> >> Restart the search for free space to place the extent afterwards.
> >>
> >> In case the allocation is targeted for the data relocation root, skip this
> >> step, as it can cause deadlocks between block group deletion and relocation.
> >
> > Assuming an unused BG is one without space in it that just needs a zone
> > reset or discard (a quick look at the code seems to confirm that, but
> > with some extra caveats):  why don't you reclaim it ASAP once it becomes
> > unused, at least modulo those space reservation caveats (which I don't
> > understand from that quick look).
> >
> >
>
> I've looked into it looks promising. Threw it into fstests and (up to
> now) nothing broke. So I'll run Damien's scripts on a ZNS drive and
> we'll see if it helps.

That brings a new problem.

For example a data block group becomes empty and you delete it immediately.
If a data allocation happens before the transaction used to delete the
block group is committed and there are no other data block groups with
enough space and there's no more unallocated device space, we will
-ENOSPC, whereas before we wouldn't.

Remember that a delete block group's space can only be allocated again
after the transaction used to delete it is committed, to respect COW
semantics.
That's why you see the allocator using the commit root (at
find_free_dev_extent()).
And you can't commit the transaction as soon as the bg becomes used,
as we're holding a transaction handle open and would deadlock.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure
  2025-06-27  9:19 ` [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure Johannes Thumshirn
  2025-06-27 11:38   ` Christoph Hellwig
@ 2025-06-27 12:14   ` Filipe Manana
  1 sibling, 0 replies; 25+ messages in thread
From: Filipe Manana @ 2025-06-27 12:14 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: linux-btrfs, Damien Le Moal, Naohiro Aota, David Sterba,
	Josef Bacik, Boris Burkov, Filipe Manana, Johannes Thumshirn

On Fri, Jun 27, 2025 at 10:36 AM Johannes Thumshirn <jth@kernel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> In case find_free_extent() return ENOSPC, check if there are block-groups
> in the filsystem which have been marked as 'unused' and if so, reclaim the
> space occupied by these block-groups.
>
> Restart the search for free space to place the extent afterwards.
>
> In case the allocation is targeted for the data relocation root, skip this
> step, as it can cause deadlocks between block group deletion and relocation.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/block-group.h | 11 +++++++++++
>  fs/btrfs/extent-tree.c |  5 +++++
>  2 files changed, 16 insertions(+)
>
> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> index a8bb8429c966..d5c91db88456 100644
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -396,4 +396,15 @@ int btrfs_use_block_group_size_class(struct btrfs_block_group *bg,
>                                      bool force_wrong_size_class);
>  bool btrfs_block_group_should_use_size_class(const struct btrfs_block_group *bg);
>
> +static inline bool btrfs_has_unused_block_groups(struct btrfs_fs_info *fs_info)
> +{
> +       bool unused_bgs;
> +
> +       spin_lock(&fs_info->unused_bgs_lock);
> +       unused_bgs = !list_empty(&fs_info->unused_bgs);
> +       spin_unlock(&fs_info->unused_bgs_lock);
> +
> +       return unused_bgs;
> +}
> +
>  #endif /* BTRFS_BLOCK_GROUP_H */
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index da731f6d4dad..34d21713c6ab 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4683,6 +4683,11 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
>         if (!ret && !is_data) {
>                 btrfs_dec_block_group_reservations(fs_info, ins->objectid);
>         } else if (ret == -ENOSPC) {
> +               if (!btrfs_is_data_reloc_root(root) &&
> +                   btrfs_has_unused_block_groups(fs_info)) {
> +                       btrfs_delete_unused_bgs(fs_info);
> +                       goto again;
> +               }

Unfortunately this won't solve the -ENOSPC.

A deleted block group can't be reused in the same transaction, we have
to commit the transaction used to delete it.
This is to respect COW semantics and crash proof consistency.
And we can't commit here the transaction since that would deadlock for
any path that holds a transaction handle open, such as when modifying
any tree for example.

So unless some other task happens to commit the transaction used by
btrfs_delete_unused_bgs() after we call btrfs_delete_unused_bgs() and
before we retry the extent reservation when jumping to 'again',
-ENOSPC will happen again.


>                 if (!final_tried && ins->offset) {
>                         num_bytes = min(num_bytes >> 1, ins->offset);
>                         num_bytes = round_down(num_bytes,
> --
> 2.49.0
>
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2025-07-02 15:36 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-27  9:19 [PATCH RFC 0/9] btrfs: zoned: fixes for garbage collection under preassure Johannes Thumshirn
2025-06-27  9:19 ` [PATCH RFC 1/9] btrfs: zoned: do not select metadata BG as finish target Johannes Thumshirn
2025-06-27 11:34   ` Christoph Hellwig
2025-07-02 15:34     ` Naohiro Aota
2025-06-27  9:19 ` [PATCH RFC 2/9] btrfs: zoned: get rid of relocation_bg_lock Johannes Thumshirn
2025-06-27  9:19 ` [PATCH RFC 3/9] btrfs: zoned: get rid of treelog_bg_lock Johannes Thumshirn
2025-06-27  9:19 ` [PATCH RFC 4/9] btrfs: zoned: don't hold space_info lock on zoned allocation Johannes Thumshirn
2025-06-27  9:19 ` [PATCH RFC 5/9] btrfs: remove delalloc_root_mutex Johannes Thumshirn
2025-06-27 12:42   ` Filipe Manana
2025-06-27  9:19 ` [PATCH RFC 6/9] btrfs: remove btrfs_root's delalloc_mutex Johannes Thumshirn
2025-06-27 12:30   ` Filipe Manana
2025-06-27  9:19 ` [PATCH RFC 7/9] btrfs: lower auto-reclaim message log level Johannes Thumshirn
2025-06-27 11:35   ` Christoph Hellwig
2025-06-27 23:24   ` kernel test robot
2025-06-27  9:19 ` [PATCH RFC 8/9] btrfs: lower log level of relocation messages Johannes Thumshirn
2025-06-27 11:36   ` Christoph Hellwig
2025-06-27 23:44   ` kernel test robot
2025-06-30 17:12   ` David Sterba
2025-07-01  5:09     ` Johannes Thumshirn
2025-07-01 14:43       ` David Sterba
2025-06-27  9:19 ` [PATCH RFC 9/9] btrfs: remove unused bgs on allocation failure Johannes Thumshirn
2025-06-27 11:38   ` Christoph Hellwig
2025-06-30 11:45     ` Johannes Thumshirn
2025-06-30 12:05       ` Filipe Manana
2025-06-27 12:14   ` Filipe Manana

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.