[PATCH 0/3] btrfs: zoned fix two long standing ENOSPC bugs

public inbox for linux-btrfs@vger.kernel.org

* [PATCH 0/3] btrfs: zoned fix two long standing ENOSPC bugs
@ 2026-02-09 14:36 Johannes Thumshirn
  2026-02-09 14:36 ` [PATCH 1/3] btrfs: zoned: cap delayed refs metadata reservation to avoid overcommit Johannes Thumshirn
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Johannes Thumshirn @ 2026-02-09 14:36 UTC
  To: linux-btrfs
  Cc: Filipe Manana, Damien Le Moal, Naohiro Aota, Christoph Hellwig,
	Hans Holmberg, Boris Burkov, David Sterba, Johannes Thumshirn

This series fixes two long-standing bugs in zoned BTRFS that lead to
premature ENOSPC.

Patch 1 caps metadata reservations so that
`bytes_may_use + bytes_zone_unusable` cannot climb beyond a space_info's
capacity.

Patch 2 moves a block group to the reclaim list if it cannot be deleted
in btrfs_delete_unused_bgs() because it still contains user data and more
than half of it is zone_unusable.

Patch 3 adds a new zoned-only state to the flush machinery that reclaims
block groups eligible for reclaim.


For all these patches a single reproducer was used (and it can be turned
into a fstest):

```
#!/bin/bash

SCRATCH_DEV="/dev/vdb"
SCRATCH_MNT="/tmp/scratch"
MOUNT_OPTIONS="-o enospc_debug"

mkdir -p $SCRATCH_MNT
mkfs.btrfs -f $SCRATCH_DEV
mount $MOUNT_OPTIONS $SCRATCH_DEV $SCRATCH_MNT

blocks="$(df -TB 1G $SCRATCH_DEV | awk '/btrfs/ { print $3}')"

loops=$(echo "$blocks * 4 - 2" | bc)

for (( i = 0; i < $loops; i++)); do
	dd if=/dev/zero of=$SCRATCH_MNT/test bs=1M count=1024 > /dev/null
	if [ $? -ne 0 ]; then
		break
	fi
done

umount $SCRATCH_DEV
btrfs check $SCRATCH_DEV
```

Johannes Thumshirn (3):
  btrfs: zoned: cap delayed refs metadata reservation to avoid
    overcommit
  btrfs: zoned: move partially zone_unusable block groups to reclaim
    list
  btrfs: zoned: add zone reclaim flush state for DATA space_info

 fs/btrfs/block-group.c | 16 ++++++++++++++++
 fs/btrfs/delayed-ref.c | 26 ++++++++++++++++++++++++++
 fs/btrfs/space-info.c  | 12 ++++++++++++
 fs/btrfs/space-info.h  |  1 +
 fs/btrfs/transaction.c |  7 +++++++
 5 files changed, 62 insertions(+)

-- 
2.53.0



* [PATCH 1/3] btrfs: zoned: cap delayed refs metadata reservation to avoid overcommit
  2026-02-09 14:36 [PATCH 0/3] btrfs: zoned fix two long standing ENOSPC bugs Johannes Thumshirn
@ 2026-02-09 14:36 ` Johannes Thumshirn
  2026-02-09 16:47   ` Filipe Manana
  2026-02-09 14:36 ` [PATCH 2/3] btrfs: zoned: move partially zone_unusable block groups to reclaim list Johannes Thumshirn
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Johannes Thumshirn @ 2026-02-09 14:36 UTC
  To: linux-btrfs
  Cc: Filipe Manana, Damien Le Moal, Naohiro Aota, Christoph Hellwig,
	Hans Holmberg, Boris Burkov, David Sterba, Johannes Thumshirn

On zoned filesystems metadata space accounting can become overly optimistic
due to delayed refs reservations growing without a hard upper bound.

The delayed_refs_rsv block reservation is allowed to speculatively grow and
is only backed by actual metadata space when refilled. On zoned devices this
can result in delayed_refs_rsv reserving a large portion of metadata space
that is already effectively unusable due to zone write pointer constraints.
As a result, space_info->may_use can grow far beyond the usable metadata
capacity, causing the allocator to believe space is available when it is not.

This leads to premature ENOSPC failures and "cannot satisfy tickets" reports
even though commits would be able to make progress by flushing delayed refs.

Analysis of "-o enospc_debug" dumps using a Python debug script
confirmed that delayed_refs_rsv was responsible for the majority of
metadata overcommit on zoned devices. By correlating space_info counters
(total, used, may_use, zone_unusable) across transactions, the analysis
showed that may_use continued to grow even after usable metadata space
was exhausted, with delayed refs refills accounting for the excess
reservations.

Here's the output of the analysis:

  ======================================================================
  Space Type: METADATA
  ======================================================================

  Raw Values:
    Total:                256.00 MB (268435456 bytes)
    Used:                 128.00 KB (131072 bytes)
    Pinned:                16.00 KB (16384 bytes)
    Reserved:             144.00 KB (147456 bytes)
    May Use:              255.48 MB (267894784 bytes)
    Zone Unusable:        192.00 KB (196608 bytes)

  Calculated Metrics:
    Actually Usable:       255.81 MB (total - zone_unusable)
    Committed:             255.77 MB (used + pinned + reserved + may_use)
    Consumed:              320.00 KB (used + zone_unusable)

  Percentages:
    Zone Unusable:    0.07% of total
    May Use:         99.80% of total
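
The derived figures follow directly from the raw counters. As a minimal
sketch (this is not the analyze-space_info.py script itself, whose internals
are not shown here; the function and key names are made up for
illustration), the arithmetic can be reproduced like this:

```python
def space_info_metrics(total, used, pinned, reserved, may_use, zone_unusable):
    """Derive the metrics shown in the dump from the raw byte counters."""
    return {
        "usable": total - zone_unusable,
        "committed": used + pinned + reserved + may_use,
        "consumed": used + zone_unusable,
        "may_use_pct": 100.0 * may_use / total,
        "zone_unusable_pct": 100.0 * zone_unusable / total,
    }

MIB = 1024 * 1024

# Raw values copied from the METADATA dump above.
m = space_info_metrics(total=268435456, used=131072, pinned=16384,
                       reserved=147456, may_use=267894784,
                       zone_unusable=196608)
print(f"usable:    {m['usable'] / MIB:.2f} MB")
print(f"committed: {m['committed'] / MIB:.2f} MB")
print(f"headroom:  {(m['usable'] - m['committed']) // 1024} KB")
```

With these inputs the committed total comes within 48 KB of the usable
capacity, almost all of it may_use.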

Fix this by adding a zoned-specific cap in btrfs_delayed_refs_rsv_refill():
Before reserving additional metadata bytes, limit the delayed refs
reservation based on the usable metadata space (total bytes minus
zone_unusable). If the reservation would exceed this cap, return -EAGAIN
to trigger the existing flush/commit logic instead of overcommitting
metadata space.
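
The check can be modelled in userspace. This is a rough sketch in Python
rather than kernel C, assuming the halved cap that the diff below computes
with div_u64(usable, 2); `zoned_cap_check` and its parameters are
illustrative names, and the METADATA flag test from the patch is left out
for brevity.

```python
import errno

def zoned_cap_check(is_zoned, total_bytes, bytes_zone_unusable, rsv_size):
    """Return 0 if the reservation may proceed, -EAGAIN if it exceeds the cap."""
    if not is_zoned:
        return 0  # non-zoned filesystems are unaffected
    usable = total_bytes - bytes_zone_unusable
    cap = usable // 2  # mirrors div_u64(usable, 2) in the patch
    return -errno.EAGAIN if rsv_size > cap else 0

# Counters taken from the METADATA dump above; the reservation sizes are made up.
TOTAL, UNUSABLE = 268435456, 196608
small = zoned_cap_check(True, TOTAL, UNUSABLE, 64 * 1024 * 1024)
large = zoned_cap_check(True, TOTAL, UNUSABLE, 200 * 1024 * 1024)
print(small, large == -errno.EAGAIN)
```

A caller that sees -EAGAIN is expected to commit the current transaction and
retry the refill, which is what the start_transaction() hunk below does.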

This preserves the existing reservation and flushing semantics while
preventing metadata overcommit on zoned devices. The change is limited to
metadata space and does not affect non-zoned filesystems.

This patch addresses premature metadata ENOSPC conditions on zoned devices
and ensures delayed refs are throttled before exhausting usable metadata.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---

If anyone's interested, I've pushed the space_info analysis tool to:
https://github.com/morbidrsa/debug-scripts/blob/master/analyze-space_info.py
---
 fs/btrfs/delayed-ref.c | 26 ++++++++++++++++++++++++++
 fs/btrfs/transaction.c |  7 +++++++
 2 files changed, 33 insertions(+)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index e8bc37453336..dc6a2685a5da 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -207,6 +207,28 @@ void btrfs_dec_delayed_refs_rsv_bg_updates(struct btrfs_fs_info *fs_info)
  * This will refill the delayed block_rsv up to 1 items size worth of space and
  * will return -ENOSPC if we can't make the reservation.
  */
+static int btrfs_zoned_cap_metadata_reservation(struct btrfs_space_info *space_info,
+						struct btrfs_block_rsv *block_rsv)
+{
+	struct btrfs_fs_info *fs_info = space_info->fs_info;
+	u64 usable;
+	u64 cap;
+
+	if (!btrfs_is_zoned(fs_info))
+		return 0;
+
+	if (!(space_info->flags & BTRFS_BLOCK_GROUP_METADATA))
+		return 0;
+
+	usable = space_info->total_bytes - space_info->bytes_zone_unusable;
+	cap = div_u64(usable, 2);
+
+	if (block_rsv->size > cap)
+		return -EAGAIN;
+
+	return 0;
+}
+
 int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
 				  enum btrfs_reserve_flush_enum flush)
 {
@@ -228,6 +250,10 @@ int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
 	if (!num_bytes)
 		return 0;

+	ret = btrfs_zoned_cap_metadata_reservation(space_info, block_rsv);
+	if (ret)
+		return ret;
+
 	ret = btrfs_reserve_metadata_bytes(space_info, num_bytes, flush);
 	if (ret)
 		return ret;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 0b2498749b1e..422c967a0916 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -678,6 +678,13 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 		 * here.
 		 */
 		ret = btrfs_delayed_refs_rsv_refill(fs_info, flush);
+		if (ret == -EAGAIN) {
+			ret = btrfs_commit_current_transaction(root);
+			if (ret)
+				goto reserve_fail;
+			ret = btrfs_delayed_refs_rsv_refill(fs_info, flush);
+		}
+
 		if (ret)
 			goto reserve_fail;
 	}
-- 
2.53.0


* [PATCH 2/3] btrfs: zoned: move partially zone_unusable block groups to reclaim list
  2026-02-09 14:36 [PATCH 0/3] btrfs: zoned fix two long standing ENOSPC bugs Johannes Thumshirn
  2026-02-09 14:36 ` [PATCH 1/3] btrfs: zoned: cap delayed refs metadata reservation to avoid overcommit Johannes Thumshirn
@ 2026-02-09 14:36 ` Johannes Thumshirn
  2026-02-09 16:56   ` Filipe Manana
  2026-02-09 14:36 ` [PATCH 3/3] btrfs: zoned: add zone reclaim flush state for DATA space_info Johannes Thumshirn
  2026-02-09 14:51 ` [PATCH 0/3] btrfs: zoned fix two long standing ENOSPC bugs Christoph Hellwig
  3 siblings, 1 reply; 8+ messages in thread
From: Johannes Thumshirn @ 2026-02-09 14:36 UTC
  To: linux-btrfs
  Cc: Filipe Manana, Damien Le Moal, Naohiro Aota, Christoph Hellwig,
	Hans Holmberg, Boris Burkov, David Sterba, Johannes Thumshirn

On zoned block devices, block groups accumulate zone_unusable space
(space between the write pointer and zone end that cannot be allocated
until the zone is reset). When a block group becomes mostly
zone_unusable but still contains some valid data and it gets added to the
unused_bgs list it can never be deleted because it's not actually empty.

The deletion code (btrfs_delete_unused_bgs) skips these block groups
due to the btrfs_is_block_group_used() check, leaving them on the
unused_bgs list indefinitely. This causes two problems:
1. The block groups are never reclaimed, permanently wasting space
2. Eventually leads to ENOSPC even though reclaimable space exists

Fix by detecting block groups where zone_unusable exceeds 50% of the
block group size. Move these to the reclaim_bgs list instead of
skipping them. This triggers btrfs_reclaim_bgs_work() which:
1. Marks the block group read-only
2. Relocates the remaining valid data via btrfs_relocate_chunk()
3. Removes the emptied block group
4. Resets the zones, converting zone_unusable back to usable space

The 50% threshold ensures we only reclaim block groups where most space
is unusable, making relocation worthwhile. Block groups with less
zone_unusable are left on unused_bgs to potentially become fully empty
through normal deletion.
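
To make the threshold concrete, here is a small userspace sketch in Python
of the decision added to btrfs_delete_unused_bgs(). The function name and
the boolean `in_use` (standing in for btrfs_is_block_group_used()) are
illustrative assumptions, not kernel API, and the block group size is
hypothetical.

```python
def move_to_reclaim(is_zoned, in_use, zone_unusable, length):
    """True if a block group on unused_bgs should go to reclaim_bgs instead."""
    return is_zoned and in_use and zone_unusable > length // 2

LENGTH = 256 * 1024 * 1024  # hypothetical block group size in bytes

mostly_unusable = move_to_reclaim(True, True, LENGTH * 3 // 4, LENGTH)
exactly_half = move_to_reclaim(True, True, LENGTH // 2, LENGTH)
print(mostly_unusable, exactly_half)
```

Because the comparison is strict, a block group that is exactly half
zone_unusable is not moved to the reclaim list.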

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/block-group.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 3186ed4fd26d..1fb23834d90c 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1597,6 +1597,22 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)

 		spin_lock(&space_info->lock);
 		spin_lock(&block_group->lock);
+
+		if (btrfs_is_zoned(fs_info) && btrfs_is_block_group_used(block_group) &&
+		    block_group->zone_unusable > div_u64(block_group->length, 2)) {
+			/*
+			 * If the block group has data left, but at least half
+			 * of the block group is zone_unusable, mark it as
+			 * reclaimable before continuing with the next block group.
+			 */
+			btrfs_mark_bg_to_reclaim(block_group);
+
+			spin_unlock(&block_group->lock);
+			spin_unlock(&space_info->lock);
+			up_write(&space_info->groups_sem);
+			goto next;
+		}
+
 		if (btrfs_is_block_group_used(block_group) ||
 		    (block_group->ro && !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) ||
 		    list_is_singular(&block_group->list) ||
-- 
2.53.0


* [PATCH 3/3] btrfs: zoned: add zone reclaim flush state for DATA space_info
  2026-02-09 14:36 [PATCH 0/3] btrfs: zoned fix two long standing ENOSPC bugs Johannes Thumshirn
  2026-02-09 14:36 ` [PATCH 1/3] btrfs: zoned: cap delayed refs metadata reservation to avoid overcommit Johannes Thumshirn
  2026-02-09 14:36 ` [PATCH 2/3] btrfs: zoned: move partially zone_unusable block groups to reclaim list Johannes Thumshirn
@ 2026-02-09 14:36 ` Johannes Thumshirn
  2026-02-09 16:59   ` Filipe Manana
  2026-02-09 14:51 ` [PATCH 0/3] btrfs: zoned fix two long standing ENOSPC bugs Christoph Hellwig
  3 siblings, 1 reply; 8+ messages in thread
From: Johannes Thumshirn @ 2026-02-09 14:36 UTC
  To: linux-btrfs
  Cc: Filipe Manana, Damien Le Moal, Naohiro Aota, Christoph Hellwig,
	Hans Holmberg, Boris Burkov, David Sterba, Johannes Thumshirn

On zoned block devices, DATA block groups can accumulate large amounts
of zone_unusable space (space between the write pointer and zone end).
When zone_unusable reaches high levels (e.g., 98% of total space), new
allocations fail with ENOSPC even though space could be reclaimed by
relocating data and resetting zones.

The existing flush states don't handle this scenario effectively - they
either try to free cached space (which doesn't exist for zone_unusable)
or reset empty zones (which doesn't help when zones contain valid data
mixed with zone_unusable space).

Add a new RECLAIM_ZONES flush state that triggers the block group
reclaim machinery. This state:
- Calls btrfs_reclaim_sweep() to identify reclaimable block groups
- Calls btrfs_reclaim_bgs() to queue reclaim work
- Waits for reclaim_bgs_work to complete via flush_work()
- Commits the transaction to finalize changes

The reclaim work (btrfs_reclaim_bgs_work) safely relocates valid data
from fragmented block groups to other locations before resetting zones,
converting zone_unusable space back into usable space.

Insert RECLAIM_ZONES before RESET_ZONES in data_flush_states so that
we attempt to reclaim partially-used block groups before falling back
to resetting completely empty ones.
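
One detail worth spelling out: the order in which flush states run is given
by their position in data_flush_states, not by their numeric enum value.
The sketch below models this in Python using only the enum values and array
entries visible in the diff below (FLUSH_DELALLOC_FULL and ALLOC_CHUNK_FORCE
are omitted because their values are not shown); the class and list names
are illustrative.

```python
from enum import IntEnum

class FlushState(IntEnum):
    RUN_DELAYED_IPUTS = 10
    COMMIT_TRANS = 11
    RESET_ZONES = 12
    RECLAIM_ZONES = 13  # new state, appended at the end of the enum

# Order of the visible data_flush_states entries after this patch.
DATA_FLUSH_ORDER = [
    FlushState.RUN_DELAYED_IPUTS,
    FlushState.COMMIT_TRANS,
    FlushState.RECLAIM_ZONES,
    FlushState.RESET_ZONES,
]

for step, state in enumerate(DATA_FLUSH_ORDER, start=1):
    print(step, state.name, int(state))
```

RECLAIM_ZONES has the larger enum value but sits earlier in the array, so
it is tried before RESET_ZONES.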

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/space-info.c | 12 ++++++++++++
 fs/btrfs/space-info.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index bb5aac7ee9d2..1d5d4f33116d 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -902,6 +902,17 @@ static void flush_space(struct btrfs_space_info *space_info, u64 num_bytes,
 		if (ret > 0 || ret == -ENOSPC)
 			ret = 0;
 		break;
+	case RECLAIM_ZONES:
+		ret = 0;
+		if (btrfs_is_zoned(fs_info)) {
+			btrfs_reclaim_sweep(fs_info);
+			btrfs_delete_unused_bgs(fs_info);
+			btrfs_reclaim_bgs(fs_info);
+			flush_work(&fs_info->reclaim_bgs_work);
+			ASSERT(current->journal_info == NULL);
+			ret = btrfs_commit_current_transaction(root);
+		}
+		break;
 	case RUN_DELAYED_IPUTS:
 		/*
 		 * If we have pending delayed iputs then we could free up a
@@ -1400,6 +1411,7 @@ static const enum btrfs_flush_state data_flush_states[] = {
 	FLUSH_DELALLOC_FULL,
 	RUN_DELAYED_IPUTS,
 	COMMIT_TRANS,
+	RECLAIM_ZONES,
 	RESET_ZONES,
 	ALLOC_CHUNK_FORCE,
 };
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index 0703f24b23f7..4359e3a89b41 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -96,6 +96,7 @@ enum btrfs_flush_state {
 	RUN_DELAYED_IPUTS	= 10,
 	COMMIT_TRANS		= 11,
 	RESET_ZONES		= 12,
+	RECLAIM_ZONES		= 13,
 };
 
 enum btrfs_space_info_sub_group {
-- 
2.53.0



* Re: [PATCH 0/3] btrfs: zoned fix two long standing ENOSPC bugs
  2026-02-09 14:36 [PATCH 0/3] btrfs: zoned fix two long standing ENOSPC bugs Johannes Thumshirn
                   ` (2 preceding siblings ...)
  2026-02-09 14:36 ` [PATCH 3/3] btrfs: zoned: add zone reclaim flush state for DATA space_info Johannes Thumshirn
@ 2026-02-09 14:51 ` Christoph Hellwig
  3 siblings, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2026-02-09 14:51 UTC
  To: Johannes Thumshirn
  Cc: linux-btrfs, Filipe Manana, Damien Le Moal, Naohiro Aota,
	Christoph Hellwig, Hans Holmberg, Boris Burkov, David Sterba

On Mon, Feb 09, 2026 at 03:36:41PM +0100, Johannes Thumshirn wrote:
> For all these patches a single reproducer was used (and it can be turned
> into a fstest):

Yes, please.



* Re: [PATCH 1/3] btrfs: zoned: cap delayed refs metadata reservation to avoid overcommit
  2026-02-09 14:36 ` [PATCH 1/3] btrfs: zoned: cap delayed refs metadata reservation to avoid overcommit Johannes Thumshirn
@ 2026-02-09 16:47   ` Filipe Manana
  0 siblings, 0 replies; 8+ messages in thread
From: Filipe Manana @ 2026-02-09 16:47 UTC
  To: Johannes Thumshirn
  Cc: linux-btrfs, Filipe Manana, Damien Le Moal, Naohiro Aota,
	Christoph Hellwig, Hans Holmberg, Boris Burkov, David Sterba

On Mon, Feb 9, 2026 at 2:46 PM Johannes Thumshirn
<johannes.thumshirn@wdc.com> wrote:
>
> On zoned filesystems metadata space accounting can become overly optimistic
> due to delayed refs reservations growing without a hard upper bound.
>
> The delayed_refs_rsv block reservation is allowed to speculatively grow and
> is only backed by actual metadata space when refilled. On zoned devices this
> can result in delayed_refs_rsv reserving a large portion of metadata space
> that is already effectively unusable due to zone write pointer constraints.
> As a result, space_info->may_use can grow far beyond the usable metadata
> capacity, causing the allocator to believe space is available when it is not.
>
> This leads to premature ENOSPC failures and "cannot satisfy tickets" reports
> even though commits would be able to make progress by flushing delayed refs.
>
> Analysis of "-o enospc_debug" dumps using a Python debug script
> confirmed that delayed_refs_rsv was responsible for the majority of
> metadata overcommit on zoned devices. By correlating space_info counters
> (total, used, may_use, zone_unusable) across transactions, the analysis
> showed that may_use continued to grow even after usable metadata space
> was exhausted, with delayed refs refills accounting for the excess
> reservations.
>
> Here's the output of the analysis:
>
>   ======================================================================
>   Space Type: METADATA
>   ======================================================================
>
>   Raw Values:
>     Total:                256.00 MB (268435456 bytes)
>     Used:                 128.00 KB (131072 bytes)
>     Pinned:                16.00 KB (16384 bytes)
>     Reserved:             144.00 KB (147456 bytes)
>     May Use:              255.48 MB (267894784 bytes)
>     Zone Unusable:        192.00 KB (196608 bytes)
>
>   Calculated Metrics:
>     Actually Usable:       255.81 MB (total - zone_unusable)
>     Committed:             255.77 MB (used + pinned + reserved + may_use)
>     Consumed:              320.00 KB (used + zone_unusable)
>
>   Percentages:
>     Zone Unusable:    0.07% of total
>     May Use:         99.80% of total
>
> Fix this by adding a zoned-specific cap in btrfs_delayed_refs_rsv_refill():
> Before reserving additional metadata bytes, limit the delayed refs
> reservation based on the usable metadata space (total bytes minus
> zone_unusable). If the reservation would exceed this cap, return -EAGAIN
> to trigger the existing flush/commit logic instead of overcommitting
> metadata space.
>
> This preserves the existing reservation and flushing semantics while
> preventing metadata overcommit on zoned devices. The change is limited to
> metadata space and does not affect non-zoned filesystems.
>
> This patch addresses premature metadata ENOSPC conditions on zoned devices
> and ensures delayed refs are throttled before exhausting usable metadata.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>
> If anyone's interested, I've pushed the space_info analysis tool to:
> https://github.com/morbidrsa/debug-scripts/blob/master/analyze-space_info.py
> ---
>  fs/btrfs/delayed-ref.c | 26 ++++++++++++++++++++++++++
>  fs/btrfs/transaction.c |  7 +++++++
>  2 files changed, 33 insertions(+)
>
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index e8bc37453336..dc6a2685a5da 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -207,6 +207,28 @@ void btrfs_dec_delayed_refs_rsv_bg_updates(struct btrfs_fs_info *fs_info)
>   * This will refill the delayed block_rsv up to 1 items size worth of space and
>   * will return -ENOSPC if we can't make the reservation.
>   */
> +static int btrfs_zoned_cap_metadata_reservation(struct btrfs_space_info *space_info,
> +                                               struct btrfs_block_rsv *block_rsv)

Can we simplify and not pass the block reserve?
This is meant only for delayed refs rsv, so in the function we can
just access fs_info->delayed_refs_rsv.

> +{
> +       struct btrfs_fs_info *fs_info = space_info->fs_info;
> +       u64 usable;
> +       u64 cap;
> +
> +       if (!btrfs_is_zoned(fs_info))
> +               return 0;
> +
> +       if (!(space_info->flags & BTRFS_BLOCK_GROUP_METADATA))
> +               return 0;

Removing the need for this check too, since the delayed refs rsv is metadata.

> +
> +       usable = space_info->total_bytes - space_info->bytes_zone_unusable;

This should be done while holding the space_info's lock, otherwise it's racy.

> +       cap = div_u64(usable, 2);

Could also be: usable >> 1
And then directly used below and eliminate the need for the variable.

> +
> +       if (block_rsv->size > cap)

Same as above, racy, the block reserve is not locked.

> +               return -EAGAIN;
> +
> +       return 0;
> +}
> +
>  int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
>                                   enum btrfs_reserve_flush_enum flush)
>  {
> @@ -228,6 +250,10 @@ int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
>         if (!num_bytes)
>                 return 0;
>
> +       ret = btrfs_zoned_cap_metadata_reservation(space_info, block_rsv);
> +       if (ret)
> +               return ret;
> +
>         ret = btrfs_reserve_metadata_bytes(space_info, num_bytes, flush);
>         if (ret)
>                 return ret;
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 0b2498749b1e..422c967a0916 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -678,6 +678,13 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
>                  * here.
>                  */
>                 ret = btrfs_delayed_refs_rsv_refill(fs_info, flush);
> +               if (ret == -EAGAIN) {

Probably make it explicit this is expected for zoned only, adding an
ASSERT(btrfs_is_zoned(fs_info)).

Otherwise it looks fine to me, thanks.


> +                       ret = btrfs_commit_current_transaction(root);
> +                       if (ret)
> +                               goto reserve_fail;
> +                       ret = btrfs_delayed_refs_rsv_refill(fs_info, flush);
> +               }
> +
>                 if (ret)
>                         goto reserve_fail;
>         }
> --
> 2.53.0
>
>


* Re: [PATCH 2/3] btrfs: zoned: move partially zone_unusable block groups to reclaim list
  2026-02-09 14:36 ` [PATCH 2/3] btrfs: zoned: move partially zone_unusable block groups to reclaim list Johannes Thumshirn
@ 2026-02-09 16:56   ` Filipe Manana
  0 siblings, 0 replies; 8+ messages in thread
From: Filipe Manana @ 2026-02-09 16:56 UTC
  To: Johannes Thumshirn
  Cc: linux-btrfs, Filipe Manana, Damien Le Moal, Naohiro Aota,
	Christoph Hellwig, Hans Holmberg, Boris Burkov, David Sterba

On Mon, Feb 9, 2026 at 2:38 PM Johannes Thumshirn
<johannes.thumshirn@wdc.com> wrote:
>
> On zoned block devices, block groups accumulate zone_unusable space
> (space between the write pointer and zone end that cannot be allocated
> until the zone is reset). When a block group becomes mostly
> zone_unusable but still contains some valid data and it gets added to the
> unused_bgs list it can never be deleted because it's not actually empty.
>
> The deletion code (btrfs_delete_unused_bgs) skips these block groups
> due to the btrfs_is_block_group_used() check, leaving them on the
> unused_bgs list indefinitely. This causes two problems:
> 1. The block groups are never reclaimed, permanently wasting space
> 2. Eventually leads to ENOSPC even though reclaimable space exists
>
> Fix by detecting block groups where zone_unusable exceeds 50% of the
> block group size. Move these to the reclaim_bgs list instead of
> skipping them. This triggers btrfs_reclaim_bgs_work() which:
> 1. Marks the block group read-only
> 2. Relocates the remaining valid data via btrfs_relocate_chunk()
> 3. Removes the emptied block group
> 4. Resets the zones, converting zone_unusable back to usable space
>
> The 50% threshold ensures we only reclaim block groups where most space
> is unusable, making relocation worthwhile. Block groups with less
> zone_unusable are left on unused_bgs to potentially become fully empty
> through normal deletion.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/block-group.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 3186ed4fd26d..1fb23834d90c 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1597,6 +1597,22 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>
>                 spin_lock(&space_info->lock);
>                 spin_lock(&block_group->lock);
> +
> +               if (btrfs_is_zoned(fs_info) && btrfs_is_block_group_used(block_group) &&
> +                   block_group->zone_unusable > div_u64(block_group->length, 2)) {
> +                       /*
> +                        * If the block group has data left, but at least half

The comment says "at least half", but in the code below we check if
more than half.
So either adjust here or change the code above from ">" to ">=".

> +                        * of the block group is zone_unusable, mark it as
> +                        * reclaimable before continuing with the next block group.
> +                        */
> +                       btrfs_mark_bg_to_reclaim(block_group);

We could avoid adding a locking dependency and call this below after
unlocking the bg, space_info and groups_sem.

Otherwise it looks fine, thanks.

> +
> +                       spin_unlock(&block_group->lock);
> +                       spin_unlock(&space_info->lock);
> +                       up_write(&space_info->groups_sem);
> +                       goto next;
> +               }
> +
>                 if (btrfs_is_block_group_used(block_group) ||
>                     (block_group->ro && !(block_group->flags & BTRFS_BLOCK_GROUP_REMAPPED)) ||
>                     list_is_singular(&block_group->list) ||
> --
> 2.53.0
>
>


* Re: [PATCH 3/3] btrfs: zoned: add zone reclaim flush state for DATA space_info
  2026-02-09 14:36 ` [PATCH 3/3] btrfs: zoned: add zone reclaim flush state for DATA space_info Johannes Thumshirn
@ 2026-02-09 16:59   ` Filipe Manana
  0 siblings, 0 replies; 8+ messages in thread
From: Filipe Manana @ 2026-02-09 16:59 UTC
  To: Johannes Thumshirn
  Cc: linux-btrfs, Filipe Manana, Damien Le Moal, Naohiro Aota,
	Christoph Hellwig, Hans Holmberg, Boris Burkov, David Sterba

On Mon, Feb 9, 2026 at 2:46 PM Johannes Thumshirn
<johannes.thumshirn@wdc.com> wrote:
>
> On zoned block devices, DATA block groups can accumulate large amounts
> of zone_unusable space (space between the write pointer and zone end).
> When zone_unusable reaches high levels (e.g., 98% of total space), new
> allocations fail with ENOSPC even though space could be reclaimed by
> relocating data and resetting zones.
>
> The existing flush states don't handle this scenario effectively - they
> either try to free cached space (which doesn't exist for zone_unusable)
> or reset empty zones (which doesn't help when zones contain valid data
> mixed with zone_unusable space).
>
> Add a new RECLAIM_ZONES flush state that triggers the block group
> reclaim machinery. This state:
> - Calls btrfs_reclaim_sweep() to identify reclaimable block groups
> - Calls btrfs_reclaim_bgs() to queue reclaim work
> - Waits for reclaim_bgs_work to complete via flush_work()
> - Commits the transaction to finalize changes
>
> The reclaim work (btrfs_reclaim_bgs_work) safely relocates valid data
> from fragmented block groups to other locations before resetting zones,
> converting zone_unusable space back into usable space.
>
> Insert RECLAIM_ZONES before RESET_ZONES in data_flush_states so that
> we attempt to reclaim partially-used block groups before falling back
> to resetting completely empty ones.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/space-info.c | 12 ++++++++++++
>  fs/btrfs/space-info.h |  1 +
>  2 files changed, 13 insertions(+)
>
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index bb5aac7ee9d2..1d5d4f33116d 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -902,6 +902,17 @@ static void flush_space(struct btrfs_space_info *space_info, u64 num_bytes,
>                 if (ret > 0 || ret == -ENOSPC)
>                         ret = 0;
>                 break;
> +       case RECLAIM_ZONES:
> +               ret = 0;
> +               if (btrfs_is_zoned(fs_info)) {

It would seem more natural to have:

if (btrfs_is_zoned(fs_info)) {
    (...)
    ret = btrfs_commit_current_transaction(root);
} else {
    ret = 0;
}

Otherwise it looks good, thanks.

Reviewed-by: Filipe Manana <fdmanana@suse.com>


> +                       btrfs_reclaim_sweep(fs_info);
> +                       btrfs_delete_unused_bgs(fs_info);
> +                       btrfs_reclaim_bgs(fs_info);
> +                       flush_work(&fs_info->reclaim_bgs_work);
> +                       ASSERT(current->journal_info == NULL);
> +                       ret = btrfs_commit_current_transaction(root);
> +               }
> +               break;
>         case RUN_DELAYED_IPUTS:
>                 /*
>                  * If we have pending delayed iputs then we could free up a
> @@ -1400,6 +1411,7 @@ static const enum btrfs_flush_state data_flush_states[] = {
>         FLUSH_DELALLOC_FULL,
>         RUN_DELAYED_IPUTS,
>         COMMIT_TRANS,
> +       RECLAIM_ZONES,
>         RESET_ZONES,
>         ALLOC_CHUNK_FORCE,
>  };
> diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
> index 0703f24b23f7..4359e3a89b41 100644
> --- a/fs/btrfs/space-info.h
> +++ b/fs/btrfs/space-info.h
> @@ -96,6 +96,7 @@ enum btrfs_flush_state {
>         RUN_DELAYED_IPUTS       = 10,
>         COMMIT_TRANS            = 11,
>         RESET_ZONES             = 12,
> +       RECLAIM_ZONES           = 13,
>  };
>
>  enum btrfs_space_info_sub_group {
> --
> 2.53.0
>
>

