public inbox for linux-btrfs@vger.kernel.org
* [PATCH 0/3] btrfs: a few space reservation fixes and comment update
@ 2026-02-03 13:02 fdmanana
  2026-02-03 13:02 ` [PATCH 1/3] btrfs: be less aggressive with metadata overcommit when we can do full flushing fdmanana
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: fdmanana @ 2026-02-03 13:02 UTC
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

A couple of fixes for metadata space reservation and a comment update.
Details in the changelogs.

Filipe Manana (3):
  btrfs: be less aggressive with metadata overcommit when we can do full flushing
  btrfs: don't allow log trees to consume global reserve or overcommit metadata
  btrfs: update comment for BTRFS_RESERVE_NO_FLUSH

 fs/btrfs/block-rsv.c  | 25 +++++++++++++++++++++++++
 fs/btrfs/space-info.c |  7 ++++---
 fs/btrfs/space-info.h | 19 ++++++++++++++++++-
 3 files changed, 47 insertions(+), 4 deletions(-)

-- 
2.47.2



* [PATCH 1/3] btrfs: be less aggressive with metadata overcommit when we can do full flushing
  2026-02-03 13:02 [PATCH 0/3] btrfs: a few space reservation fixes and comment update fdmanana
@ 2026-02-03 13:02 ` fdmanana
  2026-02-03 21:02   ` Qu Wenruo
  2026-02-03 13:02 ` [PATCH 2/3] btrfs: don't allow log trees to consume global reserve or overcommit metadata fdmanana
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: fdmanana @ 2026-02-03 13:02 UTC
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Over the years we have often received reports of -ENOSPC failures while
updating metadata that lead to a transaction abort. I have seen this
happen on filesystems of all sizes, with workloads that are very
user/customer specific and that we are unable to reproduce, but Aleksandar
recently reported a simple way to reproduce it with a 1G filesystem and
the bonnie++ benchmark tool. The following test script reproduces the
failure:

    $ cat test.sh
    #!/bin/bash

    # Create and use a 1G null block device, memory backed, otherwise
    # the test takes a very long time.
    modprobe null_blk nr_devices="0"
    null_dev="/sys/kernel/config/nullb/nullb0"
    mkdir "$null_dev"
    size=$((1 * 1024)) # in MB
    echo 2 > "$null_dev/submit_queues"
    echo "$size" > "$null_dev/size"
    echo 1 > "$null_dev/memory_backed"
    echo 1 > "$null_dev/discard"
    echo 1 > "$null_dev/power"

    DEV=/dev/nullb0
    MNT=/mnt/nullb0

    mkfs.btrfs -f $DEV
    mount $DEV $MNT

    mkdir $MNT/test/
    bonnie++ -d $MNT/test/ -m BTRFS -u 0 -s 256M -r 128M -b

    umount $MNT

    echo 0 > "$null_dev/power"
    rmdir "$null_dev"

When running this, bonnie++ fails in the phase where it deletes test
directories and files:

    $ ./test.sh
    (...)
    Using uid:0, gid:0.
    Writing a byte at a time...done
    Writing intelligently...done
    Rewriting...done
    Reading a byte at a time...done
    Reading intelligently...done
    start 'em...done...done...done...done...done...
    Create files in sequential order...done.
    Stat files in sequential order...done.
    Delete files in sequential order...done.
    Create files in random order...done.
    Stat files in random order...done.
    Delete files in random order...Can't sync directory, turning off dir-sync.
    Can't delete file 9Bq7sr0000000338
    Cleaning up test directory after error.
    Bonnie: drastic I/O error (rmdir): Read-only file system

And in the syslog/dmesg we can see the following transaction abort trace:

    [161915.501506] BTRFS warning (device nullb0): Skipping commit of aborted transaction.
    [161915.502983] ------------[ cut here ]------------
    [161915.503832] BTRFS: Transaction aborted (error -28)
    [161915.504748] WARNING: fs/btrfs/transaction.c:2045 at btrfs_commit_transaction+0xa21/0xd30 [btrfs], CPU#11: bonnie++/3377975
    [161915.506786] Modules linked in: btrfs dm_zero dm_snapshot (...)
    [161915.518759] CPU: 11 UID: 0 PID: 3377975 Comm: bonnie++ Tainted: G        W           6.19.0-rc7-btrfs-next-224+ #4 PREEMPT(full)
    [161915.520857] Tainted: [W]=WARN
    [161915.521405] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
    [161915.523414] RIP: 0010:btrfs_commit_transaction+0xa24/0xd30 [btrfs]
    [161915.524630] Code: 48 8b 7c 24 (...)
    [161915.526982] RSP: 0018:ffffd3fe8206fda8 EFLAGS: 00010292
    [161915.527707] RAX: 0000000000000002 RBX: ffff8f4886d3c000 RCX: 0000000000000000
    [161915.528723] RDX: 0000000002040001 RSI: 00000000ffffffe4 RDI: ffffffffc088f780
    [161915.529691] RBP: ffff8f4f5adae7e0 R08: 0000000000000000 R09: ffffd3fe8206fb90
    [161915.530842] R10: ffff8f4f9c1fffa8 R11: 0000000000000003 R12: 00000000ffffffe4
    [161915.532027] R13: ffff8f4ef2cf2400 R14: ffff8f4f5adae708 R15: ffff8f4f62d18000
    [161915.533229] FS:  00007ff93112a780(0000) GS:ffff8f4ff63ee000(0000) knlGS:0000000000000000
    [161915.534611] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [161915.535575] CR2: 00005571b3072000 CR3: 0000000176080005 CR4: 0000000000370ef0
    [161915.536758] Call Trace:
    [161915.537185]  <TASK>
    [161915.537575]  btrfs_sync_file+0x431/0x530 [btrfs]
    [161915.538473]  do_fsync+0x39/0x80
    [161915.539042]  __x64_sys_fsync+0xf/0x20
    [161915.539750]  do_syscall_64+0x50/0xf20
    [161915.540396]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
    [161915.541301] RIP: 0033:0x7ff930ca49ee
    [161915.541904] Code: 08 0f 85 f5 (...)
    [161915.544830] RSP: 002b:00007ffd94291f38 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
    [161915.546152] RAX: ffffffffffffffda RBX: 00007ff93112a780 RCX: 00007ff930ca49ee
    [161915.547263] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
    [161915.548383] RBP: 0000000000000dab R08: 0000000000000000 R09: 0000000000000000
    [161915.549853] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd94291fb0
    [161915.551196] R13: 00007ffd94292350 R14: 0000000000000001 R15: 00007ffd94292340
    [161915.552161]  </TASK>
    [161915.552457] ---[ end trace 0000000000000000 ]---
    [161915.553232] BTRFS info (device nullb0 state A): dumping space info:
    [161915.553236] BTRFS info (device nullb0 state A): space_info DATA (sub-group id 0) has 12582912 free, is not full
    [161915.553239] BTRFS info (device nullb0 state A): space_info total=12582912, used=0, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
    [161915.553243] BTRFS info (device nullb0 state A): space_info METADATA (sub-group id 0) has -5767168 free, is full
    [161915.553245] BTRFS info (device nullb0 state A): space_info total=53673984, used=6635520, pinned=46956544, reserved=16384, may_use=5767168, readonly=65536 zone_unusable=0
    [161915.553251] BTRFS info (device nullb0 state A): space_info SYSTEM (sub-group id 0) has 8355840 free, is not full
    [161915.553254] BTRFS info (device nullb0 state A): space_info total=8388608, used=16384, pinned=16384, reserved=0, may_use=0, readonly=0 zone_unusable=0
    [161915.553257] BTRFS info (device nullb0 state A): global_block_rsv: size 5767168 reserved 5767168
    [161915.553261] BTRFS info (device nullb0 state A): trans_block_rsv: size 0 reserved 0
    [161915.553263] BTRFS info (device nullb0 state A): chunk_block_rsv: size 0 reserved 0
    [161915.553265] BTRFS info (device nullb0 state A): remap_block_rsv: size 0 reserved 0
    [161915.553268] BTRFS info (device nullb0 state A): delayed_block_rsv: size 0 reserved 0
    [161915.553270] BTRFS info (device nullb0 state A): delayed_refs_rsv: size 0 reserved 0
    [161915.553272] BTRFS: error (device nullb0 state A) in cleanup_transaction:2045: errno=-28 No space left
    [161915.554463] BTRFS info (device nullb0 state EA): forced readonly
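
The "-5767168 free" figure on the METADATA line of the dump above can be
reproduced by hand. The sketch below is only an illustration: it assumes
that "free" is the total minus every accounted counter printed on the
following line, and the struct and helper names are hypothetical, not
kernel API.

```c
#include <stdint.h>

/* Counters as printed on a "space_info total=..." line of the dump. */
struct space_counters {
	int64_t total, used, pinned, reserved, may_use, readonly, zone_unusable;
};

/*
 * Hypothetical helper: free space is what remains of "total" after
 * subtracting every accounted counter (assumption, see lead-in).
 */
static int64_t space_info_free(const struct space_counters *c)
{
	return c->total - (c->used + c->pinned + c->reserved +
			   c->may_use + c->readonly + c->zone_unusable);
}

/* Values copied verbatim from the METADATA space_info line above. */
static const struct space_counters metadata_dump = {
	.total = 53673984,
	.used = 6635520,
	.pinned = 46956544,
	.reserved = 16384,
	.may_use = 5767168,
	.readonly = 65536,
	.zone_unusable = 0,
};
```

Calling space_info_free(&metadata_dump) gives -5767168, matching the dump.
Pinned extents account for 46956544 of the 53673984 total bytes (about
87%), and may_use equals the size of the global block reserve (5767168).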

The problem is that we allow for a very aggressive metadata overcommit,
about 1/8th of the currently available space, even when the task
attempting the reservation allows for full flushing. Over time this allows
more and more tasks to overcommit and join the same transaction without
triggering a transaction commit to release pinned extents. This eventually
leads to a transaction abort when attempting some tree update, as the
extent allocator is not able to find any available metadata extent and is
not able to allocate a new metadata block group either (not enough
unallocated space for that).

Fix this by limiting the overcommit to 1/64th of the available
(unallocated) space instead, and by applying that limit to both types of
full flushing, BTRFS_RESERVE_FLUSH_ALL and BTRFS_RESERVE_FLUSH_ALL_STEAL.
This way we get more frequent transaction commits to release pinned
extents in case our caller is in a context where full flushing is allowed.
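
To make the effect of the new limit concrete, here is a minimal
standalone sketch. It models only the branch touched by the diff below.
The enum values and the overcommit_limit() helper are hypothetical
stand-ins, and the other adjustments done by the real
calc_available_free_space() are not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the btrfs_reserve_flush_enum values used here. */
enum flush_mode { NO_FLUSH, FLUSH_ALL, FLUSH_ALL_STEAL };

/*
 * Returns how much of "avail" (unallocated space, in bytes) may be
 * overcommitted. "patched" selects the behaviour after this change.
 */
static uint64_t overcommit_limit(uint64_t avail, enum flush_mode flush,
				 bool patched)
{
	if (patched) {
		/* Both full flushing modes are now limited to 1/64th. */
		if (flush == FLUSH_ALL || flush == FLUSH_ALL_STEAL)
			return avail >> 6;
		return avail >> 1;
	}

	/* Before: only FLUSH_ALL was limited, and only to 1/8th. */
	if (flush == FLUSH_ALL)
		return avail >> 3;
	return avail >> 1;
}
```

For example, with 512M of unallocated space a FLUSH_ALL reservation could
overcommit up to 64M before and 8M after, while a FLUSH_ALL_STEAL
reservation goes from 256M down to 8M.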

Reported-by: Aleksandar Gerasimovski <Aleksandar.Gerasimovski@belden.com>
Link: https://lore.kernel.org/linux-btrfs/SA1PR18MB56922F690C5EC2D85371408B998FA@SA1PR18MB5692.namprd18.prod.outlook.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/space-info.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index bb5aac7ee9d2..8192edf92d26 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -489,10 +489,11 @@ static u64 calc_available_free_space(const struct btrfs_space_info *space_info,
 	/*
 	 * If we aren't flushing all things, let us overcommit up to
 	 * 1/2th of the space. If we can flush, don't let us overcommit
-	 * too much, let it overcommit up to 1/8 of the space.
+	 * too much, let it overcommit up to 1/64th of the space.
 	 */
-	if (flush == BTRFS_RESERVE_FLUSH_ALL)
-		avail >>= 3;
+	if (flush == BTRFS_RESERVE_FLUSH_ALL ||
+	    flush == BTRFS_RESERVE_FLUSH_ALL_STEAL)
+		avail >>= 6;
 	else
 		avail >>= 1;
 
-- 
2.47.2



* [PATCH 2/3] btrfs: don't allow log trees to consume global reserve or overcommit metadata
  2026-02-03 13:02 [PATCH 0/3] btrfs: a few space reservation fixes and comment update fdmanana
  2026-02-03 13:02 ` [PATCH 1/3] btrfs: be less aggressive with metadata overcommit when we can do full flushing fdmanana
@ 2026-02-03 13:02 ` fdmanana
  2026-02-03 19:52   ` Leo Martins
  2026-02-03 13:02 ` [PATCH 3/3] btrfs: update comment for BTRFS_RESERVE_NO_FLUSH fdmanana
  2026-02-03 23:38 ` [PATCH v2 0/3] btrfs: a few space reservation fixes and comment update fdmanana
  3 siblings, 1 reply; 16+ messages in thread
From: fdmanana @ 2026-02-03 13:02 UTC
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

For an fsync we never reserve space in advance: we just start a
transaction without reserving space and we use an empty block reserve for
a log tree. We reserve space as needed while updating a log tree, so we
end up in btrfs_use_block_rsv() when reserving space for the allocation of
a log tree extent buffer. There we first attempt to reserve without
flushing, and if that fails we attempt to consume from the global reserve
or overcommit metadata. This makes us consume space that may be the last
resort for a transaction commit to succeed, therefore increasing the
chances of a transaction abort with -ENOSPC.

So make btrfs_use_block_rsv() fail if we can't reserve metadata space for
a log tree extent buffer allocation without flushing. This makes the fsync
fall back to a transaction commit and avoids using space that could be the
last resort for a transaction commit to succeed when we are in a critical
space situation.
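
The resulting fallback order can be sketched as a small userspace model.
This is an illustration only: the struct, enum and function names are
hypothetical, and the three booleans stand for whether each real
reservation attempt (no flush, global reserve, emergency flush) would
succeed.

```c
#include <stdbool.h>

/* Where the space for a tree block ended up coming from, if anywhere. */
enum rsv_result {
	RSV_FROM_NO_FLUSH,
	RSV_FROM_GLOBAL,
	RSV_FROM_EMERGENCY,
	RSV_ENOSPC,
};

/* Hypothetical inputs: outcome each reservation attempt would have. */
struct rsv_state {
	bool no_flush_ok;
	bool global_ok;
	bool emergency_ok;
};

static enum rsv_result use_block_rsv_model(bool is_log_tree,
					   const struct rsv_state *s)
{
	if (s->no_flush_ok)
		return RSV_FROM_NO_FLUSH;

	/* New check: log trees stop here, so the fsync falls back to a commit. */
	if (is_log_tree)
		return RSV_ENOSPC;

	if (s->global_ok)
		return RSV_FROM_GLOBAL;
	if (s->emergency_ok)
		return RSV_FROM_EMERGENCY;
	return RSV_ENOSPC;
}
```

With the no-flush attempt failing and the global reserve available, a
regular tree still gets RSV_FROM_GLOBAL while a log tree gets RSV_ENOSPC.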

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/block-rsv.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
index e823230c09b7..fe81d9e9f08c 100644
--- a/fs/btrfs/block-rsv.c
+++ b/fs/btrfs/block-rsv.c
@@ -540,6 +540,31 @@ struct btrfs_block_rsv *btrfs_use_block_rsv(struct btrfs_trans_handle *trans,
 					   BTRFS_RESERVE_NO_FLUSH);
 	if (!ret)
 		return block_rsv;
+
+	/*
+	 * If we are being used for updating a log tree, fail immediately, which
+	 * makes the fsync fallback to a transaction commit.
+	 *
+	 * We don't want to consume from the global block reserve, as that is
+	 * precious space that may be needed to do updates to some trees for
+	 * which we don't reserve space during a transaction commit (update root
+	 * items in the root tree, device stat items in the device tree and
+	 * quota tree updates, see btrfs_init_root_block_rsv()), or to fallback
+	 * to in case we did not reserve enough space to run delayed items,
+	 * delayed references, or anything else we need in order to avoid a
+	 * transaction abort.
+	 *
+	 * We also don't want to do a reservation in flush emergency mode, as
+	 * we end up using metadata that could be critical to allow a
+	 * transaction to complete successfully and therefore increase the
+	 * chances for a transaction abort.
+	 *
+	 * Log trees are an optimization and should never consume from the
+	 * global reserve or be allowed overcommitting metadata.
+	 */
+	if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID)
+		return ERR_PTR(ret);
+
 	/*
 	 * If we couldn't reserve metadata bytes try and use some from
 	 * the global reserve if its space type is the same as the global
-- 
2.47.2



* [PATCH 3/3] btrfs: update comment for BTRFS_RESERVE_NO_FLUSH
  2026-02-03 13:02 [PATCH 0/3] btrfs: a few space reservation fixes and comment update fdmanana
  2026-02-03 13:02 ` [PATCH 1/3] btrfs: be less aggressive with metadata overcommit when we can do full flushing fdmanana
  2026-02-03 13:02 ` [PATCH 2/3] btrfs: don't allow log trees to consume global reserve or overcommit metadata fdmanana
@ 2026-02-03 13:02 ` fdmanana
  2026-02-03 23:38 ` [PATCH v2 0/3] btrfs: a few space reservation fixes and comment update fdmanana
  3 siblings, 0 replies; 16+ messages in thread
From: fdmanana @ 2026-02-03 13:02 UTC
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

The comment is incomplete as BTRFS_RESERVE_NO_FLUSH is used for more
reasons than currently holding a transaction handle open. Update the
comment with all the other reasons and give some details.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/space-info.h | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index 0703f24b23f7..6f96cf48d7da 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -21,7 +21,24 @@ struct btrfs_block_group;
  * The higher the level, the more methods we try to reclaim space.
  */
 enum btrfs_reserve_flush_enum {
-	/* If we are in the transaction, we can't flush anything.*/
+	/*
+	 * Used when we can't flush or don't need:
+	 *
+	 * 1) We are holding a transaction handle open, so we can't flush as
+	 *    that could deadlock.
+	 *
+	 * 2) For a nowait write we don't want to block when reserving delalloc.
+	 *
+	 * 3) Joining a transaction or attaching a transaction, we don't want
+	 *    to wait and we don't need to reserve anything (any needed space
+	 *    was reserved before in a dedicated block reserve, or we rely on
+	 *    the global block reserve, see btrfs_init_root_block_rsv()).
+	 *
+	 * 4) Starting a transaction when we don't need to reserve space, as
+	 *    we don't need it because we previously reserved in a dedicated
+	 *    block reserve or rely on the global block reserve, like the above
+	 *    case.
+	 */
 	BTRFS_RESERVE_NO_FLUSH,
 
 	/*
-- 
2.47.2



* Re: [PATCH 2/3] btrfs: don't allow log trees to consume global reserve or overcommit metadata
  2026-02-03 13:02 ` [PATCH 2/3] btrfs: don't allow log trees to consume global reserve or overcommit metadata fdmanana
@ 2026-02-03 19:52   ` Leo Martins
  0 siblings, 0 replies; 16+ messages in thread
From: Leo Martins @ 2026-02-03 19:52 UTC
  To: fdmanana; +Cc: linux-btrfs

On Tue,  3 Feb 2026 13:02:32 +0000 fdmanana@kernel.org wrote:

> From: Filipe Manana <fdmanana@suse.com>
> 
> For a fsync we never reserve space in advance, we just start a transaction
> without reserving space and we use an empty block reserve for a log tree.
> We reserve space as we need while updating a log tree, we end up in
> btrfs_use_block_rsv() when reserving space for the allocation of a log
> tree extent buffer and we attempt first to reserve without flushing,
> and if that fails we attempt to consume from the global reserve or
> overcommit metadata. This makes us consume space that may be the last
> resort for a transaction commit to succeed, therefore increasing the
> chances for a transaction abort with -ENOSPC.
> 
> So make btrfs_use_block_rsv() fail if we can't reserve metadata space for
> a log tree extent buffer allocation without flushing, making the fsync
> fallback to a transaction commit and avoid using critical space that could
> be the only resort for a transaction commit to succeed when we are in a
> critical space situation.

I agree. I thought it might be an interesting idea to use an
allowlist instead of a blocklist, to be extra explicit about who is
able to use the global block reserve, but it looks like the log tree
is unique in its ability to fall back after failing to reserve.

Reviewed-by: Leo Martins <loemra.dev@gmail.com>

> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/block-rsv.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
> index e823230c09b7..fe81d9e9f08c 100644
> --- a/fs/btrfs/block-rsv.c
> +++ b/fs/btrfs/block-rsv.c
> @@ -540,6 +540,31 @@ struct btrfs_block_rsv *btrfs_use_block_rsv(struct btrfs_trans_handle *trans,
>  					   BTRFS_RESERVE_NO_FLUSH);
>  	if (!ret)
>  		return block_rsv;
> +
> +	/*
> +	 * If we are being used for updating a log tree, fail immediately, which
> +	 * makes the fsync fallback to a transaction commit.
> +	 *
> +	 * We don't want to consume from the global block reserve, as that is
> +	 * precious space that may be needed to do updates to some trees for
> +	 * which we don't reserve space during a transaction commit (update root
> +	 * items in the root tree, device stat items in the device tree and
> +	 * quota tree updates, see btrfs_init_root_block_rsv()), or to fallback
> +	 * to in case we did not reserve enough space to run delayed items,
> +	 * delayed references, or anything else we need in order to avoid a
> +	 * transaction abort.
> +	 *
> +	 * We also don't want to do a reservation in flush emergency mode, as
> +	 * we end up using metadata that could be critical to allow a
> +	 * transaction to complete successfully and therefore increase the
> +	 * chances for a transaction abort.
> +	 *
> +	 * Log trees are an optimization and should never consume from the
> +	 * global reserve or be allowed overcommitting metadata.
> +	 */
> +	if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID)
> +		return ERR_PTR(ret);
> +
>  	/*
>  	 * If we couldn't reserve metadata bytes try and use some from
>  	 * the global reserve if its space type is the same as the global
> -- 
> 2.47.2


* Re: [PATCH 1/3] btrfs: be less aggressive with metadata overcommit when we can do full flushing
  2026-02-03 13:02 ` [PATCH 1/3] btrfs: be less aggressive with metadata overcommit when we can do full flushing fdmanana
@ 2026-02-03 21:02   ` Qu Wenruo
  2026-02-03 21:46     ` Filipe Manana
  0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2026-02-03 21:02 UTC
  To: fdmanana, linux-btrfs



On 2026/2/3 23:32, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Over the years we often get reports of some -ENOSPC failure while updating
> metadata that leads to a transaction abort. I have seen this happen for
> filesystems of all sizes and with workloads that are very user/customer
> specific and unable to reproduce, but Aleksandar recently reported a
> simple way to reproduce this with a 1G filesystem and using the bonnie++
> benchmark tool. The following test script reproduces the failure:
> 
>      $ cat test.sh
>      #!/bin/bash
> 
>      # Create and use a 1G null block device, memory backed, otherwise
>      # the test takes a very long time.
>      modprobe null_blk nr_devices="0"
>      null_dev="/sys/kernel/config/nullb/nullb0"
>      mkdir "$null_dev"
>      size=$((1 * 1024)) # in MB
>      echo 2 > "$null_dev/submit_queues"
>      echo "$size" > "$null_dev/size"
>      echo 1 > "$null_dev/memory_backed"
>      echo 1 > "$null_dev/discard"
>      echo 1 > "$null_dev/power"
> 
>      DEV=/dev/nullb0
>      MNT=/mnt/nullb0
> 
>      mkfs.btrfs -f $DEV
>      mount $DEV $MNT
> 
>      mkdir $MNT/test/
>      bonnie++ -d $MNT/test/ -m BTRFS -u 0 -s 256M -r 128M -b
> 
>      umount $MNT
> 
>      echo 0 > "$null_dev/power"
>      rmdir "$null_dev"
> 
> When running this bonnie++ fails in the phase where it deletes test
> directories and files:
> 
>      $ ./test.sh
>      (...)
>      Using uid:0, gid:0.
>      Writing a byte at a time...done
>      Writing intelligently...done
>      Rewriting...done
>      Reading a byte at a time...done
>      Reading intelligently...done
>      start 'em...done...done...done...done...done...
>      Create files in sequential order...done.
>      Stat files in sequential order...done.
>      Delete files in sequential order...done.
>      Create files in random order...done.
>      Stat files in random order...done.
>      Delete files in random order...Can't sync directory, turning off dir-sync.
>      Can't delete file 9Bq7sr0000000338
>      Cleaning up test directory after error.
>      Bonnie: drastic I/O error (rmdir): Read-only file system
> 
> And in the syslog/dmesg we can see the following transaction abort trace:
> 
>      [161915.501506] BTRFS warning (device nullb0): Skipping commit of aborted transaction.
>      [161915.502983] ------------[ cut here ]------------
>      [161915.503832] BTRFS: Transaction aborted (error -28)
>      [161915.504748] WARNING: fs/btrfs/transaction.c:2045 at btrfs_commit_transaction+0xa21/0xd30 [btrfs], CPU#11: bonnie++/3377975
>      [161915.506786] Modules linked in: btrfs dm_zero dm_snapshot (...)
>      [161915.518759] CPU: 11 UID: 0 PID: 3377975 Comm: bonnie++ Tainted: G        W           6.19.0-rc7-btrfs-next-224+ #4 PREEMPT(full)
>      [161915.520857] Tainted: [W]=WARN
>      [161915.521405] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
>      [161915.523414] RIP: 0010:btrfs_commit_transaction+0xa24/0xd30 [btrfs]
>      [161915.524630] Code: 48 8b 7c 24 (...)
>      [161915.526982] RSP: 0018:ffffd3fe8206fda8 EFLAGS: 00010292
>      [161915.527707] RAX: 0000000000000002 RBX: ffff8f4886d3c000 RCX: 0000000000000000
>      [161915.528723] RDX: 0000000002040001 RSI: 00000000ffffffe4 RDI: ffffffffc088f780
>      [161915.529691] RBP: ffff8f4f5adae7e0 R08: 0000000000000000 R09: ffffd3fe8206fb90
>      [161915.530842] R10: ffff8f4f9c1fffa8 R11: 0000000000000003 R12: 00000000ffffffe4
>      [161915.532027] R13: ffff8f4ef2cf2400 R14: ffff8f4f5adae708 R15: ffff8f4f62d18000
>      [161915.533229] FS:  00007ff93112a780(0000) GS:ffff8f4ff63ee000(0000) knlGS:0000000000000000
>      [161915.534611] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>      [161915.535575] CR2: 00005571b3072000 CR3: 0000000176080005 CR4: 0000000000370ef0
>      [161915.536758] Call Trace:
>      [161915.537185]  <TASK>
>      [161915.537575]  btrfs_sync_file+0x431/0x530 [btrfs]
>      [161915.538473]  do_fsync+0x39/0x80
>      [161915.539042]  __x64_sys_fsync+0xf/0x20
>      [161915.539750]  do_syscall_64+0x50/0xf20
>      [161915.540396]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>      [161915.541301] RIP: 0033:0x7ff930ca49ee
>      [161915.541904] Code: 08 0f 85 f5 (...)
>      [161915.544830] RSP: 002b:00007ffd94291f38 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
>      [161915.546152] RAX: ffffffffffffffda RBX: 00007ff93112a780 RCX: 00007ff930ca49ee
>      [161915.547263] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
>      [161915.548383] RBP: 0000000000000dab R08: 0000000000000000 R09: 0000000000000000
>      [161915.549853] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd94291fb0
>      [161915.551196] R13: 00007ffd94292350 R14: 0000000000000001 R15: 00007ffd94292340
>      [161915.552161]  </TASK>
>      [161915.552457] ---[ end trace 0000000000000000 ]---
>      [161915.553232] BTRFS info (device nullb0 state A): dumping space info:
>      [161915.553236] BTRFS info (device nullb0 state A): space_info DATA (sub-group id 0) has 12582912 free, is not full
>      [161915.553239] BTRFS info (device nullb0 state A): space_info total=12582912, used=0, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
>      [161915.553243] BTRFS info (device nullb0 state A): space_info METADATA (sub-group id 0) has -5767168 free, is full
>      [161915.553245] BTRFS info (device nullb0 state A): space_info total=53673984, used=6635520, pinned=46956544, reserved=16384, may_use=5767168, readonly=65536 zone_unusable=0
>      [161915.553251] BTRFS info (device nullb0 state A): space_info SYSTEM (sub-group id 0) has 8355840 free, is not full
>      [161915.553254] BTRFS info (device nullb0 state A): space_info total=8388608, used=16384, pinned=16384, reserved=0, may_use=0, readonly=0 zone_unusable=0
>      [161915.553257] BTRFS info (device nullb0 state A): global_block_rsv: size 5767168 reserved 5767168
>      [161915.553261] BTRFS info (device nullb0 state A): trans_block_rsv: size 0 reserved 0
>      [161915.553263] BTRFS info (device nullb0 state A): chunk_block_rsv: size 0 reserved 0
>      [161915.553265] BTRFS info (device nullb0 state A): remap_block_rsv: size 0 reserved 0
>      [161915.553268] BTRFS info (device nullb0 state A): delayed_block_rsv: size 0 reserved 0
>      [161915.553270] BTRFS info (device nullb0 state A): delayed_refs_rsv: size 0 reserved 0
>      [161915.553272] BTRFS: error (device nullb0 state A) in cleanup_transaction:2045: errno=-28 No space left
>      [161915.554463] BTRFS info (device nullb0 state EA): forced readonly
> 
> The problem is that we allow for a very aggressive metadata overcommit,
> about 1/8th of the currently available space, even when the task
> attempting the reservation allows for full flushing. Over time this allows
> more and more tasks to overcommit without getting a transaction commit to
> release pinned extents, joining the same transaction and eventually lead
> to the transaction abort when attempting some tree update, as the extent
> allocator is not able to find any available metadata extent and it's not
> able to allocate a new metadata block group either (not enough unallocated
> space for that).

I'm a little curious about why we are unable to allocate a metadata bg.

Both the original report and your backtrace show only very small
data/metadata/system space infos.

Data is only 12MiB, metadata is around 52MiB and system is 8MiB; even
with DUP for metadata and system, they are still very tiny.
(They add up to less than 128MiB, vs 1GiB of device size.)


So I'm wondering whether there is some other reason, e.g. that at certain
points we're not allowed to allocate new bgs?

Thanks,
Qu

> 
> Fix this by allowing the overcommit to be up to 1/64th of the available
> (unallocated) space instead and for that limit to apply to both types of
> full flushing, BTRFS_RESERVE_FLUSH_ALL and BTRFS_RESERVE_FLUSH_ALL_STEAL.
> This way we get more frequent transaction commits to release pinned
> extents in case our caller is in a context where full flushing is allowed.
> 
> Reported-by: Aleksandar Gerasimovski <Aleksandar.Gerasimovski@belden.com>
> Link: https://lore.kernel.org/linux-btrfs/SA1PR18MB56922F690C5EC2D85371408B998FA@SA1PR18MB5692.namprd18.prod.outlook.com/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/space-info.c | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index bb5aac7ee9d2..8192edf92d26 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -489,10 +489,11 @@ static u64 calc_available_free_space(const struct btrfs_space_info *space_info,
>   	/*
>   	 * If we aren't flushing all things, let us overcommit up to
>   	 * 1/2th of the space. If we can flush, don't let us overcommit
> -	 * too much, let it overcommit up to 1/8 of the space.
> +	 * too much, let it overcommit up to 1/64th of the space.
>   	 */
> -	if (flush == BTRFS_RESERVE_FLUSH_ALL)
> -		avail >>= 3;
> +	if (flush == BTRFS_RESERVE_FLUSH_ALL ||
> +	    flush == BTRFS_RESERVE_FLUSH_ALL_STEAL)
> +		avail >>= 6;
>   	else
>   		avail >>= 1;
>   



* Re: [PATCH 1/3] btrfs: be less aggressive with metadata overcommit when we can do full flushing
  2026-02-03 21:02   ` Qu Wenruo
@ 2026-02-03 21:46     ` Filipe Manana
  2026-02-03 21:59       ` Qu Wenruo
  0 siblings, 1 reply; 16+ messages in thread
From: Filipe Manana @ 2026-02-03 21:46 UTC
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Feb 3, 2026 at 9:02 PM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> On 2026/2/3 23:32, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Over the years we often get reports of some -ENOSPC failure while updating
> > metadata that leads to a transaction abort. I have seen this happen for
> > filesystems of all sizes and with workloads that are very user/customer
> > specific and unable to reproduce, but Aleksandar recently reported a
> > simple way to reproduce this with a 1G filesystem and using the bonnie++
> > benchmark tool. The following test script reproduces the failure:
> >
> >      $ cat test.sh
> >      #!/bin/bash
> >
> >      # Create and use a 1G null block device, memory backed, otherwise
> >      # the test takes a very long time.
> >      modprobe null_blk nr_devices="0"
> >      null_dev="/sys/kernel/config/nullb/nullb0"
> >      mkdir "$null_dev"
> >      size=$((1 * 1024)) # in MB
> >      echo 2 > "$null_dev/submit_queues"
> >      echo "$size" > "$null_dev/size"
> >      echo 1 > "$null_dev/memory_backed"
> >      echo 1 > "$null_dev/discard"
> >      echo 1 > "$null_dev/power"
> >
> >      DEV=/dev/nullb0
> >      MNT=/mnt/nullb0
> >
> >      mkfs.btrfs -f $DEV
> >      mount $DEV $MNT
> >
> >      mkdir $MNT/test/
> >      bonnie++ -d $MNT/test/ -m BTRFS -u 0 -s 256M -r 128M -b
> >
> >      umount $MNT
> >
> >      echo 0 > "$null_dev/power"
> >      rmdir "$null_dev"
> >
> > When running this bonnie++ fails in the phase where it deletes test
> > directories and files:
> >
> >      $ ./test.sh
> >      (...)
> >      Using uid:0, gid:0.
> >      Writing a byte at a time...done
> >      Writing intelligently...done
> >      Rewriting...done
> >      Reading a byte at a time...done
> >      Reading intelligently...done
> >      start 'em...done...done...done...done...done...
> >      Create files in sequential order...done.
> >      Stat files in sequential order...done.
> >      Delete files in sequential order...done.
> >      Create files in random order...done.
> >      Stat files in random order...done.
> >      Delete files in random order...Can't sync directory, turning off dir-sync.
> >      Can't delete file 9Bq7sr0000000338
> >      Cleaning up test directory after error.
> >      Bonnie: drastic I/O error (rmdir): Read-only file system
> >
> > And in the syslog/dmesg we can see the following transaction abort trace:
> >
> >      [161915.501506] BTRFS warning (device nullb0): Skipping commit of aborted transaction.
> >      [161915.502983] ------------[ cut here ]------------
> >      [161915.503832] BTRFS: Transaction aborted (error -28)
> >      [161915.504748] WARNING: fs/btrfs/transaction.c:2045 at btrfs_commit_transaction+0xa21/0xd30 [btrfs], CPU#11: bonnie++/3377975
> >      [161915.506786] Modules linked in: btrfs dm_zero dm_snapshot (...)
> >      [161915.518759] CPU: 11 UID: 0 PID: 3377975 Comm: bonnie++ Tainted: G        W           6.19.0-rc7-btrfs-next-224+ #4 PREEMPT(full)
> >      [161915.520857] Tainted: [W]=WARN
> >      [161915.521405] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
> >      [161915.523414] RIP: 0010:btrfs_commit_transaction+0xa24/0xd30 [btrfs]
> >      [161915.524630] Code: 48 8b 7c 24 (...)
> >      [161915.526982] RSP: 0018:ffffd3fe8206fda8 EFLAGS: 00010292
> >      [161915.527707] RAX: 0000000000000002 RBX: ffff8f4886d3c000 RCX: 0000000000000000
> >      [161915.528723] RDX: 0000000002040001 RSI: 00000000ffffffe4 RDI: ffffffffc088f780
> >      [161915.529691] RBP: ffff8f4f5adae7e0 R08: 0000000000000000 R09: ffffd3fe8206fb90
> >      [161915.530842] R10: ffff8f4f9c1fffa8 R11: 0000000000000003 R12: 00000000ffffffe4
> >      [161915.532027] R13: ffff8f4ef2cf2400 R14: ffff8f4f5adae708 R15: ffff8f4f62d18000
> >      [161915.533229] FS:  00007ff93112a780(0000) GS:ffff8f4ff63ee000(0000) knlGS:0000000000000000
> >      [161915.534611] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >      [161915.535575] CR2: 00005571b3072000 CR3: 0000000176080005 CR4: 0000000000370ef0
> >      [161915.536758] Call Trace:
> >      [161915.537185]  <TASK>
> >      [161915.537575]  btrfs_sync_file+0x431/0x530 [btrfs]
> >      [161915.538473]  do_fsync+0x39/0x80
> >      [161915.539042]  __x64_sys_fsync+0xf/0x20
> >      [161915.539750]  do_syscall_64+0x50/0xf20
> >      [161915.540396]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >      [161915.541301] RIP: 0033:0x7ff930ca49ee
> >      [161915.541904] Code: 08 0f 85 f5 (...)
> >      [161915.544830] RSP: 002b:00007ffd94291f38 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
> >      [161915.546152] RAX: ffffffffffffffda RBX: 00007ff93112a780 RCX: 00007ff930ca49ee
> >      [161915.547263] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
> >      [161915.548383] RBP: 0000000000000dab R08: 0000000000000000 R09: 0000000000000000
> >      [161915.549853] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd94291fb0
> >      [161915.551196] R13: 00007ffd94292350 R14: 0000000000000001 R15: 00007ffd94292340
> >      [161915.552161]  </TASK>
> >      [161915.552457] ---[ end trace 0000000000000000 ]---
> >      [161915.553232] BTRFS info (device nullb0 state A): dumping space info:
> >      [161915.553236] BTRFS info (device nullb0 state A): space_info DATA (sub-group id 0) has 12582912 free, is not full
> >      [161915.553239] BTRFS info (device nullb0 state A): space_info total=12582912, used=0, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
> >      [161915.553243] BTRFS info (device nullb0 state A): space_info METADATA (sub-group id 0) has -5767168 free, is full
> >      [161915.553245] BTRFS info (device nullb0 state A): space_info total=53673984, used=6635520, pinned=46956544, reserved=16384, may_use=5767168, readonly=65536 zone_unusable=0
> >      [161915.553251] BTRFS info (device nullb0 state A): space_info SYSTEM (sub-group id 0) has 8355840 free, is not full
> >      [161915.553254] BTRFS info (device nullb0 state A): space_info total=8388608, used=16384, pinned=16384, reserved=0, may_use=0, readonly=0 zone_unusable=0
> >      [161915.553257] BTRFS info (device nullb0 state A): global_block_rsv: size 5767168 reserved 5767168
> >      [161915.553261] BTRFS info (device nullb0 state A): trans_block_rsv: size 0 reserved 0
> >      [161915.553263] BTRFS info (device nullb0 state A): chunk_block_rsv: size 0 reserved 0
> >      [161915.553265] BTRFS info (device nullb0 state A): remap_block_rsv: size 0 reserved 0
> >      [161915.553268] BTRFS info (device nullb0 state A): delayed_block_rsv: size 0 reserved 0
> >      [161915.553270] BTRFS info (device nullb0 state A): delayed_refs_rsv: size 0 reserved 0
> >      [161915.553272] BTRFS: error (device nullb0 state A) in cleanup_transaction:2045: errno=-28 No space left
> >      [161915.554463] BTRFS info (device nullb0 state EA): forced readonly
> >
> > The problem is that we allow for a very aggressive metadata overcommit,
> > about 1/8th of the currently available space, even when the task
> > attempting the reservation allows for full flushing. Over time this allows
> > more and more tasks to overcommit and join the same transaction without
> > ever triggering a transaction commit to release pinned extents, which
> > eventually leads to a transaction abort when attempting some tree update,
> > as the extent allocator is not able to find any available metadata extent
> > and it's not able to allocate a new metadata block group either (not
> > enough unallocated space for that).
>
> I'm a little curious about why we are unable to allocate a metadata bg.
>
> Both the original report and your backtrace only show very small
> data/metadata/system space infos.
>
> Data is only 12M, metadata is around 52MiB, system is 8MiB, even with
> DUP for metadata and system, they are still very tiny.
> (Add up to less than 128MiB, vs 1GiB of the device size)
>
>
> Thus I'm wondering if it's some other reason, like at certain locations
> we're not allowed to allocate new bgs?

We can allocate when we attempt to allocate a metadata extent.
However here it fails because we really have no space:

at calc_available_free_space() we subtract the data chunk size, and
that leaves us at around 300M, which is not enough to allocate a
metadata chunk in DUP profile (256M * 2 = 512M).
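
For reference, calc_available_free_space() is also where the patch
quoted below changes the overcommit shift. A minimal sketch of that
arithmetic follows. The enum and function names are hypothetical
stand-ins, not the kernel's, and the 300MiB value is only an
illustrative input borrowed from the rough figure above:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's flush modes; only the two
 * full-flushing modes named in the patch are modeled explicitly. */
enum flush_mode {
	FLUSH_OTHER,      /* any mode that cannot do full flushing */
	FLUSH_ALL,        /* stands in for BTRFS_RESERVE_FLUSH_ALL */
	FLUSH_ALL_STEAL,  /* stands in for BTRFS_RESERVE_FLUSH_ALL_STEAL */
};

/* Limit before the patch: 1/8 for FLUSH_ALL, 1/2 for everything else. */
static uint64_t overcommit_limit_old(uint64_t avail, enum flush_mode flush)
{
	return (flush == FLUSH_ALL) ? avail >> 3 : avail >> 1;
}

/* Limit after the patch: 1/64 for both full-flushing modes, 1/2 otherwise. */
static uint64_t overcommit_limit_new(uint64_t avail, enum flush_mode flush)
{
	if (flush == FLUSH_ALL || flush == FLUSH_ALL_STEAL)
		return avail >> 6;
	return avail >> 1;
}

/* Worked numbers for an illustrative avail of 300MiB. */
static void overcommit_limit_example(void)
{
	const uint64_t avail = 300ULL << 20;

	/* Before: 37.5MiB for FLUSH_ALL, 150MiB for FLUSH_ALL_STEAL. */
	assert(overcommit_limit_old(avail, FLUSH_ALL) == 39321600ULL);
	assert(overcommit_limit_old(avail, FLUSH_ALL_STEAL) == 157286400ULL);

	/* After: about 4.7MiB for both full-flushing modes. */
	assert(overcommit_limit_new(avail, FLUSH_ALL) == 4915200ULL);
	assert(overcommit_limit_new(avail, FLUSH_ALL_STEAL) == 4915200ULL);
}
```

With these illustrative inputs the allowed overcommit for a task that
can do full flushing drops by a factor of 8, and FLUSH_ALL_STEAL no
longer falls through to the 1/2 case.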

>
> Thanks,
> Qu
>
> >
> > Fix this by allowing the overcommit to be up to 1/64th of the available
> > (unallocated) space instead and for that limit to apply to both types of
> > full flushing, BTRFS_RESERVE_FLUSH_ALL and BTRFS_RESERVE_FLUSH_ALL_STEAL.
> > This way we get more frequent transaction commits to release pinned
> > extents in case our caller is in a context where full flushing is allowed.
> >
> > Reported-by: Aleksandar Gerasimovski <Aleksandar.Gerasimovski@belden.com>
> > Link: https://lore.kernel.org/linux-btrfs/SA1PR18MB56922F690C5EC2D85371408B998FA@SA1PR18MB5692.namprd18.prod.outlook.com/
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >   fs/btrfs/space-info.c | 7 ++++---
> >   1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> > index bb5aac7ee9d2..8192edf92d26 100644
> > --- a/fs/btrfs/space-info.c
> > +++ b/fs/btrfs/space-info.c
> > @@ -489,10 +489,11 @@ static u64 calc_available_free_space(const struct btrfs_space_info *space_info,
> >       /*
> >        * If we aren't flushing all things, let us overcommit up to
> >        * 1/2th of the space. If we can flush, don't let us overcommit
> > -      * too much, let it overcommit up to 1/8 of the space.
> > +      * too much, let it overcommit up to 1/64th of the space.
> >        */
> > -     if (flush == BTRFS_RESERVE_FLUSH_ALL)
> > -             avail >>= 3;
> > +     if (flush == BTRFS_RESERVE_FLUSH_ALL ||
> > +         flush == BTRFS_RESERVE_FLUSH_ALL_STEAL)
> > +             avail >>= 6;
> >       else
> >               avail >>= 1;
> >
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] btrfs: be less agressive with metadata overcommit when we can do full flushing
  2026-02-03 21:46     ` Filipe Manana
@ 2026-02-03 21:59       ` Qu Wenruo
  2026-02-03 22:55         ` Filipe Manana
  0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2026-02-03 21:59 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs



On 2026/2/4 08:16, Filipe Manana wrote:
[...]
> 
> We can allocate when we attempt to allocate a metadata extent.
> However here it fails because we really have no space:
> 
> at calc_available_free_space() we subtract the data chunk size, and
> that leaves us at around 300M, which is not enough to allocate a
> metadata chunk in DUP profile (256M * 2 = 512M).
> 

For a 1GB sized fs, 300MiB is enough for us to allocate a new metadata bg,
as the chunk size will be no larger than 10% of the fs.

In fact I just tried this on a 1GB btrfs, forcing the creation of a new
metadata bg by filling up the initial 51MiB metadata bg.

The resulting chunk size is 112MiB:

	item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
		devid 1 total_bytes 1073741824 bytes_used 367394816
                                     ^^^ 1GiB

	item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 13631488) itemoff 16105 itemsize 80
		length 8388608 owner 2 stripe_len 65536 type DATA|single
	item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15993 itemsize 112
		length 8388608 owner 2 stripe_len 65536 type SYSTEM|DUP
	item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15881 itemsize 112
		length 53673984 owner 2 stripe_len 65536 type METADATA|DUP
                        ^^^ The one from mkfs
	item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 84082688) itemoff 15769 itemsize 112
		length 117440512 owner 2 stripe_len 65536 type METADATA|DUP
                        ^^^ The new one, 112MiB.
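
The 112MiB figure is consistent with a 10% cap that is then rounded up
to a 16MiB boundary. The sketch below is a simplified model under that
assumption, not the kernel code: the helper names are hypothetical, it
assumes a single data stripe, and it ignores how much space is actually
available on the device:

```c
#include <assert.h>
#include <stdint.h>

#define SZ_16M  (16ULL << 20)
#define SZ_256M (256ULL << 20)

/* Round v up to the next multiple of align (align must be non-zero). */
static uint64_t round_up_u64(uint64_t v, uint64_t align)
{
	assert(align != 0);
	return (v + align - 1) / align * align;
}

/*
 * Simplified, assumed model: cap the chunk size at 10% of the filesystem
 * size (and at default_max), then round the result up to 16MiB.
 */
static uint64_t capped_chunk_size(uint64_t fs_bytes, uint64_t default_max)
{
	uint64_t cap = fs_bytes / 10;

	if (cap > default_max)
		cap = default_max;
	return round_up_u64(cap, SZ_16M);
}
```

For total_bytes 1073741824 and an assumed 256MiB default maximum, 10%
is 107374182 bytes, and rounding that up to a multiple of 16MiB gives
117440512, the length of item 4 above.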

Would you mind explaining where the 256MiB requirement comes from?

Thanks,
Qu

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] btrfs: be less agressive with metadata overcommit when we can do full flushing
  2026-02-03 21:59       ` Qu Wenruo
@ 2026-02-03 22:55         ` Filipe Manana
  2026-02-03 23:04           ` Qu Wenruo
  2026-02-03 23:06           ` Filipe Manana
  0 siblings, 2 replies; 16+ messages in thread
From: Filipe Manana @ 2026-02-03 22:55 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Feb 3, 2026 at 9:59 PM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> On 2026/2/4 08:16, Filipe Manana wrote:
> [...]
> >
> > We can allocate when we attempt to allocate a metadata extent.
> > However here it fails because we really have no space:
> >
> > at calc_available_free_space() we subtract the data chunk size, and
> > that leaves us at around 300M, which is not enough to allocate a
> > metadata chunk in DUP profile (256M * 2 = 512M).
> >
>
> For a 1GB sized fs, 300MiB is enough for us to allocate a new metadata bg,
> as the chunk size will be no larger than 10% of the fs.
>
> In fact I just tried this on a 1GB btrfs, forcing the creation of a new
> metadata bg by filling up the initial 51MiB metadata bg.
>
> The resulting chunk size is 112MiB:
>
>         item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
>                 devid 1 total_bytes 1073741824 bytes_used 367394816
>                                      ^^^ 1GiB
>
>         item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 13631488) itemoff 16105 itemsize 80
>                 length 8388608 owner 2 stripe_len 65536 type DATA|single
>         item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15993 itemsize 112
>                 length 8388608 owner 2 stripe_len 65536 type SYSTEM|DUP
>         item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15881 itemsize 112
>                 length 53673984 owner 2 stripe_len 65536 type METADATA|DUP
>                         ^^^ The one from mkfs
>         item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 84082688) itemoff 15769 itemsize 112
>                 length 117440512 owner 2 stripe_len 65536 type METADATA|DUP
>                         ^^^ The new one, 112MiB.
>
> Would you mind explaining where the 256MiB requirement comes from?

So I was looking at an old trace before.

We fail to allocate a chunk because there's effectively no unallocated
space at some point.

We have a bunch of data chunks allocated by bonnie++ and we reach a
point where a metadata chunk allocation fails.
During the first metadata chunk allocation attempt,
gather_device_info() finds no available space due to a bunch of
pending chunks for data block groups (each with a size of 117440512
bytes, except for the last two).

The tracing:

           mount-1793735 [011] ...1. 28877.261096:
btrfs_add_bg_to_space_info: added bg offset 13631488 length 8388608
flags 1 to space_info->flags 1 total_bytes 8388608 bytes_used 0
bytes_may_use 0
           mount-1793735 [011] ...1. 28877.261098:
btrfs_add_bg_to_space_info: added bg offset 22020096 length 8388608
flags 34 to space_info->flags 2 total_bytes 8388608 bytes_used 16384
bytes_may_use 0
           mount-1793735 [011] ...1. 28877.261100:
btrfs_add_bg_to_space_info: added bg offset 30408704 length 53673984
flags 36 to space_info->flags 4 total_bytes 53673984 bytes_used 131072
bytes_may_use 0

These are from loading the block groups created by mkfs during mount.

Then when bonnie++ starts doing its thing:

   kworker/u48:5-1792004 [011] ..... 28886.122050: btrfs_create_chunk:
gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
1073741824
   kworker/u48:5-1792004 [011] ..... 28886.122053: btrfs_create_chunk:
gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
1073741824 max_avail 927596544
   kworker/u48:5-1792004 [011] ..... 28886.122055:
btrfs_make_block_group: make bg offset 84082688 size 117440512 type 1
   kworker/u48:5-1792004 [011] ...1. 28886.122064:
btrfs_add_bg_to_space_info: added bg offset 84082688 length 117440512
flags 1 to space_info->flags 1 total_bytes 125829120 bytes_used 0
bytes_may_use 5251072

First allocation of a data block group of 112M.

   kworker/u48:5-1792004 [011] ..... 28886.192408: btrfs_create_chunk:
gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
1073741824
   kworker/u48:5-1792004 [011] ..... 28886.192413: btrfs_create_chunk:
gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
1073741824 max_avail 810156032
   kworker/u48:5-1792004 [011] ..... 28886.192415:
btrfs_make_block_group: make bg offset 201523200 size 117440512 type 1
   kworker/u48:5-1792004 [011] ...1. 28886.192425:
btrfs_add_bg_to_space_info: added bg offset 201523200 length 117440512
flags 1 to space_info->flags 1 total_bytes 243269632 bytes_used 0
bytes_may_use 122691584

Another 112M data block group allocated.

   kworker/u48:5-1792004 [011] ..... 28886.260935: btrfs_create_chunk:
gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
1073741824
   kworker/u48:5-1792004 [011] ..... 28886.260941: btrfs_create_chunk:
gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
1073741824 max_avail 692715520
   kworker/u48:5-1792004 [011] ..... 28886.260943:
btrfs_make_block_group: make bg offset 318963712 size 117440512 type 1
   kworker/u48:5-1792004 [011] ...1. 28886.260954:
btrfs_add_bg_to_space_info: added bg offset 318963712 length 117440512
flags 1 to space_info->flags 1 total_bytes 360710144 bytes_used 0
bytes_may_use 240132096

Yet another one.

        bonnie++-1793755 [010] ..... 28886.280407: btrfs_create_chunk:
gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
1073741824
        bonnie++-1793755 [010] ..... 28886.280412: btrfs_create_chunk:
gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
1073741824 max_avail 575275008
        bonnie++-1793755 [010] ..... 28886.280414:
btrfs_make_block_group: make bg offset 436404224 size 117440512 type 1
        bonnie++-1793755 [010] ...1. 28886.280419:
btrfs_add_bg_to_space_info: added bg offset 436404224 length 117440512
flags 1 to space_info->flags 1 total_bytes 478150656 bytes_used 0
bytes_may_use 268435456

One more.

   kworker/u48:5-1792004 [011] ..... 28886.566233: btrfs_create_chunk:
gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
1073741824
   kworker/u48:5-1792004 [011] ..... 28886.566238: btrfs_create_chunk:
gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
1073741824 max_avail 457834496
   kworker/u48:5-1792004 [011] ..... 28886.566241:
btrfs_make_block_group: make bg offset 553844736 size 117440512 type 1
   kworker/u48:5-1792004 [011] ...1. 28886.566250:
btrfs_add_bg_to_space_info: added bg offset 553844736 length 117440512
flags 1 to space_info->flags 1 total_bytes 595591168 bytes_used
268435456 bytes_may_use 209723392

Another one.

        bonnie++-1793755 [009] ..... 28886.613446: btrfs_create_chunk:
gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
1073741824
        bonnie++-1793755 [009] ..... 28886.613451: btrfs_create_chunk:
gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
1073741824 max_avail 340393984
        bonnie++-1793755 [009] ..... 28886.613453:
btrfs_make_block_group: make bg offset 671285248 size 117440512 type 1
        bonnie++-1793755 [009] ...1. 28886.613458:
btrfs_add_bg_to_space_info: added bg offset 671285248 length 117440512
flags 1 to space_info->flags 1 total_bytes 713031680 bytes_used
268435456 bytes_may_use 268435456

Another one.

        bonnie++-1793755 [009] ..... 28886.674953: btrfs_create_chunk:
gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
1073741824
        bonnie++-1793755 [009] ..... 28886.674957: btrfs_create_chunk:
gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
1073741824 max_avail 222953472
        bonnie++-1793755 [009] ..... 28886.674959:
btrfs_make_block_group: make bg offset 788725760 size 117440512 type 1
        bonnie++-1793755 [009] ...1. 28886.674963:
btrfs_add_bg_to_space_info: added bg offset 788725760 length 117440512
flags 1 to space_info->flags 1 total_bytes 830472192 bytes_used
268435456 bytes_may_use 134217728

Another one.

        bonnie++-1793755 [009] ..... 28886.674981: btrfs_create_chunk:
gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
1073741824
        bonnie++-1793755 [009] ..... 28886.674982: btrfs_create_chunk:
gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
1073741824 max_avail 105512960
        bonnie++-1793755 [009] ..... 28886.674983:
btrfs_make_block_group: make bg offset 906166272 size 105512960 type 1
        bonnie++-1793755 [009] ...1. 28886.674984:
btrfs_add_bg_to_space_info: added bg offset 906166272 length 105512960
flags 1 to space_info->flags 1 total_bytes 935985152 bytes_used
268435456 bytes_may_use 67108864

Another one, this time a bit smaller, ~100.6M, since we now have less space.

        bonnie++-1793758 [009] ..... 28891.962096: btrfs_create_chunk:
gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
1073741824
        bonnie++-1793758 [009] ..... 28891.962103: btrfs_create_chunk:
gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
1073741824 max_avail 12582912
        bonnie++-1793758 [009] ..... 28891.962105:
btrfs_make_block_group: make bg offset 1011679232 size 12582912 type 1
        bonnie++-1793758 [009] ...1. 28891.962114:
btrfs_add_bg_to_space_info: added bg offset 1011679232 length 12582912
flags 1 to space_info->flags 1 total_bytes 948568064 bytes_used
268435456 bytes_may_use 8192

Another one, this one even smaller, 12M.

   kworker/u48:5-1792004 [011] ..... 28892.112802: btrfs_chunk_alloc:
enter metadata chunk alloc
   kworker/u48:5-1792004 [011] ..... 28892.112805: btrfs_create_chunk:
gather_device_info 1 ctl->dev_extent_min = 131072 dev_extent_want
536870912
   kworker/u48:5-1792004 [011] ..... 28892.112806: btrfs_create_chunk:
gather_device_info 2 ctl->dev_extent_min = 131072 dev_extent_want
536870912 max_avail 0

536870912 is 512M, the 256M * 2 (DUP) thing.
max_avail is what find_free_dev_extent() returns to us in gather_device_info().

As a result it sets ctl->ndevs to 0, making decide_stripe_size() fail
with -ENOSPC, and therefore metadata chunk allocation fails.

   kworker/u48:5-1792004 [011] ..... 28892.112807: btrfs_create_chunk:
decide_stripe_size fail -ENOSPC
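
The numbers in the trace above can be cross-checked with a few lines of
arithmetic. This is only a sanity check on the values copied from the
trace, with hypothetical helper names. Reading max_avail as the largest
free device extent is an interpretation of the trace, not something the
sketch proves:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Data block group sizes created during the run, copied from the trace. */
static const uint64_t data_bg_len[] = {
	117440512, 117440512, 117440512, 117440512,
	117440512, 117440512, 117440512,
	105512960, 12582912,
};

/* Sum the first n entries of v. */
static uint64_t sum_u64(const uint64_t *v, size_t n)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum;
}

static void trace_replay_example(void)
{
	const size_t n = sizeof(data_bg_len) / sizeof(data_bg_len[0]);
	const uint64_t mkfs_data_bg = 8388608;      /* data bg loaded at mount */
	const uint64_t first_max_avail = 927596544; /* first max_avail in the trace */

	/* All data block groups add up to the final total_bytes value. */
	assert(mkfs_data_bg + sum_u64(data_bg_len, n) == 948568064ULL);

	/* The first eight allocations consume the initial free extent exactly. */
	assert(first_max_avail - sum_u64(data_bg_len, n - 1) == 0);

	/* The metadata request is 2 * 256MiB for DUP. */
	assert(2 * (256ULL << 20) == 536870912ULL);
}
```

So by the time the metadata chunk allocation is attempted, the data
block groups alone account for 948568064 bytes of the 1GiB device, and
the last, 12M, allocation appears to have come from a separate, smaller
free extent.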


Yes, what dmesg shows after the transaction abort does not include all
those allocations.
My guess is that after the transaction aborts, the pending block
groups are gone and that affects the dump. But that's another
thing to investigate.

But if we add a call to btrfs_dump_space_info_for_trans_abort() to
decide_stripe_size() when it returns -ENOSPC, before we have a
transaction abort:

[29972.409295] BTRFS info (device nullb0): dumping space info:
[29972.409300] BTRFS info (device nullb0): space_info DATA (sub-group
id 0) has 673341440 free, is not full
[29972.409303] BTRFS info (device nullb0): space_info total=948568064,
used=0, pinned=275226624, reserved=0, may_use=0, readonly=0
zone_unusable=0
[29972.409305] BTRFS info (device nullb0): space_info METADATA
(sub-group id 0) has 3915776 free, is not full
[29972.409306] BTRFS info (device nullb0): space_info total=53673984,
used=163840, pinned=42827776, reserved=147456, may_use=6553600,
readonly=65536 zone_unusable=0
[29972.409308] BTRFS info (device nullb0): space_info SYSTEM
(sub-group id 0) has 7979008 free, is not full
[29972.409310] BTRFS info (device nullb0): space_info total=8388608,
used=16384, pinned=0, reserved=0, may_use=393216, readonly=0
zone_unusable=0
[29972.409311] BTRFS info (device nullb0): global_block_rsv: size
5767168 reserved 5767168
[29972.409313] BTRFS info (device nullb0): trans_block_rsv: size 0 reserved 0
[29972.409314] BTRFS info (device nullb0): chunk_block_rsv: size
393216 reserved 393216
[29972.409315] BTRFS info (device nullb0): remap_block_rsv: size 0 reserved 0
[29972.409316] BTRFS info (device nullb0): delayed_block_rsv: size 0 reserved 0
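
The "free" values printed in the dump above, and in the dump from the
original report taken after the abort, can be reproduced by subtracting
every counter from the total. The struct below is a hypothetical mirror
of the printed fields, not the kernel's struct btrfs_space_info:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical struct mirroring the counters printed in the dump. */
struct space_info_sample {
	int64_t total, used, pinned, reserved, may_use, readonly, zone_unusable;
};

/* Signed on purpose: an overcommitted space_info yields a negative value. */
static int64_t free_bytes(const struct space_info_sample *s)
{
	return s->total - s->used - s->pinned - s->reserved -
	       s->may_use - s->readonly - s->zone_unusable;
}

static void space_dump_example(void)
{
	/* Values from the dump taken in decide_stripe_size(), before the abort. */
	const struct space_info_sample data = {
		.total = 948568064, .pinned = 275226624,
	};
	const struct space_info_sample meta = {
		.total = 53673984, .used = 163840, .pinned = 42827776,
		.reserved = 147456, .may_use = 6553600, .readonly = 65536,
	};
	const struct space_info_sample sys = {
		.total = 8388608, .used = 16384, .may_use = 393216,
	};
	/* METADATA values from the dump printed after the transaction abort. */
	const struct space_info_sample meta_after_abort = {
		.total = 53673984, .used = 6635520, .pinned = 46956544,
		.reserved = 16384, .may_use = 5767168, .readonly = 65536,
	};

	assert(free_bytes(&data) == 673341440);
	assert(free_bytes(&meta) == 3915776);
	assert(free_bytes(&sys) == 7979008);
	assert(free_bytes(&meta_after_abort) == -5767168);
}
```

In both METADATA dumps the pinned counter is by far the largest term
(42827776 and 46956544 out of 53673984), which is the space a
transaction commit would release.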

So here we see there's over 900M of space allocated to data block
groups (total=948568064), which leaves no unallocated space for a new
metadata chunk.

So lowering the metadata overcommit limit when we can flush helps
get rid of a ton of pinned space.

>
> Thanks,
> Qu

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] btrfs: be less agressive with metadata overcommit when we can do full flushing
  2026-02-03 22:55         ` Filipe Manana
@ 2026-02-03 23:04           ` Qu Wenruo
  2026-02-03 23:09             ` Filipe Manana
  2026-02-03 23:06           ` Filipe Manana
  1 sibling, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2026-02-03 23:04 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs



On 2026/2/4 09:25, Filipe Manana wrote:
> On Tue, Feb 3, 2026 at 9:59 PM Qu Wenruo <wqu@suse.com> wrote:
>>
>>
>>
>> On 2026/2/4 08:16, Filipe Manana wrote:
>> [...]
>>>
>>> We can allocate when we attempt to allocate a metadata extent.
>>> However here it fails because we really have no space:
>>>
>>> at calc_available_free_space() we subtract the data chunk size, and
>>> that leaves us at around 300M, which is not enough to allocate a
>>> metadata chunk in DUP profile (256M * 2 = 512M).
>>>
>>
>> For a 1GB sized fs, 300MiB is enough for us to allocate a new metadata bg,
>> as the chunk size will be no larger than 10% of the fs.
>>
>> In fact I just tried this on a 1GB btrfs, forcing the creation of a new
>> metadata bg by filling up the initial 51MiB metadata bg.
>>
>> The resulting chunk size is 112MiB:
>>
>>          item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
>>                  devid 1 total_bytes 1073741824 bytes_used 367394816
>>                                       ^^^ 1GiB
>>
>>          item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 13631488) itemoff 16105 itemsize 80
>>                  length 8388608 owner 2 stripe_len 65536 type DATA|single
>>          item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15993 itemsize 112
>>                  length 8388608 owner 2 stripe_len 65536 type SYSTEM|DUP
>>          item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15881 itemsize 112
>>                  length 53673984 owner 2 stripe_len 65536 type METADATA|DUP
>>                          ^^^ The one from mkfs
>>          item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 84082688) itemoff 15769 itemsize 112
>>                  length 117440512 owner 2 stripe_len 65536 type METADATA|DUP
>>                          ^^^ The new one, 112MiB.
>>
>> Would you mind explaining where the 256MiB requirement comes from?
> 
> So I was looking at an old trace before.
> 
> We fail to allocate a chunk because there's effectively no unallocated
> space at some point.
> 
> We have a bunch of data chunks allocated by bonnie++ and we reach a
> point where a metadata chunk allocation fails.
> During the first metadata chunk allocation attempt,
> gather_device_info() finds no available space due to a bunch of
> pending chunks for data block groups (each with a size of 117440512
> bytes, except for the last two).
> 
> The tracing:
> 
>             mount-1793735 [011] ...1. 28877.261096:
> btrfs_add_bg_to_space_info: added bg offset 13631488 length 8388608
> flags 1 to space_info->flags 1 total_bytes 8388608 bytes_used 0
> bytes_may_use 0
>             mount-1793735 [011] ...1. 28877.261098:
> btrfs_add_bg_to_space_info: added bg offset 22020096 length 8388608
> flags 34 to space_info->flags 2 total_bytes 8388608 bytes_used 16384
> bytes_may_use 0
>             mount-1793735 [011] ...1. 28877.261100:
> btrfs_add_bg_to_space_info: added bg offset 30408704 length 53673984
> flags 36 to space_info->flags 4 total_bytes 53673984 bytes_used 131072
> bytes_may_use 0
> 
> These are from loading the block groups created by mkfs during mount.
> 
> Then when bonnie++ starts doing its thing:
> 
>     kworker/u48:5-1792004 [011] ..... 28886.122050: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
>     kworker/u48:5-1792004 [011] ..... 28886.122053: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 927596544
>     kworker/u48:5-1792004 [011] ..... 28886.122055:
> btrfs_make_block_group: make bg offset 84082688 size 117440512 type 1
>     kworker/u48:5-1792004 [011] ...1. 28886.122064:
> btrfs_add_bg_to_space_info: added bg offset 84082688 length 117440512
> flags 1 to space_info->flags 1 total_bytes 125829120 bytes_used 0
> bytes_may_use 5251072
> 
> First allocation of a data block group of 112M.
> 
>     kworker/u48:5-1792004 [011] ..... 28886.192408: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
>     kworker/u48:5-1792004 [011] ..... 28886.192413: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 810156032
>     kworker/u48:5-1792004 [011] ..... 28886.192415:
> btrfs_make_block_group: make bg offset 201523200 size 117440512 type 1
>     kworker/u48:5-1792004 [011] ...1. 28886.192425:
> btrfs_add_bg_to_space_info: added bg offset 201523200 length 117440512
> flags 1 to space_info->flags 1 total_bytes 243269632 bytes_used 0
> bytes_may_use 122691584
> 
> Another 112M data block group allocated.
> 
>     kworker/u48:5-1792004 [011] ..... 28886.260935: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
>     kworker/u48:5-1792004 [011] ..... 28886.260941: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 692715520
>     kworker/u48:5-1792004 [011] ..... 28886.260943:
> btrfs_make_block_group: make bg offset 318963712 size 117440512 type 1
>     kworker/u48:5-1792004 [011] ...1. 28886.260954:
> btrfs_add_bg_to_space_info: added bg offset 318963712 length 117440512
> flags 1 to space_info->flags 1 total_bytes 360710144 bytes_used 0
> bytes_may_use 240132096
> 
> Yet another one.
> 
>          bonnie++-1793755 [010] ..... 28886.280407: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
>          bonnie++-1793755 [010] ..... 28886.280412: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 575275008
>          bonnie++-1793755 [010] ..... 28886.280414:
> btrfs_make_block_group: make bg offset 436404224 size 117440512 type 1
>          bonnie++-1793755 [010] ...1. 28886.280419:
> btrfs_add_bg_to_space_info: added bg offset 436404224 length 117440512
> flags 1 to space_info->flags 1 total_bytes 478150656 bytes_used 0
> bytes_may_use 268435456
> 
> One more.
> 
>     kworker/u48:5-1792004 [011] ..... 28886.566233: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
>     kworker/u48:5-1792004 [011] ..... 28886.566238: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 457834496
>     kworker/u48:5-1792004 [011] ..... 28886.566241:
> btrfs_make_block_group: make bg offset 553844736 size 117440512 type 1
>     kworker/u48:5-1792004 [011] ...1. 28886.566250:
> btrfs_add_bg_to_space_info: added bg offset 553844736 length 117440512
> flags 1 to space_info->flags 1 total_bytes 595591168 bytes_used
> 268435456 bytes_may_use 209723392
> 
> Another one.
> 
>          bonnie++-1793755 [009] ..... 28886.613446: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
>          bonnie++-1793755 [009] ..... 28886.613451: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 340393984
>          bonnie++-1793755 [009] ..... 28886.613453:
> btrfs_make_block_group: make bg offset 671285248 size 117440512 type 1
>          bonnie++-1793755 [009] ...1. 28886.613458:
> btrfs_add_bg_to_space_info: added bg offset 671285248 length 117440512
> flags 1 to space_info->flags 1 total_bytes 713031680 bytes_used
> 268435456 bytes_may_use 268435456
> 
> Another one.
> 
>          bonnie++-1793755 [009] ..... 28886.674953: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
>          bonnie++-1793755 [009] ..... 28886.674957: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 222953472
>          bonnie++-1793755 [009] ..... 28886.674959:
> btrfs_make_block_group: make bg offset 788725760 size 117440512 type 1
>          bonnie++-1793755 [009] ...1. 28886.674963:
> btrfs_add_bg_to_space_info: added bg offset 788725760 length 117440512
> flags 1 to space_info->flags 1 total_bytes 830472192 bytes_used
> 268435456 bytes_may_use 134217728
> 
> Another one.
> 
>          bonnie++-1793755 [009] ..... 28886.674981: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
>          bonnie++-1793755 [009] ..... 28886.674982: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 105512960
>          bonnie++-1793755 [009] ..... 28886.674983:
> btrfs_make_block_group: make bg offset 906166272 size 105512960 type 1
>          bonnie++-1793755 [009] ...1. 28886.674984:
> btrfs_add_bg_to_space_info: added bg offset 906166272 length 105512960
> flags 1 to space_info->flags 1 total_bytes 935985152 bytes_used
> 268435456 bytes_may_use 67108864
> 
> Another one, this time a bit smaller, ~100.6M, since we now have less space.
> 
>          bonnie++-1793758 [009] ..... 28891.962096: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
>          bonnie++-1793758 [009] ..... 28891.962103: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 12582912
>          bonnie++-1793758 [009] ..... 28891.962105:
> btrfs_make_block_group: make bg offset 1011679232 size 12582912 type 1
>          bonnie++-1793758 [009] ...1. 28891.962114:
> btrfs_add_bg_to_space_info: added bg offset 1011679232 length 12582912
> flags 1 to space_info->flags 1 total_bytes 948568064 bytes_used
> 268435456 bytes_may_use 8192
> 
> Another one, this one even smaller, 12M.
> 
>     kworker/u48:5-1792004 [011] ..... 28892.112802: btrfs_chunk_alloc:
> enter metadata chunk alloc
>     kworker/u48:5-1792004 [011] ..... 28892.112805: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 131072 dev_extent_want
> 536870912
>     kworker/u48:5-1792004 [011] ..... 28892.112806: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 131072 dev_extent_want
> 536870912 max_avail 0
> 
> 536870912 is 512M, the 256M * 2 (DUP) thing.
> max_avail is what find_free_dev_extent() returns to us in gather_device_info().
> 
> As a result it sets ctl->ndevs to 0, making decide_stripe_size() fail
> with -ENOSPC, and therefore metadata chunk allocation fails.
> 
>     kworker/u48:5-1792004 [011] ..... 28892.112807: btrfs_create_chunk:
> decide_stripe_size fail -ENOSPC
> 
> 
> Yes, what dmesg shows after the transaction abort does not include all
> those allocations.

Thanks a lot! This indeed solves my question on why the space dump is 
always so small.

So it's indeed lack of unallocated space, just the dump doesn't include 
those pending bgs.
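
As a sanity check on the numbers, here is a minimal sketch that adds up the block group lengths from the trace and compares them with the device size. The 1MiB reserved at the start of the device is an assumption on my side, not something taken from the trace, and the variable names are only illustrative:

```shell
#!/bin/bash
# Data block group lengths in bytes, from the btrfs_add_bg_to_space_info
# trace lines (flags 1 = DATA, single profile).
data_bgs=(8388608 117440512 117440512 117440512 117440512 117440512
          117440512 117440512 105512960 12582912)
metadata_bg=53673984     # METADATA|DUP block group created by mkfs
system_bg=8388608        # SYSTEM|DUP block group created by mkfs
device_size=1073741824   # 1G null_blk device
reserved_head=1048576    # assumption: first 1MiB of the device is never allocated
dev_extent_min=131072    # from the failing metadata gather_device_info trace

data_total=0
for len in "${data_bgs[@]}"; do
    data_total=$((data_total + len))
done

# DUP keeps two copies on the same device, so it uses twice the bg length.
allocated=$((data_total + 2 * metadata_bg + 2 * system_bg))
unallocated=$((device_size - reserved_head - allocated))

echo "data total:  $data_total"
echo "unallocated: $unallocated"
if [ "$unallocated" -lt "$dev_extent_min" ]; then
    echo "no room for a new metadata chunk (-ENOSPC)"
fi
```

The data total matches the total_bytes of the last data block group trace line, and nothing is left for find_free_dev_extent() to return.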

> My guess on that is that after the transaction aborts, pending block
> groups are gone and it's influencing the dump. But that's another
> thing to investigate.
> 
> But if we add a call to btrfs_dump_space_info_for_trans_abort() to
> decide_stripe_size() when it returns -ENOSPC, before we have a
> transaction abort:
> 
> [29972.409295] BTRFS info (device nullb0): dumping space info:
> [29972.409300] BTRFS info (device nullb0): space_info DATA (sub-group
> id 0) has 673341440 free, is not full
> [29972.409303] BTRFS info (device nullb0): space_info total=948568064,
> used=0, pinned=275226624, reserved=0, may_use=0, readonly=0
> zone_unusable=0
> [29972.409305] BTRFS info (device nullb0): space_info METADATA
> (sub-group id 0) has 3915776 free, is not full
> [29972.409306] BTRFS info (device nullb0): space_info total=53673984,
> used=163840, pinned=42827776, reserved=147456, may_use=6553600,
> readonly=65536 zone_unusable=0

This is way better than the existing dump.

Definitely something we can improve.

> [29972.409308] BTRFS info (device nullb0): space_info SYSTEM
> (sub-group id 0) has 7979008 free, is not full
> [29972.409310] BTRFS info (device nullb0): space_info total=8388608,
> used=16384, pinned=0, reserved=0, may_use=393216, readonly=0
> zone_unusable=0
> [29972.409311] BTRFS info (device nullb0): global_block_rsv: size
> 5767168 reserved 5767168
> [29972.409313] BTRFS info (device nullb0): trans_block_rsv: size 0 reserved 0
> [29972.409314] BTRFS info (device nullb0): chunk_block_rsv: size
> 393216 reserved 393216
> [29972.409315] BTRFS info (device nullb0): remap_block_rsv: size 0 reserved 0
> [29972.409316] BTRFS info (device nullb0): delayed_block_rsv: size 0 reserved 0
> 
> So here we see there's over 900M of data space.
> 
> So lowering the metadata overcommit limit when we can flush, helps
> getting rid of a ton of pinned space.

Yep, this explains the fix now.
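
To make the pinned space point concrete, a small sketch recomputing the "free" values from the counters in the dump. It assumes "free" is simply total minus every other counter shown on the line, and the variable names are only illustrative:

```shell
#!/bin/bash
# Counters in bytes, copied from the METADATA and DATA lines of the dump.
meta_total=53673984
meta_used=163840
meta_pinned=42827776
meta_reserved=147456
meta_may_use=6553600
meta_readonly=65536

data_total=948568064
data_pinned=275226624

# Assumption: free = total - (used + pinned + reserved + may_use + readonly).
meta_free=$((meta_total - meta_used - meta_pinned - meta_reserved \
             - meta_may_use - meta_readonly))
data_free=$((data_total - data_pinned))

# Share of the metadata space_info that is pinned (integer percent).
meta_pinned_pct=$((meta_pinned * 100 / meta_total))

echo "metadata free: $meta_free"
echo "data free:     $data_free"
echo "metadata pinned: ${meta_pinned_pct}%"
```

Both results agree with the "has ... free" values printed in the dump, and most of the metadata space is pinned, which only a transaction commit can release.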

Maybe you can update the commit message to include this version of space 
info dump.

Otherwise looks good to me.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> 
>>
>> Thanks,
>> Qu


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] btrfs: be less agressive with metadata overcommit when we can do full flushing
  2026-02-03 22:55         ` Filipe Manana
  2026-02-03 23:04           ` Qu Wenruo
@ 2026-02-03 23:06           ` Filipe Manana
  1 sibling, 0 replies; 16+ messages in thread
From: Filipe Manana @ 2026-02-03 23:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Feb 3, 2026 at 10:55 PM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Tue, Feb 3, 2026 at 9:59 PM Qu Wenruo <wqu@suse.com> wrote:
> >
> >
> >
> > On 2026/2/4 08:16, Filipe Manana wrote:
> > [...]
> > >
> > > We can allocate when we attempt to allocate a metadata extent.
> > > However here it fails because we really have no space:
> > >
> > > at calc_available_free_space() we subtract the data chunk size, and
> > > that leaves us at around 300M, which is not enough to allocate a
> > > metadata chunk in DUP profile (256M * 2 = 512M).
> > >
> >
> > For 1GB sized fs, 300MiB is enough for us to allocate a new metadata bg.
> > As the chunk size will be no larger than 10% of the fs.
> >
> > In fact I just tried this on a 1GB btrfs, forcing the creation of a
> > metadata bg by filling up the initial 51MiB metadata bg.
> >
> > The resulting bg chunk size is 112MiB:
> >
> >         item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
> >                 devid 1 total_bytes 1073741824 bytes_used 367394816
> >                                      ^^^ 1GiB
> >
> >         item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 13631488) itemoff 16105 itemsize 80
> >                 length 8388608 owner 2 stripe_len 65536 type DATA|single
> >         item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15993
> > itemsize 112
> >                 length 8388608 owner 2 stripe_len 65536 type SYSTEM|DUP
> >         item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15881
> > itemsize 112
> >                 length 53673984 owner 2 stripe_len 65536 type METADATA|DUP
> >                         ^^^ The one from mkfs
> >         item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 84082688) itemoff 15769
> > itemsize 112
> >                 length 117440512 owner 2 stripe_len 65536 type METADATA|DUP
> >                         ^^^ The new one, 112MiB.
> >
> > Mind explaining where the 256MiB requirement comes from?
>
> So I was looking at an old trace before.
>
> We fail to allocate a chunk because there's effectively no unallocated
> space at some point.
>
> We have a bunch of data chunks allocated by bonnie++ and we reach a
> point where a metadata chunk allocation fails.
> During the first metadata chunk allocation attempt,
> gather_device_info() finds no available space due to a bunch of
> pending chunks for data block groups (each with a size of 117440512
> bytes, except for the last two).
>
> The tracing:
>
>            mount-1793735 [011] ...1. 28877.261096:
> btrfs_add_bg_to_space_info: added bg offset 13631488 length 8388608
> flags 1 to space_info->flags 1 total_bytes 8388608 bytes_used 0
> bytes_may_use 0
>            mount-1793735 [011] ...1. 28877.261098:
> btrfs_add_bg_to_space_info: added bg offset 22020096 length 8388608
> flags 34 to space_info->flags 2 total_bytes 8388608 bytes_used 16384
> bytes_may_use 0
>            mount-1793735 [011] ...1. 28877.261100:
> btrfs_add_bg_to_space_info: added bg offset 30408704 length 53673984
> flags 36 to space_info->flags 4 total_bytes 53673984 bytes_used 131072
> bytes_may_use 0
>
> These are from loading the block groups created by mkfs during mount.
>
> Then when bonnie++ starts doing its thing:
>
>    kworker/u48:5-1792004 [011] ..... 28886.122050: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
>    kworker/u48:5-1792004 [011] ..... 28886.122053: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 927596544
>    kworker/u48:5-1792004 [011] ..... 28886.122055:
> btrfs_make_block_group: make bg offset 84082688 size 117440512 type 1
>    kworker/u48:5-1792004 [011] ...1. 28886.122064:
> btrfs_add_bg_to_space_info: added bg offset 84082688 length 117440512
> flags 1 to space_info->flags 1 total_bytes 125829120 bytes_used 0
> bytes_may_use 5251072
>
> First allocation of a data block group of 112M.
>
>    kworker/u48:5-1792004 [011] ..... 28886.192408: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
>    kworker/u48:5-1792004 [011] ..... 28886.192413: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 810156032
>    kworker/u48:5-1792004 [011] ..... 28886.192415:
> btrfs_make_block_group: make bg offset 201523200 size 117440512 type 1
>    kworker/u48:5-1792004 [011] ...1. 28886.192425:
> btrfs_add_bg_to_space_info: added bg offset 201523200 length 117440512
> flags 1 to space_info->flags 1 total_bytes 243269632 bytes_used 0
> bytes_may_use 122691584
>
> Another 112M data block group allocated.
>
>    kworker/u48:5-1792004 [011] ..... 28886.260935: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
>    kworker/u48:5-1792004 [011] ..... 28886.260941: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 692715520
>    kworker/u48:5-1792004 [011] ..... 28886.260943:
> btrfs_make_block_group: make bg offset 318963712 size 117440512 type 1
>    kworker/u48:5-1792004 [011] ...1. 28886.260954:
> btrfs_add_bg_to_space_info: added bg offset 318963712 length 117440512
> flags 1 to space_info->flags 1 total_bytes 360710144 bytes_used 0
> bytes_may_use 240132096
>
> Yet another one.
>
>         bonnie++-1793755 [010] ..... 28886.280407: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
>         bonnie++-1793755 [010] ..... 28886.280412: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 575275008
>         bonnie++-1793755 [010] ..... 28886.280414:
> btrfs_make_block_group: make bg offset 436404224 size 117440512 type 1
>         bonnie++-1793755 [010] ...1. 28886.280419:
> btrfs_add_bg_to_space_info: added bg offset 436404224 length 117440512
> flags 1 to space_info->flags 1 total_bytes 478150656 bytes_used 0
> bytes_may_use 268435456
>
> One more.
>
>    kworker/u48:5-1792004 [011] ..... 28886.566233: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
>    kworker/u48:5-1792004 [011] ..... 28886.566238: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 457834496
>    kworker/u48:5-1792004 [011] ..... 28886.566241:
> btrfs_make_block_group: make bg offset 553844736 size 117440512 type 1
>    kworker/u48:5-1792004 [011] ...1. 28886.566250:
> btrfs_add_bg_to_space_info: added bg offset 553844736 length 117440512
> flags 1 to space_info->flags 1 total_bytes 595591168 bytes_used
> 268435456 bytes_may_use 209723392
>
> Another one.
>
>         bonnie++-1793755 [009] ..... 28886.613446: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
>         bonnie++-1793755 [009] ..... 28886.613451: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 340393984
>         bonnie++-1793755 [009] ..... 28886.613453:
> btrfs_make_block_group: make bg offset 671285248 size 117440512 type 1
>         bonnie++-1793755 [009] ...1. 28886.613458:
> btrfs_add_bg_to_space_info: added bg offset 671285248 length 117440512
> flags 1 to space_info->flags 1 total_bytes 713031680 bytes_used
> 268435456 bytes_may_use 268435456
>
> Another one.
>
>         bonnie++-1793755 [009] ..... 28886.674953: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
>         bonnie++-1793755 [009] ..... 28886.674957: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 222953472
>         bonnie++-1793755 [009] ..... 28886.674959:
> btrfs_make_block_group: make bg offset 788725760 size 117440512 type 1
>         bonnie++-1793755 [009] ...1. 28886.674963:
> btrfs_add_bg_to_space_info: added bg offset 788725760 length 117440512
> flags 1 to space_info->flags 1 total_bytes 830472192 bytes_used
> 268435456 bytes_may_use 134217728
>
> Another one.
>
>         bonnie++-1793755 [009] ..... 28886.674981: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
>         bonnie++-1793755 [009] ..... 28886.674982: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 105512960
>         bonnie++-1793755 [009] ..... 28886.674983:
> btrfs_make_block_group: make bg offset 906166272 size 105512960 type 1
>         bonnie++-1793755 [009] ...1. 28886.674984:
> btrfs_add_bg_to_space_info: added bg offset 906166272 length 105512960
> flags 1 to space_info->flags 1 total_bytes 935985152 bytes_used
> 268435456 bytes_may_use 67108864
>
> Another one, this time a bit smaller, ~100.6M, since we now have less space.
>
>         bonnie++-1793758 [009] ..... 28891.962096: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
>         bonnie++-1793758 [009] ..... 28891.962103: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 12582912
>         bonnie++-1793758 [009] ..... 28891.962105:
> btrfs_make_block_group: make bg offset 1011679232 size 12582912 type 1
>         bonnie++-1793758 [009] ...1. 28891.962114:
> btrfs_add_bg_to_space_info: added bg offset 1011679232 length 12582912
> flags 1 to space_info->flags 1 total_bytes 948568064 bytes_used
> 268435456 bytes_may_use 8192
>
> Another one, this one even smaller, 12M.
>
>    kworker/u48:5-1792004 [011] ..... 28892.112802: btrfs_chunk_alloc:
> enter metadata chunk alloc
>    kworker/u48:5-1792004 [011] ..... 28892.112805: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 131072 dev_extent_want
> 536870912
>    kworker/u48:5-1792004 [011] ..... 28892.112806: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 131072 dev_extent_want
> 536870912 max_avail 0
>
> 536870912 is 512M, the 256M * 2 (DUP) thing.
> max_avail is what find_free_dev_extent() returns to us in gather_device_info().
>
> As a result it sets ctl->ndevs to 0, making decide_stripe_size() fail
> with -ENOSPC, and therefore metadata chunk allocation fails.
>
>    kworker/u48:5-1792004 [011] ..... 28892.112807: btrfs_create_chunk:
> decide_stripe_size fail -ENOSPC
>
>
> Yes, what dmesg shows after the transaction abort does not include all
> those allocations.
> My guess on that is that after the transaction aborts, pending block
> groups are gone and it's influencing the dump. But that's another
> thing to investigate.

Ok, so it's because the cleaner happens to kick in shortly after the
first metadata chunk allocation fails with -ENOSPC and removes the
data block groups that became unused in the meantime.
The data space_info's bytes_may_use started to drop and, not shown in
the trace, its bytes_readonly started to increase because the cleaner
turned the empty data block groups RO for deletion.

The cleaner then finishes deleting all these data block groups
before the transaction abort dumps the space infos.

>
> But if we add a call to btrfs_dump_space_info_for_trans_abort() to
> decide_stripe_size() when it returns -ENOSPC, before we have a
> transaction abort:
>
> [29972.409295] BTRFS info (device nullb0): dumping space info:
> [29972.409300] BTRFS info (device nullb0): space_info DATA (sub-group
> id 0) has 673341440 free, is not full
> [29972.409303] BTRFS info (device nullb0): space_info total=948568064,
> used=0, pinned=275226624, reserved=0, may_use=0, readonly=0
> zone_unusable=0
> [29972.409305] BTRFS info (device nullb0): space_info METADATA
> (sub-group id 0) has 3915776 free, is not full
> [29972.409306] BTRFS info (device nullb0): space_info total=53673984,
> used=163840, pinned=42827776, reserved=147456, may_use=6553600,
> readonly=65536 zone_unusable=0
> [29972.409308] BTRFS info (device nullb0): space_info SYSTEM
> (sub-group id 0) has 7979008 free, is not full
> [29972.409310] BTRFS info (device nullb0): space_info total=8388608,
> used=16384, pinned=0, reserved=0, may_use=393216, readonly=0
> zone_unusable=0
> [29972.409311] BTRFS info (device nullb0): global_block_rsv: size
> 5767168 reserved 5767168
> [29972.409313] BTRFS info (device nullb0): trans_block_rsv: size 0 reserved 0
> [29972.409314] BTRFS info (device nullb0): chunk_block_rsv: size
> 393216 reserved 393216
> [29972.409315] BTRFS info (device nullb0): remap_block_rsv: size 0 reserved 0
> [29972.409316] BTRFS info (device nullb0): delayed_block_rsv: size 0 reserved 0
>
> So here we see there's over 900M of data space.
>
> So lowering the metadata overcommit limit when we can flush, helps
> getting rid of a ton of pinned space.
>
> >
> > Thanks,
> > Qu

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] btrfs: be less agressive with metadata overcommit when we can do full flushing
  2026-02-03 23:04           ` Qu Wenruo
@ 2026-02-03 23:09             ` Filipe Manana
  0 siblings, 0 replies; 16+ messages in thread
From: Filipe Manana @ 2026-02-03 23:09 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Feb 3, 2026 at 11:04 PM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> On 2026/2/4 09:25, Filipe Manana wrote:
> > On Tue, Feb 3, 2026 at 9:59 PM Qu Wenruo <wqu@suse.com> wrote:
> >>
> >>
> >>
> >> On 2026/2/4 08:16, Filipe Manana wrote:
> >> [...]
> >>>
> >>> We can allocate when we attempt to allocate a metadata extent.
> >>> However here it fails because we really have no space:
> >>>
> >>> at calc_available_free_space() we subtract the data chunk size, and
> >>> that leaves us at around 300M, which is not enough to allocate a
> >>> metadata chunk in DUP profile (256M * 2 = 512M).
> >>>
> >>
> >> For 1GB sized fs, 300MiB is enough for us to allocate a new metadata bg.
> >> As the chunk size will be no larger than 10% of the fs.
> >>
> >> In fact I just tried this on a 1GB btrfs, forcing the creation of a
> >> metadata bg by filling up the initial 51MiB metadata bg.
> >>
> >> The resulting bg chunk size is 112MiB:
> >>
> >>          item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
> >>                  devid 1 total_bytes 1073741824 bytes_used 367394816
> >>                                       ^^^ 1GiB
> >>
> >>          item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 13631488) itemoff 16105 itemsize 80
> >>                  length 8388608 owner 2 stripe_len 65536 type DATA|single
> >>          item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15993
> >> itemsize 112
> >>                  length 8388608 owner 2 stripe_len 65536 type SYSTEM|DUP
> >>          item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15881
> >> itemsize 112
> >>                  length 53673984 owner 2 stripe_len 65536 type METADATA|DUP
> >>                          ^^^ The one from mkfs
> >>          item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 84082688) itemoff 15769
> >> itemsize 112
> >>                  length 117440512 owner 2 stripe_len 65536 type METADATA|DUP
> >>                          ^^^ The new one, 112MiB.
> >>
> >> Mind explaining where the 256MiB requirement comes from?
> >
> > So I was looking at an old trace before.
> >
> > We fail to allocate a chunk because there's effectively no unallocated
> > space at some point.
> >
> > We have a bunch of data chunks allocated by bonnie++ and we reach a
> > point where a metadata chunk allocation fails.
> > During the first metadata chunk allocation attempt,
> > gather_device_info() finds no available space due to a bunch of
> > pending chunks for data block groups (each with a size of 117440512
> > bytes, except for the last two).
> >
> > The tracing:
> >
> >             mount-1793735 [011] ...1. 28877.261096:
> > btrfs_add_bg_to_space_info: added bg offset 13631488 length 8388608
> > flags 1 to space_info->flags 1 total_bytes 8388608 bytes_used 0
> > bytes_may_use 0
> >             mount-1793735 [011] ...1. 28877.261098:
> > btrfs_add_bg_to_space_info: added bg offset 22020096 length 8388608
> > flags 34 to space_info->flags 2 total_bytes 8388608 bytes_used 16384
> > bytes_may_use 0
> >             mount-1793735 [011] ...1. 28877.261100:
> > btrfs_add_bg_to_space_info: added bg offset 30408704 length 53673984
> > flags 36 to space_info->flags 4 total_bytes 53673984 bytes_used 131072
> > bytes_may_use 0
> >
> > These are from loading the block groups created by mkfs during mount.
> >
> > Then when bonnie++ starts doing its thing:
> >
> >     kworker/u48:5-1792004 [011] ..... 28886.122050: btrfs_create_chunk:
> > gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> > 1073741824
> >     kworker/u48:5-1792004 [011] ..... 28886.122053: btrfs_create_chunk:
> > gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> > 1073741824 max_avail 927596544
> >     kworker/u48:5-1792004 [011] ..... 28886.122055:
> > btrfs_make_block_group: make bg offset 84082688 size 117440512 type 1
> >     kworker/u48:5-1792004 [011] ...1. 28886.122064:
> > btrfs_add_bg_to_space_info: added bg offset 84082688 length 117440512
> > flags 1 to space_info->flags 1 total_bytes 125829120 bytes_used 0
> > bytes_may_use 5251072
> >
> > First allocation of a data block group of 112M.
> >
> >     kworker/u48:5-1792004 [011] ..... 28886.192408: btrfs_create_chunk:
> > gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> > 1073741824
> >     kworker/u48:5-1792004 [011] ..... 28886.192413: btrfs_create_chunk:
> > gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> > 1073741824 max_avail 810156032
> >     kworker/u48:5-1792004 [011] ..... 28886.192415:
> > btrfs_make_block_group: make bg offset 201523200 size 117440512 type 1
> >     kworker/u48:5-1792004 [011] ...1. 28886.192425:
> > btrfs_add_bg_to_space_info: added bg offset 201523200 length 117440512
> > flags 1 to space_info->flags 1 total_bytes 243269632 bytes_used 0
> > bytes_may_use 122691584
> >
> > Another 112M data block group allocated.
> >
> >     kworker/u48:5-1792004 [011] ..... 28886.260935: btrfs_create_chunk:
> > gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> > 1073741824
> >     kworker/u48:5-1792004 [011] ..... 28886.260941: btrfs_create_chunk:
> > gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> > 1073741824 max_avail 692715520
> >     kworker/u48:5-1792004 [011] ..... 28886.260943:
> > btrfs_make_block_group: make bg offset 318963712 size 117440512 type 1
> >     kworker/u48:5-1792004 [011] ...1. 28886.260954:
> > btrfs_add_bg_to_space_info: added bg offset 318963712 length 117440512
> > flags 1 to space_info->flags 1 total_bytes 360710144 bytes_used 0
> > bytes_may_use 240132096
> >
> > Yet another one.
> >
> >          bonnie++-1793755 [010] ..... 28886.280407: btrfs_create_chunk:
> > gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> > 1073741824
> >          bonnie++-1793755 [010] ..... 28886.280412: btrfs_create_chunk:
> > gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> > 1073741824 max_avail 575275008
> >          bonnie++-1793755 [010] ..... 28886.280414:
> > btrfs_make_block_group: make bg offset 436404224 size 117440512 type 1
> >          bonnie++-1793755 [010] ...1. 28886.280419:
> > btrfs_add_bg_to_space_info: added bg offset 436404224 length 117440512
> > flags 1 to space_info->flags 1 total_bytes 478150656 bytes_used 0
> > bytes_may_use 268435456
> >
> > One more.
> >
> >     kworker/u48:5-1792004 [011] ..... 28886.566233: btrfs_create_chunk:
> > gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> > 1073741824
> >     kworker/u48:5-1792004 [011] ..... 28886.566238: btrfs_create_chunk:
> > gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> > 1073741824 max_avail 457834496
> >     kworker/u48:5-1792004 [011] ..... 28886.566241:
> > btrfs_make_block_group: make bg offset 553844736 size 117440512 type 1
> >     kworker/u48:5-1792004 [011] ...1. 28886.566250:
> > btrfs_add_bg_to_space_info: added bg offset 553844736 length 117440512
> > flags 1 to space_info->flags 1 total_bytes 595591168 bytes_used
> > 268435456 bytes_may_use 209723392
> >
> > Another one.
> >
> >          bonnie++-1793755 [009] ..... 28886.613446: btrfs_create_chunk:
> > gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> > 1073741824
> >          bonnie++-1793755 [009] ..... 28886.613451: btrfs_create_chunk:
> > gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> > 1073741824 max_avail 340393984
> >          bonnie++-1793755 [009] ..... 28886.613453:
> > btrfs_make_block_group: make bg offset 671285248 size 117440512 type 1
> >          bonnie++-1793755 [009] ...1. 28886.613458:
> > btrfs_add_bg_to_space_info: added bg offset 671285248 length 117440512
> > flags 1 to space_info->flags 1 total_bytes 713031680 bytes_used
> > 268435456 bytes_may_use 268435456
> >
> > Another one.
> >
> >          bonnie++-1793755 [009] ..... 28886.674953: btrfs_create_chunk:
> > gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> > 1073741824
> >          bonnie++-1793755 [009] ..... 28886.674957: btrfs_create_chunk:
> > gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> > 1073741824 max_avail 222953472
> >          bonnie++-1793755 [009] ..... 28886.674959:
> > btrfs_make_block_group: make bg offset 788725760 size 117440512 type 1
> >          bonnie++-1793755 [009] ...1. 28886.674963:
> > btrfs_add_bg_to_space_info: added bg offset 788725760 length 117440512
> > flags 1 to space_info->flags 1 total_bytes 830472192 bytes_used
> > 268435456 bytes_may_use 134217728
> >
> > Another one.
> >
> >          bonnie++-1793755 [009] ..... 28886.674981: btrfs_create_chunk:
> > gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> > 1073741824
> >          bonnie++-1793755 [009] ..... 28886.674982: btrfs_create_chunk:
> > gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> > 1073741824 max_avail 105512960
> >          bonnie++-1793755 [009] ..... 28886.674983:
> > btrfs_make_block_group: make bg offset 906166272 size 105512960 type 1
> >          bonnie++-1793755 [009] ...1. 28886.674984:
> > btrfs_add_bg_to_space_info: added bg offset 906166272 length 105512960
> > flags 1 to space_info->flags 1 total_bytes 935985152 bytes_used
> > 268435456 bytes_may_use 67108864
> >
> > Another one, this time a bit smaller, ~100.6M, since we now have less space.
> >
> >          bonnie++-1793758 [009] ..... 28891.962096: btrfs_create_chunk:
> > gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> > 1073741824
> >          bonnie++-1793758 [009] ..... 28891.962103: btrfs_create_chunk:
> > gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> > 1073741824 max_avail 12582912
> >          bonnie++-1793758 [009] ..... 28891.962105:
> > btrfs_make_block_group: make bg offset 1011679232 size 12582912 type 1
> >          bonnie++-1793758 [009] ...1. 28891.962114:
> > btrfs_add_bg_to_space_info: added bg offset 1011679232 length 12582912
> > flags 1 to space_info->flags 1 total_bytes 948568064 bytes_used
> > 268435456 bytes_may_use 8192
> >
> > Another one, this one even smaller, 12M.
> >
> >     kworker/u48:5-1792004 [011] ..... 28892.112802: btrfs_chunk_alloc:
> > enter metadata chunk alloc
> >     kworker/u48:5-1792004 [011] ..... 28892.112805: btrfs_create_chunk:
> > gather_device_info 1 ctl->dev_extent_min = 131072 dev_extent_want
> > 536870912
> >     kworker/u48:5-1792004 [011] ..... 28892.112806: btrfs_create_chunk:
> > gather_device_info 2 ctl->dev_extent_min = 131072 dev_extent_want
> > 536870912 max_avail 0
> >
> > 536870912 is 512M, the 256M * 2 (DUP) thing.
> > max_avail is what find_free_dev_extent() returns to us in gather_device_info().
> >
> > As a result it sets ctl->ndevs to 0, making decide_stripe_size() fail
> > with -ENOSPC, and therefore metadata chunk allocation fails.
> >
> >     kworker/u48:5-1792004 [011] ..... 28892.112807: btrfs_create_chunk:
> > decide_stripe_size fail -ENOSPC
> >
> >
> > Yes, what dmesg shows after the transaction abort does not include all
> > those allocations.
>
> Thanks a lot! This indeed solves my question on why the space dump is
> always so small.
>
> So it's indeed lack of unallocated space, just the dump doesn't include
> those pending bgs.
>
> > My guess on that is that after the transaction aborts, pending block
> > groups are gone and it's influencing the dump. But that's another
> > thing to investigate.
> >
> > But if we add a call to btrfs_dump_space_info_for_trans_abort() to
> > decide_stripe_size() when it returns -ENOSPC, before we have a
> > transaction abort:
> >
> > [29972.409295] BTRFS info (device nullb0): dumping space info:
> > [29972.409300] BTRFS info (device nullb0): space_info DATA (sub-group
> > id 0) has 673341440 free, is not full
> > [29972.409303] BTRFS info (device nullb0): space_info total=948568064,
> > used=0, pinned=275226624, reserved=0, may_use=0, readonly=0
> > zone_unusable=0
> > [29972.409305] BTRFS info (device nullb0): space_info METADATA
> > (sub-group id 0) has 3915776 free, is not full
> > [29972.409306] BTRFS info (device nullb0): space_info total=53673984,
> > used=163840, pinned=42827776, reserved=147456, may_use=6553600,
> > readonly=65536 zone_unusable=0
>
> This is way better than the existing dump.
>
> Definitely something we can improve.
>
> > [29972.409308] BTRFS info (device nullb0): space_info SYSTEM
> > (sub-group id 0) has 7979008 free, is not full
> > [29972.409310] BTRFS info (device nullb0): space_info total=8388608,
> > used=16384, pinned=0, reserved=0, may_use=393216, readonly=0
> > zone_unusable=0
> > [29972.409311] BTRFS info (device nullb0): global_block_rsv: size
> > 5767168 reserved 5767168
> > [29972.409313] BTRFS info (device nullb0): trans_block_rsv: size 0 reserved 0
> > [29972.409314] BTRFS info (device nullb0): chunk_block_rsv: size
> > 393216 reserved 393216
> > [29972.409315] BTRFS info (device nullb0): remap_block_rsv: size 0 reserved 0
> > [29972.409316] BTRFS info (device nullb0): delayed_block_rsv: size 0 reserved 0
> >
> > So here we see there's over 900M of data space.
> >
> > So lowering the metadata overcommit limit when we can flush, helps
> > getting rid of a ton of pinned space.
>
> Yep, this explains the fix now.
>
> Maybe you can update the commit message to include this version of space
> info dump.

Sure, I'll resend tomorrow with those added: the custom traces, the
space_info dumps taken right after the metadata chunk allocation
failure and before the transaction abort, and a Link tag to the thread.

Thanks.

>
> Otherwise looks good to me.
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
> Thanks,
> Qu
>
> >
> >>
> >> Thanks,
> >> Qu
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2 0/3] btrfs: a few space reservation fixes and comment update
  2026-02-03 13:02 [PATCH 0/3] btrfs: a few space reservation fixes and comment update fdmanana
                   ` (2 preceding siblings ...)
  2026-02-03 13:02 ` [PATCH 3/3] btrfs: update comment for BTRFS_RESERVE_NO_FLUSH fdmanana
@ 2026-02-03 23:38 ` fdmanana
  2026-02-03 23:38   ` [PATCH v2 1/3] btrfs: be less agressive with metadata overcommit when we can do full flushing fdmanana
                     ` (2 more replies)
  3 siblings, 3 replies; 16+ messages in thread
From: fdmanana @ 2026-02-03 23:38 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

A couple fixes for metadata space reservation and update a comment.
Details in the changelogs.

V2: Updated changelog of patch 1/3 with much more details about why
    metadata chunk allocation fails. Added Reviewed-by tags to patches
    1/3 and 2/3.

Filipe Manana (3):
  btrfs: be less agressive with metadata overcommit when we can do full flushing
  btrfs: don't allow log trees to consume global reserve or overcommit metadata
  btrfs: update comment for BTRFS_RESERVE_NO_FLUSH

 fs/btrfs/block-rsv.c  | 25 +++++++++++++++++++++++++
 fs/btrfs/space-info.c |  7 ++++---
 fs/btrfs/space-info.h | 19 ++++++++++++++++++-
 3 files changed, 47 insertions(+), 4 deletions(-)

-- 
2.47.2


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2 1/3] btrfs: be less agressive with metadata overcommit when we can do full flushing
  2026-02-03 23:38 ` [PATCH v2 0/3] btrfs: a few space reservation fixes and comment update fdmanana
@ 2026-02-03 23:38   ` fdmanana
  2026-02-03 23:39   ` [PATCH v2 2/3] btrfs: don't allow log trees to consume global reserve or overcommit metadata fdmanana
  2026-02-03 23:39   ` [PATCH v2 3/3] btrfs: update comment for BTRFS_RESERVE_NO_FLUSH fdmanana
  2 siblings, 0 replies; 16+ messages in thread
From: fdmanana @ 2026-02-03 23:38 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Over the years we have often gotten reports of some -ENOSPC failure while
updating metadata that leads to a transaction abort. I have seen this
happen for filesystems of all sizes and with workloads that are very
user/customer specific and that we were unable to reproduce, but
Aleksandar recently reported a simple way to reproduce this with a 1G
filesystem and using the bonnie++ benchmark tool. The following test
script reproduces the failure:

    $ cat test.sh
    #!/bin/bash

    # Create and use a 1G null block device, memory backed, otherwise
    # the test takes a very long time.
    modprobe null_blk nr_devices="0"
    null_dev="/sys/kernel/config/nullb/nullb0"
    mkdir "$null_dev"
    size=$((1 * 1024)) # in MB
    echo 2 > "$null_dev/submit_queues"
    echo "$size" > "$null_dev/size"
    echo 1 > "$null_dev/memory_backed"
    echo 1 > "$null_dev/discard"
    echo 1 > "$null_dev/power"

    DEV=/dev/nullb0
    MNT=/mnt/nullb0

    mkfs.btrfs -f $DEV
    mount $DEV $MNT

    mkdir $MNT/test/
    bonnie++ -d $MNT/test/ -m BTRFS -u 0 -s 256M -r 128M -b

    umount $MNT

    echo 0 > "$null_dev/power"
    rmdir "$null_dev"

When running this, bonnie++ fails in the phase where it deletes test
directories and files:

    $ ./test.sh
    (...)
    Using uid:0, gid:0.
    Writing a byte at a time...done
    Writing intelligently...done
    Rewriting...done
    Reading a byte at a time...done
    Reading intelligently...done
    start 'em...done...done...done...done...done...
    Create files in sequential order...done.
    Stat files in sequential order...done.
    Delete files in sequential order...done.
    Create files in random order...done.
    Stat files in random order...done.
    Delete files in random order...Can't sync directory, turning off dir-sync.
    Can't delete file 9Bq7sr0000000338
    Cleaning up test directory after error.
    Bonnie: drastic I/O error (rmdir): Read-only file system

And in the syslog/dmesg we can see the following transaction abort trace:

    [161915.501506] BTRFS warning (device nullb0): Skipping commit of aborted transaction.
    [161915.502983] ------------[ cut here ]------------
    [161915.503832] BTRFS: Transaction aborted (error -28)
    [161915.504748] WARNING: fs/btrfs/transaction.c:2045 at btrfs_commit_transaction+0xa21/0xd30 [btrfs], CPU#11: bonnie++/3377975
    [161915.506786] Modules linked in: btrfs dm_zero dm_snapshot (...)
    [161915.518759] CPU: 11 UID: 0 PID: 3377975 Comm: bonnie++ Tainted: G        W           6.19.0-rc7-btrfs-next-224+ #4 PREEMPT(full)
    [161915.520857] Tainted: [W]=WARN
    [161915.521405] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
    [161915.523414] RIP: 0010:btrfs_commit_transaction+0xa24/0xd30 [btrfs]
    [161915.524630] Code: 48 8b 7c 24 (...)
    [161915.526982] RSP: 0018:ffffd3fe8206fda8 EFLAGS: 00010292
    [161915.527707] RAX: 0000000000000002 RBX: ffff8f4886d3c000 RCX: 0000000000000000
    [161915.528723] RDX: 0000000002040001 RSI: 00000000ffffffe4 RDI: ffffffffc088f780
    [161915.529691] RBP: ffff8f4f5adae7e0 R08: 0000000000000000 R09: ffffd3fe8206fb90
    [161915.530842] R10: ffff8f4f9c1fffa8 R11: 0000000000000003 R12: 00000000ffffffe4
    [161915.532027] R13: ffff8f4ef2cf2400 R14: ffff8f4f5adae708 R15: ffff8f4f62d18000
    [161915.533229] FS:  00007ff93112a780(0000) GS:ffff8f4ff63ee000(0000) knlGS:0000000000000000
    [161915.534611] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [161915.535575] CR2: 00005571b3072000 CR3: 0000000176080005 CR4: 0000000000370ef0
    [161915.536758] Call Trace:
    [161915.537185]  <TASK>
    [161915.537575]  btrfs_sync_file+0x431/0x530 [btrfs]
    [161915.538473]  do_fsync+0x39/0x80
    [161915.539042]  __x64_sys_fsync+0xf/0x20
    [161915.539750]  do_syscall_64+0x50/0xf20
    [161915.540396]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
    [161915.541301] RIP: 0033:0x7ff930ca49ee
    [161915.541904] Code: 08 0f 85 f5 (...)
    [161915.544830] RSP: 002b:00007ffd94291f38 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
    [161915.546152] RAX: ffffffffffffffda RBX: 00007ff93112a780 RCX: 00007ff930ca49ee
    [161915.547263] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
    [161915.548383] RBP: 0000000000000dab R08: 0000000000000000 R09: 0000000000000000
    [161915.549853] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd94291fb0
    [161915.551196] R13: 00007ffd94292350 R14: 0000000000000001 R15: 00007ffd94292340
    [161915.552161]  </TASK>
    [161915.552457] ---[ end trace 0000000000000000 ]---
    [161915.553232] BTRFS info (device nullb0 state A): dumping space info:
    [161915.553236] BTRFS info (device nullb0 state A): space_info DATA (sub-group id 0) has 12582912 free, is not full
    [161915.553239] BTRFS info (device nullb0 state A): space_info total=12582912, used=0, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
    [161915.553243] BTRFS info (device nullb0 state A): space_info METADATA (sub-group id 0) has -5767168 free, is full
    [161915.553245] BTRFS info (device nullb0 state A): space_info total=53673984, used=6635520, pinned=46956544, reserved=16384, may_use=5767168, readonly=65536 zone_unusable=0
    [161915.553251] BTRFS info (device nullb0 state A): space_info SYSTEM (sub-group id 0) has 8355840 free, is not full
    [161915.553254] BTRFS info (device nullb0 state A): space_info total=8388608, used=16384, pinned=16384, reserved=0, may_use=0, readonly=0 zone_unusable=0
    [161915.553257] BTRFS info (device nullb0 state A): global_block_rsv: size 5767168 reserved 5767168
    [161915.553261] BTRFS info (device nullb0 state A): trans_block_rsv: size 0 reserved 0
    [161915.553263] BTRFS info (device nullb0 state A): chunk_block_rsv: size 0 reserved 0
    [161915.553265] BTRFS info (device nullb0 state A): remap_block_rsv: size 0 reserved 0
    [161915.553268] BTRFS info (device nullb0 state A): delayed_block_rsv: size 0 reserved 0
    [161915.553270] BTRFS info (device nullb0 state A): delayed_refs_rsv: size 0 reserved 0
    [161915.553272] BTRFS: error (device nullb0 state A) in cleanup_transaction:2045: errno=-28 No space left
    [161915.554463] BTRFS info (device nullb0 state EA): forced readonly

The problem is that we allow for a very aggressive metadata overcommit,
about 1/8th of the currently available space, even when the task
attempting the reservation allows for full flushing. Over time this allows
more and more tasks to overcommit and join the same transaction without
triggering a transaction commit to release pinned extents. Eventually this
leads to a transaction abort when attempting some tree update, as the
extent allocator is not able to find any available metadata extent and
it's not able to allocate a new metadata block group either (not enough
unallocated space for that).
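
To make the role of pinned extents concrete, here is a minimal sketch of
the arithmetic behind the METADATA line of the abort dump above. The
struct and function names are hypothetical, and the formula (total minus
every other counter) is an assumption that happens to reproduce the
"-5767168 free" value printed in the dump:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the counters printed in a space_info dump line. */
struct space_counters {
	int64_t total, used, pinned, reserved, may_use, readonly, zone_unusable;
};

/* Free bytes, assuming "free" is the total minus every other counter. */
static int64_t space_free(const struct space_counters *c)
{
	assert(c);
	return c->total - c->used - c->pinned - c->reserved -
	       c->may_use - c->readonly - c->zone_unusable;
}

/* Share of the space_info that is pinned, as an integer percentage. */
static int64_t pinned_percent(const struct space_counters *c)
{
	assert(c && c->total > 0);
	return c->pinned * 100 / c->total;
}

/* METADATA counters copied from the transaction abort dump. */
static const struct space_counters abort_metadata = {
	.total = 53673984, .used = 6635520, .pinned = 46956544,
	.reserved = 16384, .may_use = 5767168, .readonly = 65536,
	.zone_unusable = 0,
};
```

With these numbers, used + pinned + reserved + readonly already equals
the total, so the whole global reserve (may_use) is overcommitted, and
most of the space is pinned, which only a transaction commit releases.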

Fix this by allowing the overcommit to be up to 1/64th of the available
(unallocated) space instead and for that limit to apply to both types of
full flushing, BTRFS_RESERVE_FLUSH_ALL and BTRFS_RESERVE_FLUSH_ALL_STEAL.
This way we get more frequent transaction commits to release pinned
extents in case our caller is in a context where full flushing is allowed.
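
As a rough illustration of the change, the sketch below models the
overcommit limit before and after this patch. The enum and function
names are hypothetical stand-ins for enum btrfs_reserve_flush_enum and
the shift done in calc_available_free_space(), everything else that
function does is left out:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's enum btrfs_reserve_flush_enum. */
enum flush_mode {
	FLUSH_NONE,      /* models any mode that can not do full flushing */
	FLUSH_ALL,       /* models BTRFS_RESERVE_FLUSH_ALL */
	FLUSH_ALL_STEAL, /* models BTRFS_RESERVE_FLUSH_ALL_STEAL */
};

/* Before the patch: only FLUSH_ALL is limited, to 1/8 of the space. */
static uint64_t overcommit_limit_old(uint64_t avail, enum flush_mode flush)
{
	if (flush == FLUSH_ALL)
		return avail >> 3;
	return avail >> 1;
}

/* After the patch: both full flushing modes are limited to 1/64. */
static uint64_t overcommit_limit_new(uint64_t avail, enum flush_mode flush)
{
	if (flush == FLUSH_ALL || flush == FLUSH_ALL_STEAL)
		return avail >> 6;
	return avail >> 1;
}

/* Example with 64M of available space (an arbitrary value). */
static void overcommit_example(void)
{
	const uint64_t avail = 64ULL << 20;

	assert(overcommit_limit_old(avail, FLUSH_ALL) == (8ULL << 20));
	assert(overcommit_limit_new(avail, FLUSH_ALL) == (1ULL << 20));
	/* FLUSH_ALL_STEAL used to get the 1/2 limit, now it gets 1/64. */
	assert(overcommit_limit_old(avail, FLUSH_ALL_STEAL) == (32ULL << 20));
	assert(overcommit_limit_new(avail, FLUSH_ALL_STEAL) == (1ULL << 20));
}
```

In this model a task that can do full flushing hits the limit 8 times
sooner than before, which is what makes flushing, and therefore
transaction commits, happen more often.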

Note that the space infos dump in the dmesg/syslog right after the
transaction abort gives the wrong idea that we had plenty of unallocated
space when the abort happened. During the bonnie++ workload we had a
metadata chunk allocation attempt and it failed with -ENOSPC because at
that time we had a bunch of data block groups allocated. Those block
groups then became empty and got deleted by the cleaner kthread after the
metadata chunk allocation failed with -ENOSPC and before the transaction
abort happened and dumped the space infos.

Custom tracing (some trace_printk() calls spread in strategic places)
was used to check that:

  mount-1793735 [011] ...1. 28877.261096: btrfs_add_bg_to_space_info: added bg offset 13631488 length 8388608 flags 1 to space_info->flags 1 total_bytes 8388608 bytes_used 0 bytes_may_use 0
  mount-1793735 [011] ...1. 28877.261098: btrfs_add_bg_to_space_info: added bg offset 22020096 length 8388608 flags 34 to space_info->flags 2 total_bytes 8388608 bytes_used 16384 bytes_may_use 0
  mount-1793735 [011] ...1. 28877.261100: btrfs_add_bg_to_space_info: added bg offset 30408704 length 53673984 flags 36 to space_info->flags 4 total_bytes 53673984 bytes_used 131072 bytes_may_use 0

These are from loading the block groups created by mkfs during mount.

Then when bonnie++ starts doing its thing:

  kworker/u48:5-1792004 [011] ..... 28886.122050: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
  kworker/u48:5-1792004 [011] ..... 28886.122053: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 927596544
  kworker/u48:5-1792004 [011] ..... 28886.122055: btrfs_make_block_group: make bg offset 84082688 size 117440512 type 1
  kworker/u48:5-1792004 [011] ...1. 28886.122064: btrfs_add_bg_to_space_info: added bg offset 84082688 length 117440512 flags 1 to space_info->flags 1 total_bytes 125829120 bytes_used 0 bytes_may_use 5251072

First allocation of a data block group of 112M.

  kworker/u48:5-1792004 [011] ..... 28886.192408: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
  kworker/u48:5-1792004 [011] ..... 28886.192413: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 810156032
  kworker/u48:5-1792004 [011] ..... 28886.192415: btrfs_make_block_group: make bg offset 201523200 size 117440512 type 1
  kworker/u48:5-1792004 [011] ...1. 28886.192425: btrfs_add_bg_to_space_info: added bg offset 201523200 length 117440512 flags 1 to space_info->flags 1 total_bytes 243269632 bytes_used 0 bytes_may_use 122691584

Another 112M data block group allocated.

  kworker/u48:5-1792004 [011] ..... 28886.260935: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
  kworker/u48:5-1792004 [011] ..... 28886.260941: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 692715520
  kworker/u48:5-1792004 [011] ..... 28886.260943: btrfs_make_block_group: make bg offset 318963712 size 117440512 type 1
  kworker/u48:5-1792004 [011] ...1. 28886.260954: btrfs_add_bg_to_space_info: added bg offset 318963712 length 117440512 flags 1 to space_info->flags 1 total_bytes 360710144 bytes_used 0 bytes_may_use 240132096

Yet another one.

  bonnie++-1793755 [010] ..... 28886.280407: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
  bonnie++-1793755 [010] ..... 28886.280412: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 575275008
  bonnie++-1793755 [010] ..... 28886.280414: btrfs_make_block_group: make bg offset 436404224 size 117440512 type 1
  bonnie++-1793755 [010] ...1. 28886.280419: btrfs_add_bg_to_space_info: added bg offset 436404224 length 117440512 flags 1 to space_info->flags 1 total_bytes 478150656 bytes_used 0 bytes_may_use 268435456

One more.

  kworker/u48:5-1792004 [011] ..... 28886.566233: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
  kworker/u48:5-1792004 [011] ..... 28886.566238: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 457834496
  kworker/u48:5-1792004 [011] ..... 28886.566241: btrfs_make_block_group: make bg offset 553844736 size 117440512 type 1
  kworker/u48:5-1792004 [011] ...1. 28886.566250: btrfs_add_bg_to_space_info: added bg offset 553844736 length 117440512 flags 1 to space_info->flags 1 total_bytes 595591168 bytes_used 268435456 bytes_may_use 209723392

Another one.

  bonnie++-1793755 [009] ..... 28886.613446: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
  bonnie++-1793755 [009] ..... 28886.613451: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 340393984
  bonnie++-1793755 [009] ..... 28886.613453: btrfs_make_block_group: make bg offset 671285248 size 117440512 type 1
  bonnie++-1793755 [009] ...1. 28886.613458: btrfs_add_bg_to_space_info: added bg offset 671285248 length 117440512 flags 1 to space_info->flags 1 total_bytes 713031680 bytes_used 268435456 bytes_may_use 268435456

Another one.

  bonnie++-1793755 [009] ..... 28886.674953: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
  bonnie++-1793755 [009] ..... 28886.674957: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 222953472
  bonnie++-1793755 [009] ..... 28886.674959: btrfs_make_block_group: make bg offset 788725760 size 117440512 type 1
  bonnie++-1793755 [009] ...1. 28886.674963: btrfs_add_bg_to_space_info: added bg offset 788725760 length 117440512 flags 1 to space_info->flags 1 total_bytes 830472192 bytes_used 268435456 bytes_may_use 134217728

Another one.

  bonnie++-1793755 [009] ..... 28886.674981: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
  bonnie++-1793755 [009] ..... 28886.674982: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 105512960
  bonnie++-1793755 [009] ..... 28886.674983: btrfs_make_block_group: make bg offset 906166272 size 105512960 type 1
  bonnie++-1793755 [009] ...1. 28886.674984: btrfs_add_bg_to_space_info: added bg offset 906166272 length 105512960 flags 1 to space_info->flags 1 total_bytes 935985152 bytes_used 268435456 bytes_may_use 67108864

Another one, but a bit smaller (~100.6M) since we now have less space.

  bonnie++-1793758 [009] ..... 28891.962096: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
  bonnie++-1793758 [009] ..... 28891.962103: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 12582912
  bonnie++-1793758 [009] ..... 28891.962105: btrfs_make_block_group: make bg offset 1011679232 size 12582912 type 1
  bonnie++-1793758 [009] ...1. 28891.962114: btrfs_add_bg_to_space_info: added bg offset 1011679232 length 12582912 flags 1 to space_info->flags 1 total_bytes 948568064 bytes_used 268435456 bytes_may_use 8192

Another one, this one even smaller (12M).
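
The 'max_avail' values of the first eight data chunk allocations above
can be reproduced with a simple model, sketched below. The 112M cap is
just the chunk size observed in the trace, and both helper names are
hypothetical. The model treats 'max_avail' as a single counter, so it
does not account for the final 12M allocation:

```c
#include <assert.h>
#include <stdint.h>

/* 112M, the data chunk size seen in the trace (assumed to be a cap). */
#define DATA_CHUNK_CAP 117440512ULL

/* Size of the next data chunk: the cap, or whatever space is left. */
static uint64_t next_data_chunk(uint64_t max_avail)
{
	return max_avail < DATA_CHUNK_CAP ? max_avail : DATA_CHUNK_CAP;
}

/*
 * Keep allocating data chunks until max_avail reaches zero. Returns the
 * number of chunks allocated and stores the size of the last one.
 */
static unsigned int fill_with_data_chunks(uint64_t max_avail,
					   uint64_t *last_size)
{
	unsigned int nr_chunks = 0;

	assert(last_size);
	*last_size = 0;
	while (max_avail > 0) {
		*last_size = next_data_chunk(max_avail);
		max_avail -= *last_size;
		nr_chunks++;
	}
	return nr_chunks;
}
```

Starting from the first 'max_avail' value in the trace (927596544), this
gives seven chunks of 112M followed by one of 105512960 bytes, matching
the block group sizes reported by btrfs_make_block_group above.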

   kworker/u48:5-1792004 [011] ..... 28892.112802: btrfs_chunk_alloc: enter first metadata chunk alloc attempt
   kworker/u48:5-1792004 [011] ..... 28892.112805: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 131072 dev_extent_want 536870912
   kworker/u48:5-1792004 [011] ..... 28892.112806: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 131072 dev_extent_want 536870912 max_avail 0

536870912 is 512M, the standard 256M metadata chunk size times 2 because
of the DUP profile for metadata.
'max_avail' is what find_free_dev_extent() returns to us in
gather_device_info().

As a result, gather_device_info() sets ctl->ndevs to 0, making
decide_stripe_size() fail with -ENOSPC, and therefore metadata chunk
allocation fails while we are attempting to run delayed items during
the transaction commit.

   kworker/u48:5-1792004 [011] ..... 28892.112807: btrfs_create_chunk: decide_stripe_size fail -ENOSPC
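
The failure path just described can be modelled roughly as follows. This
is only a sketch: the function names are hypothetical and the real
gather_device_info() and decide_stripe_size() do a lot more, here we
only count the devices whose largest free extent is at least
dev_extent_min and fail when there are none:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define METADATA_CHUNK_SIZE (256ULL << 20) /* 256M, as stated above */
#define DUP_COPIES 2                       /* DUP keeps two copies */

/* Device extent size wanted for a DUP metadata chunk (512M). */
static uint64_t dup_dev_extent_want(void)
{
	return METADATA_CHUNK_SIZE * DUP_COPIES;
}

/* Count the devices whose max_avail satisfies dev_extent_min. */
static int usable_devices(const uint64_t *max_avail, int nr_devices,
			  uint64_t dev_extent_min)
{
	int ndevs = 0;

	assert(nr_devices >= 0);
	for (int i = 0; i < nr_devices; i++) {
		if (max_avail[i] >= dev_extent_min)
			ndevs++;
	}
	return ndevs;
}

/* Chunk allocation fails with -ENOSPC when no device is usable. */
static int try_metadata_chunk(const uint64_t *max_avail, int nr_devices,
			      uint64_t dev_extent_min)
{
	if (usable_devices(max_avail, nr_devices, dev_extent_min) == 0)
		return -ENOSPC;
	return 0;
}
```

With the values from the trace (a single device, 'max_avail' 0 and
dev_extent_min 131072) the model returns -ENOSPC.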

In the syslog/dmesg pasted above, which happened after the transaction was
aborted, the space info dumps did not account for all these data block
groups that were allocated during bonnie++'s workload. And that is because
after the metadata chunk allocation failed with -ENOSPC and before the
transaction abort happened, most of the data block groups had become empty
and got deleted by the cleaner kthread - when the abort happened, we
had bonnie++ in the middle of deleting the files it created.

But if we dump the space infos right after the metadata chunk allocation
fails, by adding a call to btrfs_dump_space_info_for_trans_abort() in
decide_stripe_size() when it returns -ENOSPC, we get:

  [29972.409295] BTRFS info (device nullb0): dumping space info:
  [29972.409300] BTRFS info (device nullb0): space_info DATA (sub-group id 0) has 673341440 free, is not full
  [29972.409303] BTRFS info (device nullb0): space_info total=948568064, used=0, pinned=275226624, reserved=0, may_use=0, readonly=0 zone_unusable=0
  [29972.409305] BTRFS info (device nullb0): space_info METADATA (sub-group id 0) has 3915776 free, is not full
  [29972.409306] BTRFS info (device nullb0): space_info total=53673984, used=163840, pinned=42827776, reserved=147456, may_use=6553600, readonly=65536 zone_unusable=0
  [29972.409308] BTRFS info (device nullb0): space_info SYSTEM (sub-group id 0) has 7979008 free, is not full
  [29972.409310] BTRFS info (device nullb0): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=393216, readonly=0 zone_unusable=0
  [29972.409311] BTRFS info (device nullb0): global_block_rsv: size 5767168 reserved 5767168
  [29972.409313] BTRFS info (device nullb0): trans_block_rsv: size 0 reserved 0
  [29972.409314] BTRFS info (device nullb0): chunk_block_rsv: size 393216 reserved 393216
  [29972.409315] BTRFS info (device nullb0): remap_block_rsv: size 0 reserved 0
  [29972.409316] BTRFS info (device nullb0): delayed_block_rsv: size 0 reserved 0

So here we see there's ~904.6M of data space, ~51.2M of metadata space and
8M of system space, making a total of ~963.8M.
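
These figures come directly from the 'total' values in the dump. A small
sketch of the conversion, with hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a byte count to MiB (the "M" unit used in this changelog). */
static double bytes_to_mib(uint64_t bytes)
{
	return (double)bytes / (1024.0 * 1024.0);
}

/* Sum the 'total' counters of several space infos. */
static uint64_t sum_totals(const uint64_t *totals, size_t count)
{
	uint64_t sum = 0;

	assert(totals);
	for (size_t i = 0; i < count; i++)
		sum += totals[i];
	return sum;
}

/* 'total' of DATA, METADATA and SYSTEM, copied from the dump above. */
static const uint64_t dump_totals[3] = {
	948568064ULL, 53673984ULL, 8388608ULL,
};
```

This gives 904.625M, 51.1875M and 8M respectively, for a total of
963.8125M.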

Reported-by: Aleksandar Gerasimovski <Aleksandar.Gerasimovski@belden.com>
Link: https://lore.kernel.org/linux-btrfs/SA1PR18MB56922F690C5EC2D85371408B998FA@SA1PR18MB5692.namprd18.prod.outlook.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H61vZ3_+eqJ1A9po2WcgNJJjUu9MJQoYB2oDSAAecHaug@mail.gmail.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/space-info.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index bb5aac7ee9d2..8192edf92d26 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -489,10 +489,11 @@ static u64 calc_available_free_space(const struct btrfs_space_info *space_info,
 	/*
 	 * If we aren't flushing all things, let us overcommit up to
 	 * 1/2th of the space. If we can flush, don't let us overcommit
-	 * too much, let it overcommit up to 1/8 of the space.
+	 * too much, let it overcommit up to 1/64th of the space.
 	 */
-	if (flush == BTRFS_RESERVE_FLUSH_ALL)
-		avail >>= 3;
+	if (flush == BTRFS_RESERVE_FLUSH_ALL ||
+	    flush == BTRFS_RESERVE_FLUSH_ALL_STEAL)
+		avail >>= 6;
 	else
 		avail >>= 1;
 
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 2/3] btrfs: don't allow log trees to consume global reserve or overcommit metadata
  2026-02-03 23:38 ` [PATCH v2 0/3] btrfs: a few space reservation fixes and comment update fdmanana
  2026-02-03 23:38   ` [PATCH v2 1/3] btrfs: be less agressive with metadata overcommit when we can do full flushing fdmanana
@ 2026-02-03 23:39   ` fdmanana
  2026-02-03 23:39   ` [PATCH v2 3/3] btrfs: update comment for BTRFS_RESERVE_NO_FLUSH fdmanana
  2 siblings, 0 replies; 16+ messages in thread
From: fdmanana @ 2026-02-03 23:39 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

For a fsync we never reserve space in advance, we just start a transaction
without reserving space and we use an empty block reserve for a log tree.
We reserve space as needed while updating a log tree. When reserving space
for the allocation of a log tree extent buffer we end up in
btrfs_use_block_rsv(), where we first attempt to reserve without flushing,
and if that fails we attempt to consume from the global reserve or
overcommit metadata. This makes us consume space that may be the last
resort for a transaction commit to succeed, therefore increasing the
chances for a transaction abort with -ENOSPC.

So make btrfs_use_block_rsv() fail if we can't reserve metadata space for
a log tree extent buffer allocation without flushing, making the fsync
fall back to a transaction commit and avoid using critical space that
could be the only resort for a transaction commit to succeed when we are
in a critical space situation.
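
The resulting order of attempts can be sketched as below. This is a
simplified model and not the real btrfs_use_block_rsv(): all the type
and function names are hypothetical, and the outcome of each attempt is
passed in as a flag instead of being computed:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Where the space for the new extent buffer ends up coming from. */
enum rsv_source { RSV_NO_FLUSH, RSV_GLOBAL, RSV_EMERGENCY, RSV_FAILED };

/* Hypothetical outcome of each reservation attempt. */
struct rsv_outcomes {
	bool no_flush_ok;  /* reserving without flushing succeeds */
	bool global_ok;    /* the global reserve has enough space */
	bool emergency_ok; /* an emergency flush reservation succeeds */
};

static enum rsv_source pick_rsv_source(const struct rsv_outcomes *o,
				       bool is_log_tree, int *err)
{
	assert(o && err);
	*err = 0;

	if (o->no_flush_ok)
		return RSV_NO_FLUSH;

	/* New with this patch: log trees stop here, fsync falls back. */
	if (is_log_tree) {
		*err = -ENOSPC;
		return RSV_FAILED;
	}

	if (o->global_ok)
		return RSV_GLOBAL;
	if (o->emergency_ok)
		return RSV_EMERGENCY;

	*err = -ENOSPC;
	return RSV_FAILED;
}
```

In this model a log tree update can only succeed through the first
attempt, while any other tree keeps the two fallbacks it had before.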

Reviewed-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/block-rsv.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
index e823230c09b7..fe81d9e9f08c 100644
--- a/fs/btrfs/block-rsv.c
+++ b/fs/btrfs/block-rsv.c
@@ -540,6 +540,31 @@ struct btrfs_block_rsv *btrfs_use_block_rsv(struct btrfs_trans_handle *trans,
 					   BTRFS_RESERVE_NO_FLUSH);
 	if (!ret)
 		return block_rsv;
+
+	/*
+	 * If we are being used for updating a log tree, fail immediately, which
+	 * makes the fsync fallback to a transaction commit.
+	 *
+	 * We don't want to consume from the global block reserve, as that is
+	 * precious space that may be needed to do updates to some trees for
+	 * which we don't reserve space during a transaction commit (update root
+	 * items in the root tree, device stat items in the device tree and
+	 * quota tree updates, see btrfs_init_root_block_rsv()), or to fallback
+	 * to in case we did not reserve enough space to run delayed items,
+	 * delayed references, or anything else we need in order to avoid a
+	 * transaction abort.
+	 *
+	 * We also don't want to do a reservation in flush emergency mode, as
+	 * we end up using metadata that could be critical to allow a
+	 * transaction to complete successfully and therefore increase the
+	 * chances for a transaction abort.
+	 *
+	 * Log trees are an optimization and should never consume from the
+	 * global reserve or be allowed overcommitting metadata.
+	 */
+	if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID)
+		return ERR_PTR(ret);
+
 	/*
 	 * If we couldn't reserve metadata bytes try and use some from
 	 * the global reserve if its space type is the same as the global
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 3/3] btrfs: update comment for BTRFS_RESERVE_NO_FLUSH
  2026-02-03 23:38 ` [PATCH v2 0/3] btrfs: a few space reservation fixes and comment update fdmanana
  2026-02-03 23:38   ` [PATCH v2 1/3] btrfs: be less agressive with metadata overcommit when we can do full flushing fdmanana
  2026-02-03 23:39   ` [PATCH v2 2/3] btrfs: don't allow log trees to consume global reserve or overcommit metadata fdmanana
@ 2026-02-03 23:39   ` fdmanana
  2 siblings, 0 replies; 16+ messages in thread
From: fdmanana @ 2026-02-03 23:39 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

The comment is incomplete as BTRFS_RESERVE_NO_FLUSH is used for more
reasons than currently holding a transaction handle open. Update the
comment with all the other reasons and give some details.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/space-info.h | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index 0703f24b23f7..6f96cf48d7da 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -21,7 +21,24 @@ struct btrfs_block_group;
  * The higher the level, the more methods we try to reclaim space.
  */
 enum btrfs_reserve_flush_enum {
-	/* If we are in the transaction, we can't flush anything.*/
+	/*
+	 * Used when we can't flush or don't need:
+	 *
+	 * 1) We are holding a transaction handle open, so we can't flush as
+	 *    that could deadlock.
+	 *
+	 * 2) For a nowait write we don't want to block when reserving delalloc.
+	 *
+	 * 3) Joining a transaction or attaching a transaction, we don't want
+	 *    to wait and we don't need to reserve anything (any needed space
+	 *    was reserved before in a dedicated block reserve, or we rely on
+	 *    the global block reserve, see btrfs_init_root_block_rsv()).
+	 *
+	 * 4) Starting a transaction when we don't need to reserve space, as
+	 *    we don't need it because we previously reserved in a dedicated
+	 *    block reserve or rely on the global block reserve, like the above
+	 *    case.
+	 */
 	BTRFS_RESERVE_NO_FLUSH,
 
 	/*
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread
