* [PATCH v3 1/4] btrfs: reserve space for delayed_refs in delalloc
2026-04-07 19:30 [PATCH v3 0/4] btrfs: improve stalls under sudden writeback Boris Burkov
@ 2026-04-07 19:30 ` Boris Burkov
2026-04-08 14:56 ` Filipe Manana
2026-04-07 19:30 ` [PATCH v3 2/4] btrfs: account for compression in delalloc extent reservation Boris Burkov
` (2 subsequent siblings)
3 siblings, 1 reply; 11+ messages in thread
From: Boris Burkov @ 2026-04-07 19:30 UTC (permalink / raw)
To: linux-btrfs, kernel-team
delalloc uses a per-inode block_rsv to perform metadata reservations for
the cow operations it anticipates based on the number of outstanding
extents. This calculation is done based on inode->outstanding_extents in
btrfs_calculate_inode_block_rsv_size(). The reservation is *not*
meticulously tracked as each ordered_extent is actually created in
writeback, but rather delalloc attempts to over-estimate and the
writeback and ordered_extent finish portions are responsible for
releasing all the reservation.
However, there is a notable gap in this reservation: it reserves no
space for the resulting delayed_refs. This is a marked difference from
how btrfs_start_transaction() reservations work.
As writeback actually occurs, and we trigger btrfs_finish_one_ordered(),
that function will start generating delayed refs, which will draw from
the trans_handle's delayed_refs_rsv via btrfs_update_delayed_refs_rsv():
For example, we can trace the primary data delayed ref:
btrfs_finish_one_ordered()
insert_ordered_extent_file_extent()
insert_reserved_file_extent()
btrfs_alloc_reserved_file_extent()
btrfs_add_delayed_data_ref()
add_delayed_ref()
btrfs_update_delayed_refs_rsv();
This trans_handle was created in finish_one_ordered() with
btrfs_join_transaction(), which calls start_transaction() with
num_items=0 and BTRFS_RESERVE_NO_FLUSH. As a result, this trans_handle
has nothing reserved in h->delayed_rsv, since neither the num_items
reservation nor the btrfs_delayed_refs_rsv_refill() reservation is run.
Thus, when btrfs_update_delayed_refs_rsv() runs, reserved_bytes is 0 and
fs_info->delayed_rsv->size grows but not fs_info->delayed_rsv->reserved.
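To make that size-vs-reserved skew concrete, here is a toy model in plain C (the struct, function, and byte counts are illustrative simplifications, not kernel code) of what effectively happens when a ref is added under a handle that brought no reservation:

```c
#include <assert.h>

/* Toy stand-in for struct btrfs_block_rsv: only the two counters that
 * matter here (a hypothetical simplification, not the kernel struct). */
struct rsv {
	unsigned long long size;     /* bytes we will eventually need */
	unsigned long long reserved; /* bytes actually set aside */
};

/* Sketch of the effect of btrfs_update_delayed_refs_rsv() when the trans
 * handle came from btrfs_join_transaction() (num_items=0,
 * BTRFS_RESERVE_NO_FLUSH): the obligation grows, but nothing migrates
 * into ->reserved because the handle has no delayed_refs bytes to give. */
static void add_ref_without_reservation(struct rsv *delayed_rsv,
					unsigned long long ref_bytes)
{
	delayed_rsv->size += ref_bytes;
	/* delayed_rsv->reserved intentionally untouched */
}
```

After adding a single 4 KiB ref this way, size is 4096 while reserved stays 0, which is exactly the divergence described above.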
If a large amount of writeback happens all at once (perhaps due to
dirty_ratio being tuned too high), this results in, among other things,
erroneous assessments of the amount of delayed_refs space reserved in
the metadata space reclaim logic, like need_preemptive_reclaim(), which
relies on fs_info->delayed_rsv->reserved. Even worse, it leads to poor
decision making in btrfs_preempt_reclaim_metadata_space(), which counts
delalloc_bytes like so:
block_rsv_size = global_rsv_size +
btrfs_block_rsv_reserved(delayed_block_rsv) +
btrfs_block_rsv_reserved(delayed_refs_rsv) +
btrfs_block_rsv_reserved(trans_rsv);
delalloc_size = bytes_may_use - block_rsv_size;
So all that lost delayed refs usage gets accounted as delalloc_size and
leads to preemptive reclaim continuously choosing FLUSH_DELALLOC, which
further exacerbates the problem.
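A quick arithmetic sketch of that misattribution, with invented numbers (the function mirrors the delalloc_size subtraction quoted above; none of this is kernel code):

```c
#include <assert.h>

/* Mirrors the delalloc_size computation quoted above: bytes_may_use
 * covers both the real delalloc reservations and the delayed refs usage
 * that never landed in any ->reserved counter, so the subtraction blames
 * the whole gap on delalloc. All values are made up for illustration. */
static unsigned long long estimated_delalloc(unsigned long long bytes_may_use,
					     unsigned long long block_rsv_reserved)
{
	return bytes_may_use - block_rsv_reserved;
}
```

For example, with 8 MiB of real delalloc and a 2 MiB delayed refs obligation that never made it into any block_rsv's reserved counter, the estimate comes out at 10 MiB of "delalloc", steering preemptive reclaim toward FLUSH_DELALLOC even though flushing delalloc cannot recover the 2 MiB skew.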
With enough writeback around, we can run enough delalloc that we get
into async reclaim, which starts blocking start_transaction() and
eventually hits FLUSH_DELALLOC_WAIT/FLUSH_DELALLOC_FULL, at which point
the filesystem gets heavily blocked on metadata space in reserve_space(),
blocking all new transaction work until all the ordered_extents finish.
If we had an accurate view of the reservation for delayed refs, then we
could mostly break this feedback loop in preemptive reclaim, and
generally would be able to make more accurate decisions with regard to
metadata space reclamation.
This patch adds extra metadata reservation to the inode's block_rsv to
account for the delayed refs. When the ordered_extent finishes and we
are about to do work in the transaction that uses delayed refs, we
migrate enough for 1 extent. Since this is not necessarily perfect, we
have to be careful and do a "soft" migrate which succeeds even if there
is not enough reservation. This is strictly better than what we have and
also matches how the delayed ref rsv gets used in the transaction at
btrfs_update_delayed_refs_rsv().
Aside from this data delayed_ref, there are also some metadata
delayed_refs to consider. These are:
- subvolume tree for the file extent item
- csum tree for data csums
- raid stripe tree if enabled
- free space tree if enabled
So account for those delayed_refs in the reservation as well. This
greatly increases the size of the reservation as each metadata cow
results in two delayed refs: one add for the new block in
btrfs_alloc_tree_block() and one drop for the old in
btrfs_free_tree_block(). As a result, to be completely conservative,
we need to reserve 2 delayed refs worth of space for each cow.
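The per-extent factor this reasoning produces can be sketched as standalone C (a rewrite of the patch's delalloc_calc_delayed_refs_rsv() factor logic, with the inode/fs_info checks passed in as flags; note that the review discussion later in the thread questions whether the extent tree should count 2 rather than 1, which would bump the base):

```c
#include <assert.h>

/* Delayed-ref reservations per extent, following the v3 patch: each
 * conditional tree costs one cow, and each cow generates two delayed
 * refs (an add for the new block, a drop for the old one). The flags
 * stand in for the inode flag / mount option / incompat checks done in
 * the real function. */
static int delayed_refs_factor(int datasum, int free_space_tree,
			       int raid_stripe_tree)
{
	int factor = 3; /* extent tree data ref (1) + subvolume tree cow (2) */

	if (datasum)
		factor += 2; /* csum tree cow */
	if (free_space_tree)
		factor += 2;
	if (raid_stripe_tree)
		factor += 2;
	return factor;
}
```

With a typical configuration (data csums on, free space tree on, no raid stripe tree) the factor comes out to 7 insert-metadata-size units per extent.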
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/delalloc-space.c | 56 +++++++++++++++++++++++++++++++++++++++
fs/btrfs/delalloc-space.h | 3 +++
fs/btrfs/inode.c | 2 ++
fs/btrfs/transaction.c | 36 +++++++++++--------------
4 files changed, 76 insertions(+), 21 deletions(-)
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index 0970799d0aa4..c99f516ce8af 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -1,13 +1,16 @@
// SPDX-License-Identifier: GPL-2.0
+#include "linux/btrfs_tree.h"
#include "messages.h"
#include "ctree.h"
#include "delalloc-space.h"
+#include "delayed-ref.h"
#include "block-rsv.h"
#include "btrfs_inode.h"
#include "space-info.h"
#include "qgroup.h"
#include "fs.h"
+#include "transaction.h"
/*
* HOW DOES THIS WORK
@@ -247,6 +250,38 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
qgroup_to_release);
}
+/*
+ * Each delalloc extent could become an ordered_extent and end up inserting a
+ * new data extent and modify a number of btrees. Each of those is associated with
+ * adding delayed refs which need a corresponding delayed refs reservation.
+ *
+ * Each metadata cow operation results in an add and a drop delayed ref, both of
+ * which call add_delayed_ref() and ultimately btrfs_update_delayed_refs_rsv(),
+ * so each must account for 2 delayed refs.
+ */
+static u64 delalloc_calc_delayed_refs_rsv(const struct btrfs_inode *inode, u64 nr_extents)
+{
+ const struct btrfs_fs_info *fs_info = inode->root->fs_info;
+
+ /*
+ * Factor for how many delayed refs updates we will generate per extent.
+ * Non-optional:
+ * extent tree: 1 data delayed ref.
+ * subvolume tree: 1 cow => 2 metadata delayed refs.
+ */
+ int factor = 3;
+
+ /* The remaining trees are only written to conditionally. */
+ if (!(inode->flags & BTRFS_INODE_NODATASUM))
+ factor += 2;
+ if (btrfs_test_opt(fs_info, FREE_SPACE_TREE))
+ factor += 2;
+ if (btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE))
+ factor += 2;
+
+ return btrfs_calc_insert_metadata_size(fs_info, nr_extents) * factor;
+}
+
static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
struct btrfs_inode *inode)
{
@@ -266,6 +301,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
reserve_size = btrfs_calc_insert_metadata_size(fs_info,
outstanding_extents);
reserve_size += btrfs_calc_metadata_size(fs_info, 1);
+ reserve_size += delalloc_calc_delayed_refs_rsv(inode, outstanding_extents);
}
if (!(inode->flags & BTRFS_INODE_NODATASUM)) {
u64 csum_leaves;
@@ -309,9 +345,29 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
* for an inode update.
*/
*meta_reserve += inode_update;
+
+ *meta_reserve += delalloc_calc_delayed_refs_rsv(inode, nr_extents);
+
*qgroup_reserve = nr_extents * fs_info->nodesize;
}
+void btrfs_delalloc_migrate_delayed_refs_rsv(struct btrfs_trans_handle *trans,
+ struct btrfs_inode *inode)
+{
+ struct btrfs_block_rsv *inode_rsv = &inode->block_rsv;
+ struct btrfs_block_rsv *trans_rsv = &trans->delayed_rsv;
+ u64 num_bytes = delalloc_calc_delayed_refs_rsv(inode, 1);
+
+ spin_lock(&inode_rsv->lock);
+ num_bytes = min(num_bytes, inode_rsv->reserved);
+ inode_rsv->reserved -= num_bytes;
+ inode_rsv->full = (inode_rsv->reserved >= inode_rsv->size);
+ spin_unlock(&inode_rsv->lock);
+
+ btrfs_block_rsv_add_bytes(trans_rsv, num_bytes, true);
+ trans->delayed_refs_bytes_reserved += num_bytes;
+}
+
int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
u64 disk_num_bytes, bool noflush)
{
diff --git a/fs/btrfs/delalloc-space.h b/fs/btrfs/delalloc-space.h
index 6119c0d3f883..bd7041166987 100644
--- a/fs/btrfs/delalloc-space.h
+++ b/fs/btrfs/delalloc-space.h
@@ -8,6 +8,7 @@
struct extent_changeset;
struct btrfs_inode;
struct btrfs_fs_info;
+struct btrfs_trans_handle;
int btrfs_alloc_data_chunk_ondemand(const struct btrfs_inode *inode, u64 bytes);
int btrfs_check_data_free_space(struct btrfs_inode *inode,
@@ -27,5 +28,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
u64 disk_num_bytes, bool noflush);
void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len);
+void btrfs_delalloc_migrate_delayed_refs_rsv(struct btrfs_trans_handle *trans,
+ struct btrfs_inode *inode);
#endif /* BTRFS_DELALLOC_SPACE_H */
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 40474014c03f..15945744a304 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -653,6 +653,7 @@ static noinline int __cow_file_range_inline(struct btrfs_inode *inode,
goto out;
}
trans->block_rsv = &inode->block_rsv;
+ btrfs_delalloc_migrate_delayed_refs_rsv(trans, inode);
drop_args.path = path;
drop_args.start = 0;
@@ -3259,6 +3260,7 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
}
trans->block_rsv = &inode->block_rsv;
+ btrfs_delalloc_migrate_delayed_refs_rsv(trans, inode);
ret = btrfs_insert_raid_extent(trans, ordered_extent);
if (unlikely(ret)) {
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 248adb785051..55791bb100a2 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1047,29 +1047,23 @@ static void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans)
return;
}
- if (!trans->bytes_reserved) {
- ASSERT(trans->delayed_refs_bytes_reserved == 0,
- "trans->delayed_refs_bytes_reserved=%llu",
- trans->delayed_refs_bytes_reserved);
- return;
+ if (trans->bytes_reserved) {
+ ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
+ trace_btrfs_space_reservation(fs_info, "transaction",
+ trans->transid, trans->bytes_reserved, 0);
+ btrfs_block_rsv_release(fs_info, trans->block_rsv,
+ trans->bytes_reserved, NULL);
+ trans->bytes_reserved = 0;
}
- ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
- trace_btrfs_space_reservation(fs_info, "transaction",
- trans->transid, trans->bytes_reserved, 0);
- btrfs_block_rsv_release(fs_info, trans->block_rsv,
- trans->bytes_reserved, NULL);
- trans->bytes_reserved = 0;
-
- if (!trans->delayed_refs_bytes_reserved)
- return;
-
- trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
- trans->transid,
- trans->delayed_refs_bytes_reserved, 0);
- btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
- trans->delayed_refs_bytes_reserved, NULL);
- trans->delayed_refs_bytes_reserved = 0;
+ if (trans->delayed_refs_bytes_reserved) {
+ trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
+ trans->transid,
+ trans->delayed_refs_bytes_reserved, 0);
+ btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
+ trans->delayed_refs_bytes_reserved, NULL);
+ trans->delayed_refs_bytes_reserved = 0;
+ }
}
static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
--
2.53.0
^ permalink raw reply related [flat|nested] 11+ messages in thread

* Re: [PATCH v3 1/4] btrfs: reserve space for delayed_refs in delalloc
2026-04-07 19:30 ` [PATCH v3 1/4] btrfs: reserve space for delayed_refs in delalloc Boris Burkov
@ 2026-04-08 14:56 ` Filipe Manana
2026-04-08 17:34 ` Boris Burkov
0 siblings, 1 reply; 11+ messages in thread
From: Filipe Manana @ 2026-04-08 14:56 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs, kernel-team
On Tue, Apr 7, 2026 at 8:31 PM Boris Burkov <boris@bur.io> wrote:
>
> delalloc uses a per-inode block_rsv to perform metadata reservations for
> the cow operations it anticipates based on the number of outstanding
> extents. This calculation is done based on inode->outstanding_extents in
> btrfs_calculate_inode_block_rsv_size(). The reservation is *not*
> meticulously tracked as each ordered_extent is actually created in
> writeback, but rather delalloc attempts to over-estimate and the
> writeback and ordered_extent finish portions are responsible to release
> all the reservation.
>
> However, there is a notable gap in this reservation, it reserves no
> space for the resulting delayed_refs. If you compare to how
> btrfs_start_transaction() reservations work, this is a noteable
> difference.
>
> As writeback actually occurs, and we trigger btrfs_finish_one_ordered(),
> that function will start generating delayed refs, which will draw from
> the trans_handle's delayed_refs_rsv via btrfs_update_delayed_refs_rsv():
>
> For example, we can trace the primary data delayed ref:
>
> btrfs_finish_one_ordered()
> insert_ordered_extent_file_extent()
> insert_reserved_file_extent()
> btrfs_alloc_reserved_file_extent()
> btrfs_add_delayed_data_ref()
> add_delayed_ref()
> btrfs_update_delayed_refs_rsv();
>
> This trans_handle was created in finish_one_ordered() with
> btrfs_join_transaction() which calls start_transaction with
> num_items=0 and BTRFS_RESERVE_NO_FLUSH. As a result, this trans_handle
> has no reserved in h->delayed_rsv, as neither the num_items reservation
> nor the btrfs_delayed_refs_rsv_refill() reservation is run.
>
> Thus, when btrfs_update_delayed_refs_rsv() runs, reserved_bytes is 0 and
> fs_info->delayed_rsv->size grows but not fs_info->delayed_rsv->reserved.
>
> If a large amount of writeback happens all at once (perhaps due to
> dirty_ratio being tuned too high), this results in, among other things,
> erroneous assessments of the amount of delayed_refs reserved in the
> metadata space reclaim logic, like need_preemptive_reclaim() which
> relies on fs_info->delayed_rsv->reserved and even worse, poor decision
> making in btrfs_preempt_reclaim_metadata_space() which counts
> delalloc_bytes like so:
>
> block_rsv_size = global_rsv_size +
> btrfs_block_rsv_reserved(delayed_block_rsv) +
> btrfs_block_rsv_reserved(delayed_refs_rsv) +
> btrfs_block_rsv_reserved(trans_rsv);
> delalloc_size = bytes_may_use - block_rsv_size;
>
> So all that lost delayed refs usage gets accounted as delalloc_size and
> leads to preemptive reclaim continuously choosing FLUSH_DELALLOC, which
> further exacerbates the problem.
>
> With enough writeback around, we can run enough delalloc that we get
> into async reclaim which starts blocking start_transaction() and
> eventually hits FLUSH_DELALLOC_WAIT/FLUSH_DELALLOC_FULL at which point
> the filesystem gets heavily blocked on metadata space in reserve_space(),
> blocking all new transaction work until all the ordered_extents finish.
>
> If we had an accurate view of the reservation for delayed refs, then we
> could mostly break this feedback loop in preemptive reclaim, and
> generally would be able to make more accurate decisions with regards to
> metadata space reclamation.
>
> This patch adds extra metadata reservation to the inode's block_rsv to
> account for the delayed refs. When the ordered_extent finishes and we
> are about to do work in the transaction that uses delayed refs, we
> migrate enough for 1 extent. Since this is not necessarily perfect, we
> have to be careful and do a "soft" migrate which succeeds even if there
> is not enough reservation. This is strictly better than what we have and
> also matches how the delayed ref rsv gets used in the transaction at
> btrfs_update_delayed_refs_rsv().
>
> Aside from this data delayed_ref, there are also some metadata
> delayed_refs to consider. These are:
> - subvolume tree for the file extent item
> - csum tree for data csums
> - raid stripe tree if enabled
> - free space tree if enabled
>
> So account for those delayed_refs in the reservation as well. This
> greatly increases the size of the reservation as each metadata cow
> results in two delayed refs: one add for the new block in
> btrfs_alloc_tree_block() and one drop for the old in
> btrfs_free_tree_block(). As a result, to be completely conservative,
> we need to reserve 2 delayed refs worth of space for each cow.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
> fs/btrfs/delalloc-space.c | 56 +++++++++++++++++++++++++++++++++++++++
> fs/btrfs/delalloc-space.h | 3 +++
> fs/btrfs/inode.c | 2 ++
> fs/btrfs/transaction.c | 36 +++++++++++--------------
> 4 files changed, 76 insertions(+), 21 deletions(-)
>
> diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> index 0970799d0aa4..c99f516ce8af 100644
> --- a/fs/btrfs/delalloc-space.c
> +++ b/fs/btrfs/delalloc-space.c
> @@ -1,13 +1,16 @@
> // SPDX-License-Identifier: GPL-2.0
>
> +#include "linux/btrfs_tree.h"
Not needed.
> #include "messages.h"
> #include "ctree.h"
> #include "delalloc-space.h"
> +#include "delayed-ref.h"
> #include "block-rsv.h"
> #include "btrfs_inode.h"
> #include "space-info.h"
> #include "qgroup.h"
> #include "fs.h"
> +#include "transaction.h"
>
> /*
> * HOW DOES THIS WORK
> @@ -247,6 +250,38 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
> qgroup_to_release);
> }
>
> +/*
> + * Each delalloc extent could become an ordered_extent and end up inserting a
> + * new data extent and modify a number of btrees. Each of those is associated with
> + * adding delayed refs which need a corresponding delayed refs reservation.
> + *
> + * Each metadata cow operation results in an add and a drop delayed ref, both of
> + * which call add_delayed_ref() and ultimately btrfs_update_delayed_refs_rsv(),
> + * so each must account for 2 delayed refs.
> + */
> +static u64 delalloc_calc_delayed_refs_rsv(const struct btrfs_inode *inode, u64 nr_extents)
> +{
> + const struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +
> + /*
> + * Factor for how many delayed refs updates we will generate per extent.
> + * Non-optional:
> + * extent tree: 1 data delayed ref.
Running the delayed data ref to insert the extent item in the extent
tree requires COWing an extent tree leaf, which creates a delayed tree
ref for deletion too.
So, 2.
Otherwise it looks good, thanks.
> + * subvolume tree: 1 cow => 2 metadata delayed refs.
> + */
> + int factor = 3;
> +
> + /* The remaining trees are only written to conditionally. */
> + if (!(inode->flags & BTRFS_INODE_NODATASUM))
> + factor += 2;
> + if (btrfs_test_opt(fs_info, FREE_SPACE_TREE))
> + factor += 2;
> + if (btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE))
> + factor += 2;
> +
> + return btrfs_calc_insert_metadata_size(fs_info, nr_extents) * factor;
> +}
> +
> static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> struct btrfs_inode *inode)
> {
> @@ -266,6 +301,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> reserve_size = btrfs_calc_insert_metadata_size(fs_info,
> outstanding_extents);
> reserve_size += btrfs_calc_metadata_size(fs_info, 1);
> + reserve_size += delalloc_calc_delayed_refs_rsv(inode, outstanding_extents);
> }
> if (!(inode->flags & BTRFS_INODE_NODATASUM)) {
> u64 csum_leaves;
> @@ -309,9 +345,29 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
> * for an inode update.
> */
> *meta_reserve += inode_update;
> +
> + *meta_reserve += delalloc_calc_delayed_refs_rsv(inode, nr_extents);
> +
> *qgroup_reserve = nr_extents * fs_info->nodesize;
> }
>
> +void btrfs_delalloc_migrate_delayed_refs_rsv(struct btrfs_trans_handle *trans,
> + struct btrfs_inode *inode)
> +{
> + struct btrfs_block_rsv *inode_rsv = &inode->block_rsv;
> + struct btrfs_block_rsv *trans_rsv = &trans->delayed_rsv;
> + u64 num_bytes = delalloc_calc_delayed_refs_rsv(inode, 1);
> +
> + spin_lock(&inode_rsv->lock);
> + num_bytes = min(num_bytes, inode_rsv->reserved);
> + inode_rsv->reserved -= num_bytes;
> + inode_rsv->full = (inode_rsv->reserved >= inode_rsv->size);
> + spin_unlock(&inode_rsv->lock);
> +
> + btrfs_block_rsv_add_bytes(trans_rsv, num_bytes, true);
> + trans->delayed_refs_bytes_reserved += num_bytes;
> +}
> +
> int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> u64 disk_num_bytes, bool noflush)
> {
> diff --git a/fs/btrfs/delalloc-space.h b/fs/btrfs/delalloc-space.h
> index 6119c0d3f883..bd7041166987 100644
> --- a/fs/btrfs/delalloc-space.h
> +++ b/fs/btrfs/delalloc-space.h
> @@ -8,6 +8,7 @@
> struct extent_changeset;
> struct btrfs_inode;
> struct btrfs_fs_info;
> +struct btrfs_trans_handle;
>
> int btrfs_alloc_data_chunk_ondemand(const struct btrfs_inode *inode, u64 bytes);
> int btrfs_check_data_free_space(struct btrfs_inode *inode,
> @@ -27,5 +28,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> u64 disk_num_bytes, bool noflush);
> void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
> void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len);
> +void btrfs_delalloc_migrate_delayed_refs_rsv(struct btrfs_trans_handle *trans,
> + struct btrfs_inode *inode);
>
> #endif /* BTRFS_DELALLOC_SPACE_H */
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 40474014c03f..15945744a304 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -653,6 +653,7 @@ static noinline int __cow_file_range_inline(struct btrfs_inode *inode,
> goto out;
> }
> trans->block_rsv = &inode->block_rsv;
> + btrfs_delalloc_migrate_delayed_refs_rsv(trans, inode);
>
> drop_args.path = path;
> drop_args.start = 0;
> @@ -3259,6 +3260,7 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
> }
>
> trans->block_rsv = &inode->block_rsv;
> + btrfs_delalloc_migrate_delayed_refs_rsv(trans, inode);
>
> ret = btrfs_insert_raid_extent(trans, ordered_extent);
> if (unlikely(ret)) {
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 248adb785051..55791bb100a2 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1047,29 +1047,23 @@ static void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans)
> return;
> }
>
> - if (!trans->bytes_reserved) {
> - ASSERT(trans->delayed_refs_bytes_reserved == 0,
> - "trans->delayed_refs_bytes_reserved=%llu",
> - trans->delayed_refs_bytes_reserved);
> - return;
> + if (trans->bytes_reserved) {
> + ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
> + trace_btrfs_space_reservation(fs_info, "transaction",
> + trans->transid, trans->bytes_reserved, 0);
> + btrfs_block_rsv_release(fs_info, trans->block_rsv,
> + trans->bytes_reserved, NULL);
> + trans->bytes_reserved = 0;
> }
>
> - ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
> - trace_btrfs_space_reservation(fs_info, "transaction",
> - trans->transid, trans->bytes_reserved, 0);
> - btrfs_block_rsv_release(fs_info, trans->block_rsv,
> - trans->bytes_reserved, NULL);
> - trans->bytes_reserved = 0;
> -
> - if (!trans->delayed_refs_bytes_reserved)
> - return;
> -
> - trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
> - trans->transid,
> - trans->delayed_refs_bytes_reserved, 0);
> - btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
> - trans->delayed_refs_bytes_reserved, NULL);
> - trans->delayed_refs_bytes_reserved = 0;
> + if (trans->delayed_refs_bytes_reserved) {
> + trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
> + trans->transid,
> + trans->delayed_refs_bytes_reserved, 0);
> + btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
> + trans->delayed_refs_bytes_reserved, NULL);
> + trans->delayed_refs_bytes_reserved = 0;
> + }
> }
>
> static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
> --
> 2.53.0
>
>
^ permalink raw reply [flat|nested] 11+ messages in thread

* Re: [PATCH v3 1/4] btrfs: reserve space for delayed_refs in delalloc
2026-04-08 14:56 ` Filipe Manana
@ 2026-04-08 17:34 ` Boris Burkov
2026-04-10 15:27 ` Filipe Manana
0 siblings, 1 reply; 11+ messages in thread
From: Boris Burkov @ 2026-04-08 17:34 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs, kernel-team
On Wed, Apr 08, 2026 at 03:56:14PM +0100, Filipe Manana wrote:
> On Tue, Apr 7, 2026 at 8:31 PM Boris Burkov <boris@bur.io> wrote:
> >
> > delalloc uses a per-inode block_rsv to perform metadata reservations for
> > the cow operations it anticipates based on the number of outstanding
> > extents. This calculation is done based on inode->outstanding_extents in
> > btrfs_calculate_inode_block_rsv_size(). The reservation is *not*
> > meticulously tracked as each ordered_extent is actually created in
> > writeback, but rather delalloc attempts to over-estimate and the
> > writeback and ordered_extent finish portions are responsible to release
> > all the reservation.
> >
> > However, there is a notable gap in this reservation, it reserves no
> > space for the resulting delayed_refs. If you compare to how
> > btrfs_start_transaction() reservations work, this is a noteable
> > difference.
> >
> > As writeback actually occurs, and we trigger btrfs_finish_one_ordered(),
> > that function will start generating delayed refs, which will draw from
> > the trans_handle's delayed_refs_rsv via btrfs_update_delayed_refs_rsv():
> >
> > For example, we can trace the primary data delayed ref:
> >
> > btrfs_finish_one_ordered()
> > insert_ordered_extent_file_extent()
> > insert_reserved_file_extent()
> > btrfs_alloc_reserved_file_extent()
> > btrfs_add_delayed_data_ref()
> > add_delayed_ref()
> > btrfs_update_delayed_refs_rsv();
> >
> > This trans_handle was created in finish_one_ordered() with
> > btrfs_join_transaction() which calls start_transaction with
> > num_items=0 and BTRFS_RESERVE_NO_FLUSH. As a result, this trans_handle
> > has no reserved in h->delayed_rsv, as neither the num_items reservation
> > nor the btrfs_delayed_refs_rsv_refill() reservation is run.
> >
> > Thus, when btrfs_update_delayed_refs_rsv() runs, reserved_bytes is 0 and
> > fs_info->delayed_rsv->size grows but not fs_info->delayed_rsv->reserved.
> >
> > If a large amount of writeback happens all at once (perhaps due to
> > dirty_ratio being tuned too high), this results in, among other things,
> > erroneous assessments of the amount of delayed_refs reserved in the
> > metadata space reclaim logic, like need_preemptive_reclaim() which
> > relies on fs_info->delayed_rsv->reserved and even worse, poor decision
> > making in btrfs_preempt_reclaim_metadata_space() which counts
> > delalloc_bytes like so:
> >
> > block_rsv_size = global_rsv_size +
> > btrfs_block_rsv_reserved(delayed_block_rsv) +
> > btrfs_block_rsv_reserved(delayed_refs_rsv) +
> > btrfs_block_rsv_reserved(trans_rsv);
> > delalloc_size = bytes_may_use - block_rsv_size;
> >
> > So all that lost delayed refs usage gets accounted as delalloc_size and
> > leads to preemptive reclaim continuously choosing FLUSH_DELALLOC, which
> > further exacerbates the problem.
> >
> > With enough writeback around, we can run enough delalloc that we get
> > into async reclaim which starts blocking start_transaction() and
> > eventually hits FLUSH_DELALLOC_WAIT/FLUSH_DELALLOC_FULL at which point
> > the filesystem gets heavily blocked on metadata space in reserve_space(),
> > blocking all new transaction work until all the ordered_extents finish.
> >
> > If we had an accurate view of the reservation for delayed refs, then we
> > could mostly break this feedback loop in preemptive reclaim, and
> > generally would be able to make more accurate decisions with regards to
> > metadata space reclamation.
> >
> > This patch adds extra metadata reservation to the inode's block_rsv to
> > account for the delayed refs. When the ordered_extent finishes and we
> > are about to do work in the transaction that uses delayed refs, we
> > migrate enough for 1 extent. Since this is not necessarily perfect, we
> > have to be careful and do a "soft" migrate which succeeds even if there
> > is not enough reservation. This is strictly better than what we have and
> > also matches how the delayed ref rsv gets used in the transaction at
> > btrfs_update_delayed_refs_rsv().
> >
> > Aside from this data delayed_ref, there are also some metadata
> > delayed_refs to consider. These are:
> > - subvolume tree for the file extent item
> > - csum tree for data csums
> > - raid stripe tree if enabled
> > - free space tree if enabled
> >
> > So account for those delayed_refs in the reservation as well. This
> > greatly increases the size of the reservation as each metadata cow
> > results in two delayed refs: one add for the new block in
> > btrfs_alloc_tree_block() and one drop for the old in
> > btrfs_free_tree_block(). As a result, to be completely conservative,
> > we need to reserve 2 delayed refs worth of space for each cow.
> >
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> > fs/btrfs/delalloc-space.c | 56 +++++++++++++++++++++++++++++++++++++++
> > fs/btrfs/delalloc-space.h | 3 +++
> > fs/btrfs/inode.c | 2 ++
> > fs/btrfs/transaction.c | 36 +++++++++++--------------
> > 4 files changed, 76 insertions(+), 21 deletions(-)
> >
> > diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> > index 0970799d0aa4..c99f516ce8af 100644
> > --- a/fs/btrfs/delalloc-space.c
> > +++ b/fs/btrfs/delalloc-space.c
> > @@ -1,13 +1,16 @@
> > // SPDX-License-Identifier: GPL-2.0
> >
> > +#include "linux/btrfs_tree.h"
>
> Not needed.
>
> > #include "messages.h"
> > #include "ctree.h"
> > #include "delalloc-space.h"
> > +#include "delayed-ref.h"
> > #include "block-rsv.h"
> > #include "btrfs_inode.h"
> > #include "space-info.h"
> > #include "qgroup.h"
> > #include "fs.h"
> > +#include "transaction.h"
> >
> > /*
> > * HOW DOES THIS WORK
> > @@ -247,6 +250,38 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
> > qgroup_to_release);
> > }
> >
> > +/*
> > + * Each delalloc extent could become an ordered_extent and end up inserting a
> > + * new data extent and modify a number of btrees. Each of those is associated with
> > + * adding delayed refs which need a corresponding delayed refs reservation.
> > + *
> > + * Each metadata cow operation results in an add and a drop delayed ref, both of
> > + * which call add_delayed_ref() and ultimately btrfs_update_delayed_refs_rsv(),
> > + * so each must account for 2 delayed refs.
> > + */
> > +static u64 delalloc_calc_delayed_refs_rsv(const struct btrfs_inode *inode, u64 nr_extents)
> > +{
> > + const struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > +
> > + /*
> > + * Factor for how many delayed refs updates we will generate per extent.
> > + * Non-optional:
> > + * extent tree: 1 data delayed ref.
>
> Running the delayed data ref to insert the extent item in the extent
> tree requires COWing an extent tree leaf, which creates a delayed tree
> ref for deletion too.
> So, 2.
>
> Otherwise it looks good, thanks.
>
I think I might be getting confused now :)
I'll try to work it out "out loud" and hopefully I can reach a good
conclusion.
I have been trying to reserve for the calls to add_delayed_ref in the
transaction handle context. And tracing each cow in that context yields
2 delayed refs each while tracing the data extent insertion only creates
one. The reason I care particularly about this context is the
trans_handle delayed_refs rsv accounting in
btrfs_update_delayed_refs_rsv(), as described in the patch. Ultimately,
all the reservations get sunk into the global reserve from
trans->delayed_rsv at btrfs_end_transaction(). Does this all sound right
to you so far?
Cows that we clearly do from finish_one_ordered() are the ones to the
raid stripe tree, subvol tree, and csum tree. Notably, the free space
tree cow *does not* happen in this context; it happens when we run the
delayed ref, so I think I was not consistent to include that. In that
context we cow the extent and free space trees for the data extent.
But the point of reserving for a delayed ref is to account for the cows
that it will ultimately cause and reserve ahead for them. So I think it
does make sense to reserve for the fst and extent tree. On the other
hand, I think we actively *don't* want to account for the potential
infinite regress where each metadata delayed_ref run "unluckily" results
in cow-ing the extent tree again and again since it doesn't actually
happen in practice (especially with Leo's writeback inhibition and
hopefully eb overwriting soon). That series diverges, so it would require
an infinite reservation anyway.
In conclusion, I think I am agreeing and proposing that I shift the goal
from "count add_delayed_ref calls under the trans handle" to "predict
metadata reservation due to delayed refs" and in that world, it should
be 2 for the data extent in the extent tree.
The other principled option is to count 1 for the data delayed ref now
and not account for the fst or the extent tree cows that happen at
delayed_refs time, but I think that makes less sense.
Sorry if that was a lot of text to just agree with you, but I want to be
careful since it's easy to go astray and "get away with it" because of the
reservations being conservative and getting sloshed into the global
reserves.
One final note / question:
the "normal" start_transaction() reservation which computes the space
for the cow itself and then the delayed_refs cows does *not* seem to
count 2 per cow for delayed refs in btrfs_calc_delayed_ref_bytes().
But in theory the same reasoning applies that we will touch two metadata
items for each cow. I assume this is OK in practice because the depth
is generally not 8 anyway, but do you think that is something I should
clean up as well?
Thanks again,
Boris
> > + * subvolume tree: 1 cow => 2 metadata delayed refs.
> > + */
> > + int factor = 3;
> > +
> > + /* The remaining trees are only written to conditionally. */
> > + if (!(inode->flags & BTRFS_INODE_NODATASUM))
> > + factor += 2;
> > + if (btrfs_test_opt(fs_info, FREE_SPACE_TREE))
> > + factor += 2;
> > + if (btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE))
> > + factor += 2;
> > +
> > + return btrfs_calc_insert_metadata_size(fs_info, nr_extents) * factor;
> > +}
> > +
> > static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> > struct btrfs_inode *inode)
> > {
> > @@ -266,6 +301,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> > reserve_size = btrfs_calc_insert_metadata_size(fs_info,
> > outstanding_extents);
> > reserve_size += btrfs_calc_metadata_size(fs_info, 1);
> > + reserve_size += delalloc_calc_delayed_refs_rsv(inode, outstanding_extents);
> > }
> > if (!(inode->flags & BTRFS_INODE_NODATASUM)) {
> > u64 csum_leaves;
> > @@ -309,9 +345,29 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
> > * for an inode update.
> > */
> > *meta_reserve += inode_update;
> > +
> > + *meta_reserve += delalloc_calc_delayed_refs_rsv(inode, nr_extents);
> > +
> > *qgroup_reserve = nr_extents * fs_info->nodesize;
> > }
> >
> > +void btrfs_delalloc_migrate_delayed_refs_rsv(struct btrfs_trans_handle *trans,
> > + struct btrfs_inode *inode)
> > +{
> > + struct btrfs_block_rsv *inode_rsv = &inode->block_rsv;
> > + struct btrfs_block_rsv *trans_rsv = &trans->delayed_rsv;
> > + u64 num_bytes = delalloc_calc_delayed_refs_rsv(inode, 1);
> > +
> > + spin_lock(&inode_rsv->lock);
> > + num_bytes = min(num_bytes, inode_rsv->reserved);
> > + inode_rsv->reserved -= num_bytes;
> > + inode_rsv->full = (inode_rsv->reserved >= inode_rsv->size);
> > + spin_unlock(&inode_rsv->lock);
> > +
> > + btrfs_block_rsv_add_bytes(trans_rsv, num_bytes, true);
> > + trans->delayed_refs_bytes_reserved += num_bytes;
> > +}
> > +
> > int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> > u64 disk_num_bytes, bool noflush)
> > {
> > diff --git a/fs/btrfs/delalloc-space.h b/fs/btrfs/delalloc-space.h
> > index 6119c0d3f883..bd7041166987 100644
> > --- a/fs/btrfs/delalloc-space.h
> > +++ b/fs/btrfs/delalloc-space.h
> > @@ -8,6 +8,7 @@
> > struct extent_changeset;
> > struct btrfs_inode;
> > struct btrfs_fs_info;
> > +struct btrfs_trans_handle;
> >
> > int btrfs_alloc_data_chunk_ondemand(const struct btrfs_inode *inode, u64 bytes);
> > int btrfs_check_data_free_space(struct btrfs_inode *inode,
> > @@ -27,5 +28,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> > u64 disk_num_bytes, bool noflush);
> > void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
> > void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len);
> > +void btrfs_delalloc_migrate_delayed_refs_rsv(struct btrfs_trans_handle *trans,
> > + struct btrfs_inode *inode);
> >
> > #endif /* BTRFS_DELALLOC_SPACE_H */
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 40474014c03f..15945744a304 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -653,6 +653,7 @@ static noinline int __cow_file_range_inline(struct btrfs_inode *inode,
> > goto out;
> > }
> > trans->block_rsv = &inode->block_rsv;
> > + btrfs_delalloc_migrate_delayed_refs_rsv(trans, inode);
> >
> > drop_args.path = path;
> > drop_args.start = 0;
> > @@ -3259,6 +3260,7 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
> > }
> >
> > trans->block_rsv = &inode->block_rsv;
> > + btrfs_delalloc_migrate_delayed_refs_rsv(trans, inode);
> >
> > ret = btrfs_insert_raid_extent(trans, ordered_extent);
> > if (unlikely(ret)) {
> > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > index 248adb785051..55791bb100a2 100644
> > --- a/fs/btrfs/transaction.c
> > +++ b/fs/btrfs/transaction.c
> > @@ -1047,29 +1047,23 @@ static void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans)
> > return;
> > }
> >
> > - if (!trans->bytes_reserved) {
> > - ASSERT(trans->delayed_refs_bytes_reserved == 0,
> > - "trans->delayed_refs_bytes_reserved=%llu",
> > - trans->delayed_refs_bytes_reserved);
> > - return;
> > + if (trans->bytes_reserved) {
> > + ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
> > + trace_btrfs_space_reservation(fs_info, "transaction",
> > + trans->transid, trans->bytes_reserved, 0);
> > + btrfs_block_rsv_release(fs_info, trans->block_rsv,
> > + trans->bytes_reserved, NULL);
> > + trans->bytes_reserved = 0;
> > }
> >
> > - ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
> > - trace_btrfs_space_reservation(fs_info, "transaction",
> > - trans->transid, trans->bytes_reserved, 0);
> > - btrfs_block_rsv_release(fs_info, trans->block_rsv,
> > - trans->bytes_reserved, NULL);
> > - trans->bytes_reserved = 0;
> > -
> > - if (!trans->delayed_refs_bytes_reserved)
> > - return;
> > -
> > - trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
> > - trans->transid,
> > - trans->delayed_refs_bytes_reserved, 0);
> > - btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
> > - trans->delayed_refs_bytes_reserved, NULL);
> > - trans->delayed_refs_bytes_reserved = 0;
> > + if (trans->delayed_refs_bytes_reserved) {
> > + trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
> > + trans->transid,
> > + trans->delayed_refs_bytes_reserved, 0);
> > + btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
> > + trans->delayed_refs_bytes_reserved, NULL);
> > + trans->delayed_refs_bytes_reserved = 0;
> > + }
> > }
> >
> > static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
> > --
> > 2.53.0
> >
> >
* Re: [PATCH v3 1/4] btrfs: reserve space for delayed_refs in delalloc
2026-04-08 17:34 ` Boris Burkov
@ 2026-04-10 15:27 ` Filipe Manana
2026-04-10 23:06 ` Boris Burkov
0 siblings, 1 reply; 11+ messages in thread
From: Filipe Manana @ 2026-04-10 15:27 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs, kernel-team
On Wed, Apr 8, 2026 at 6:34 PM Boris Burkov <boris@bur.io> wrote:
>
> On Wed, Apr 08, 2026 at 03:56:14PM +0100, Filipe Manana wrote:
> > On Tue, Apr 7, 2026 at 8:31 PM Boris Burkov <boris@bur.io> wrote:
> > >
> > > delalloc uses a per-inode block_rsv to perform metadata reservations for
> > > the cow operations it anticipates based on the number of outstanding
> > > extents. This calculation is done based on inode->outstanding_extents in
> > > btrfs_calculate_inode_block_rsv_size(). The reservation is *not*
> > > meticulously tracked as each ordered_extent is actually created in
> > > writeback, but rather delalloc attempts to over-estimate and the
> > > writeback and ordered_extent finish portions are responsible to release
> > > all the reservation.
> > >
> > > However, there is a notable gap in this reservation: it reserves no
> > > space for the resulting delayed_refs. If you compare to how
> > > btrfs_start_transaction() reservations work, this is a notable
> > > difference.
> > >
> > > As writeback actually occurs, and we trigger btrfs_finish_one_ordered(),
> > > that function will start generating delayed refs, which will draw from
> > > the trans_handle's delayed_refs_rsv via btrfs_update_delayed_refs_rsv():
> > >
> > > For example, we can trace the primary data delayed ref:
> > >
> > > btrfs_finish_one_ordered()
> > > insert_ordered_extent_file_extent()
> > > insert_reserved_file_extent()
> > > btrfs_alloc_reserved_file_extent()
> > > btrfs_add_delayed_data_ref()
> > > add_delayed_ref()
> > > btrfs_update_delayed_refs_rsv();
> > >
> > > This trans_handle was created in finish_one_ordered() with
> > > btrfs_join_transaction() which calls start_transaction with
> > > num_items=0 and BTRFS_RESERVE_NO_FLUSH. As a result, this trans_handle
> > > has nothing reserved in h->delayed_rsv, as neither the num_items reservation
> > > nor the btrfs_delayed_refs_rsv_refill() reservation is run.
> > >
> > > Thus, when btrfs_update_delayed_refs_rsv() runs, reserved_bytes is 0 and
> > > fs_info->delayed_rsv->size grows but not fs_info->delayed_rsv->reserved.
> > >
> > > If a large amount of writeback happens all at once (perhaps due to
> > > dirty_ratio being tuned too high), this results in, among other things,
> > > erroneous assessments of the amount of delayed_refs reserved in the
> > > metadata space reclaim logic, like need_preemptive_reclaim() which
> > > relies on fs_info->delayed_rsv->reserved and even worse, poor decision
> > > making in btrfs_preempt_reclaim_metadata_space() which counts
> > > delalloc_bytes like so:
> > >
> > > block_rsv_size = global_rsv_size +
> > > btrfs_block_rsv_reserved(delayed_block_rsv) +
> > > btrfs_block_rsv_reserved(delayed_refs_rsv) +
> > > btrfs_block_rsv_reserved(trans_rsv);
> > > delalloc_size = bytes_may_use - block_rsv_size;
> > >
> > > So all that lost delayed refs usage gets accounted as delalloc_size and
> > > leads to preemptive reclaim continuously choosing FLUSH_DELALLOC, which
> > > further exacerbates the problem.
> > >
> > > With enough writeback around, we can run enough delalloc that we get
> > > into async reclaim which starts blocking start_transaction() and
> > > eventually hits FLUSH_DELALLOC_WAIT/FLUSH_DELALLOC_FULL at which point
> > > the filesystem gets heavily blocked on metadata space in reserve_space(),
> > > blocking all new transaction work until all the ordered_extents finish.
> > >
> > > If we had an accurate view of the reservation for delayed refs, then we
> > > could mostly break this feedback loop in preemptive reclaim, and
> > > generally would be able to make more accurate decisions with regards to
> > > metadata space reclamation.
> > >
> > > This patch adds extra metadata reservation to the inode's block_rsv to
> > > account for the delayed refs. When the ordered_extent finishes and we
> > > are about to do work in the transaction that uses delayed refs, we
> > > migrate enough for 1 extent. Since this is not necessarily perfect, we
> > > have to be careful and do a "soft" migrate which succeeds even if there
> > > is not enough reservation. This is strictly better than what we have and
> > > also matches how the delayed ref rsv gets used in the transaction at
> > > btrfs_update_delayed_refs_rsv().
> > >
> > > Aside from this data delayed_ref, there are also some metadata
> > > delayed_refs to consider. These are:
> > > - subvolume tree for the file extent item
> > > - csum tree for data csums
> > > - raid stripe tree if enabled
> > > - free space tree if enabled
> > >
> > > So account for those delayed_refs in the reservation as well. This
> > > greatly increases the size of the reservation as each metadata cow
> > > results in two delayed refs: one add for the new block in
> > > btrfs_alloc_tree_block() and one drop for the old in
> > > btrfs_free_tree_block(). As a result, to be completely conservative,
> > > we need to reserve 2 delayed refs worth of space for each cow.
> > >
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > ---
> > > fs/btrfs/delalloc-space.c | 56 +++++++++++++++++++++++++++++++++++++++
> > > fs/btrfs/delalloc-space.h | 3 +++
> > > fs/btrfs/inode.c | 2 ++
> > > fs/btrfs/transaction.c | 36 +++++++++++--------------
> > > 4 files changed, 76 insertions(+), 21 deletions(-)
> > >
> > > diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> > > index 0970799d0aa4..c99f516ce8af 100644
> > > --- a/fs/btrfs/delalloc-space.c
> > > +++ b/fs/btrfs/delalloc-space.c
> > > @@ -1,13 +1,16 @@
> > > // SPDX-License-Identifier: GPL-2.0
> > >
> > > +#include "linux/btrfs_tree.h"
> >
> > Not needed.
> >
> > > #include "messages.h"
> > > #include "ctree.h"
> > > #include "delalloc-space.h"
> > > +#include "delayed-ref.h"
> > > #include "block-rsv.h"
> > > #include "btrfs_inode.h"
> > > #include "space-info.h"
> > > #include "qgroup.h"
> > > #include "fs.h"
> > > +#include "transaction.h"
> > >
> > > /*
> > > * HOW DOES THIS WORK
> > > @@ -247,6 +250,38 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
> > > qgroup_to_release);
> > > }
> > >
> > > +/*
> > > > + * Each delalloc extent could become an ordered_extent and end up inserting a
> > > > + * new data extent and modifying a number of btrees. Each of those is associated with
> > > + * adding delayed refs which need a corresponding delayed refs reservation.
> > > + *
> > > + * Each metadata cow operation results in an add and a drop delayed ref, both of
> > > + * which call add_delayed_ref() and ultimately btrfs_update_delayed_refs_rsv(),
> > > + * so each must account for 2 delayed refs.
> > > + */
> > > +static u64 delalloc_calc_delayed_refs_rsv(const struct btrfs_inode *inode, u64 nr_extents)
> > > +{
> > > + const struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > > +
> > > + /*
> > > + * Factor for how many delayed refs updates we will generate per extent.
> > > + * Non-optional:
> > > + * extent tree: 1 data delayed ref.
> >
> > Running the delayed data ref to insert the extent item in the extent
> > tree requires COWing an extent tree leaf, which creates a delayed tree
> > ref for deletion too.
> > So, 2.
> >
> > Otherwise it looks good, thanks.
> >
>
> I think I might be getting confused now :)
>
> I'll try to work it out "out loud" and hopefully I can reach a good
> conclusion.
>
> I have been trying to reserve for the calls to add_delayed_ref in the
> transaction handle context. And tracing each cow in that context yields
> 2 delayed refs each while tracing the data extent insertion only creates
> one. The reason I care particularly about this context is the
> trans_handle delayed_refs rsv accounting in
> btrfs_update_delayed_refs_rsv(), as described in the patch. Ultimately,
> all the reservations get sunk into the global reserve from
> trans->delayed_rsv at btrfs_end_transaction(). Does this all sound right
> to you so far?
Yes.
>
> Cows that we clearly do from finish_one_ordered() are the ones to:
> raid stripe tree, subvol tree, and csum tree. Notably, the free space
> tree *does not* happen in this context, it happens when we run the
> delayed ref, so I think I was not consistent to include that. In that
> context we cow the extent and free space tree for the data extent.
Yes.
But the issue is that if we get a lot of delalloc and run it, we may
generate many delayed refs - those directly created by ordered extent
completion operations as you mention.
Then when we run them, either during the early phase of a transaction
commit or from the space flushing code, we will generate more delayed
refs to update the extent and free space trees (for COWing).
During that time we are running them, we will try to use space from
the global (fs_info->delayed_refs_rsv) delayed refs reserve and if
that fails, we try the global reserve.
Our calculations are pessimistic so even if we don't account for all
delayed refs, it's ok most of the time since the calculations assume
we have trees with a height of 8 and that we get node splits at every
level of a tree.
So that ends up compensating unless we have a really large amount of
delayed refs.
Granted, once we COW a leaf from the extent or free space tree, we
will not need to COW it again for every subsequent modification.
However, memory pressure or a call to btrfs_btree_balance_dirty() from
some other task can flush every COWed leaf/node, requiring another COW
operation.
So, in extreme cases, we could eventually run out of space. We
sporadically see reports of transaction aborts with -ENOSPC when
running delayed refs, both upstream and downstream. This isn't new;
it's been like this forever.
>
> But the point of reserving for a delayed ref is to account for the cows
> that it will ultimately cause and reserve ahead for them. So I think it
> does make sense to reserve for the fst and extent tree. On the other
> hand, I think we actively *don't* want to account for the potential
> infinite regress where each metadata delayed_ref run "unluckily" results
> in cow-ing the extent tree again and again since it doesn't actually
> happen in practice (especially with Leo's writeback inhibition and
> hopefully eb overwriting soon). That series diverges so it would require
> an infinite reservation anyway.
The writeback inhibition thing doesn't prevent what I mentioned
before, it only applies when a tree modification must restart.
I'm talking about subsequent tree modifications that require COWing
the same nodes/leaves because writeback was triggered for the btree
inode in between.
>
> In conclusion, I think I am agreeing and proposing that I shift the goal
> from "count add_delayed_ref calls under the trans handle" to "predict
> metadata reservation due to delayed refs" and in that world, it should
> be 2 for the data extent in the extent tree.
>
> The other principled option is to count 1 for the data delayed ref now
> and not account for the fst or the extent tree cows that happen at
> delayed_refs time, but I think that makes less sense.
And how would you do that?
Because running delayed refs happens normally at transaction commit
time, both before and after the critical section, or when flushing
space.
Those are places where I don't see how we can reserve space.
>
> Sorry if that was a lot of text to just agree with you, but I want to be
> careful since it's easy to go astray and "get away with it" because of the
> reservations being conservative and getting sloshed into the global
> reserves.
No worries.
>
> One final note / question:
> the "normal" start_transaction() reservation which computes the space
> for the cow itself and then the delayed_refs cows does *not* seem to
> count 2 per cow for delayed refs in btrfs_calc_delayed_ref_bytes().
> But in theory the same reasoning applies that we will touch two metadata
> items for each cow. I assume this is OK in practice because the depth
> is generally not 8 anyway, but do you think that is something I should
> clean up as well?
Yes, we have a lot of places where we don't account for everything.
Either because someone forgot about it or because we don't know in
advance how much we will need.
For example, assuming an entire delalloc range up to 128M results in a
single extent is very optimistic.
We assume this, but we don't know until we actually run writeback and
try to reserve space - a 32M write might only be satisfied by
allocating 8192 4K extents due to fragmentation - yet we reserve space
accounting for a single extent.
I've seen cases where this resulted in a transaction abort, with
-ENOSPC, during ordered extent completion.
One thing to pay special attention to is that if we start reserving a
lot more space, some tests that currently pass may start failing with
-ENOSPC or they may take significantly longer due to more space
flushing.
It's a very tricky area. I've experienced this pain several times in the past.
So it's probably better if this patch accounts only for delayed refs
generated during extent completion.
I noticed you already sent a new patch version before I could reply, I
will look at it now.
Thanks.
>
> Thanks again,
> Boris
>
> > > + * subvolume tree: 1 cow => 2 metadata delayed refs.
> > > + */
> > > + int factor = 3;
> > > +
> > > + /* The remaining trees are only written to conditionally. */
> > > + if (!(inode->flags & BTRFS_INODE_NODATASUM))
> > > + factor += 2;
> > > + if (btrfs_test_opt(fs_info, FREE_SPACE_TREE))
> > > + factor += 2;
> > > + if (btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE))
> > > + factor += 2;
> > > +
> > > + return btrfs_calc_insert_metadata_size(fs_info, nr_extents) * factor;
> > > +}
> > > +
> > > static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> > > struct btrfs_inode *inode)
> > > {
> > > @@ -266,6 +301,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> > > reserve_size = btrfs_calc_insert_metadata_size(fs_info,
> > > outstanding_extents);
> > > reserve_size += btrfs_calc_metadata_size(fs_info, 1);
> > > + reserve_size += delalloc_calc_delayed_refs_rsv(inode, outstanding_extents);
> > > }
> > > if (!(inode->flags & BTRFS_INODE_NODATASUM)) {
> > > u64 csum_leaves;
> > > @@ -309,9 +345,29 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
> > > * for an inode update.
> > > */
> > > *meta_reserve += inode_update;
> > > +
> > > + *meta_reserve += delalloc_calc_delayed_refs_rsv(inode, nr_extents);
> > > +
> > > *qgroup_reserve = nr_extents * fs_info->nodesize;
> > > }
> > >
> > > +void btrfs_delalloc_migrate_delayed_refs_rsv(struct btrfs_trans_handle *trans,
> > > + struct btrfs_inode *inode)
> > > +{
> > > + struct btrfs_block_rsv *inode_rsv = &inode->block_rsv;
> > > + struct btrfs_block_rsv *trans_rsv = &trans->delayed_rsv;
> > > + u64 num_bytes = delalloc_calc_delayed_refs_rsv(inode, 1);
> > > +
> > > + spin_lock(&inode_rsv->lock);
> > > + num_bytes = min(num_bytes, inode_rsv->reserved);
> > > + inode_rsv->reserved -= num_bytes;
> > > + inode_rsv->full = (inode_rsv->reserved >= inode_rsv->size);
> > > + spin_unlock(&inode_rsv->lock);
> > > +
> > > + btrfs_block_rsv_add_bytes(trans_rsv, num_bytes, true);
> > > + trans->delayed_refs_bytes_reserved += num_bytes;
> > > +}
> > > +
> > > int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> > > u64 disk_num_bytes, bool noflush)
> > > {
> > > diff --git a/fs/btrfs/delalloc-space.h b/fs/btrfs/delalloc-space.h
> > > index 6119c0d3f883..bd7041166987 100644
> > > --- a/fs/btrfs/delalloc-space.h
> > > +++ b/fs/btrfs/delalloc-space.h
> > > @@ -8,6 +8,7 @@
> > > struct extent_changeset;
> > > struct btrfs_inode;
> > > struct btrfs_fs_info;
> > > +struct btrfs_trans_handle;
> > >
> > > int btrfs_alloc_data_chunk_ondemand(const struct btrfs_inode *inode, u64 bytes);
> > > int btrfs_check_data_free_space(struct btrfs_inode *inode,
> > > @@ -27,5 +28,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> > > u64 disk_num_bytes, bool noflush);
> > > void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
> > > void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len);
> > > +void btrfs_delalloc_migrate_delayed_refs_rsv(struct btrfs_trans_handle *trans,
> > > + struct btrfs_inode *inode);
> > >
> > > #endif /* BTRFS_DELALLOC_SPACE_H */
> > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > index 40474014c03f..15945744a304 100644
> > > --- a/fs/btrfs/inode.c
> > > +++ b/fs/btrfs/inode.c
> > > @@ -653,6 +653,7 @@ static noinline int __cow_file_range_inline(struct btrfs_inode *inode,
> > > goto out;
> > > }
> > > trans->block_rsv = &inode->block_rsv;
> > > + btrfs_delalloc_migrate_delayed_refs_rsv(trans, inode);
> > >
> > > drop_args.path = path;
> > > drop_args.start = 0;
> > > @@ -3259,6 +3260,7 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
> > > }
> > >
> > > trans->block_rsv = &inode->block_rsv;
> > > + btrfs_delalloc_migrate_delayed_refs_rsv(trans, inode);
> > >
> > > ret = btrfs_insert_raid_extent(trans, ordered_extent);
> > > if (unlikely(ret)) {
> > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > > index 248adb785051..55791bb100a2 100644
> > > --- a/fs/btrfs/transaction.c
> > > +++ b/fs/btrfs/transaction.c
> > > @@ -1047,29 +1047,23 @@ static void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans)
> > > return;
> > > }
> > >
> > > - if (!trans->bytes_reserved) {
> > > - ASSERT(trans->delayed_refs_bytes_reserved == 0,
> > > - "trans->delayed_refs_bytes_reserved=%llu",
> > > - trans->delayed_refs_bytes_reserved);
> > > - return;
> > > + if (trans->bytes_reserved) {
> > > + ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
> > > + trace_btrfs_space_reservation(fs_info, "transaction",
> > > + trans->transid, trans->bytes_reserved, 0);
> > > + btrfs_block_rsv_release(fs_info, trans->block_rsv,
> > > + trans->bytes_reserved, NULL);
> > > + trans->bytes_reserved = 0;
> > > }
> > >
> > > - ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
> > > - trace_btrfs_space_reservation(fs_info, "transaction",
> > > - trans->transid, trans->bytes_reserved, 0);
> > > - btrfs_block_rsv_release(fs_info, trans->block_rsv,
> > > - trans->bytes_reserved, NULL);
> > > - trans->bytes_reserved = 0;
> > > -
> > > - if (!trans->delayed_refs_bytes_reserved)
> > > - return;
> > > -
> > > - trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
> > > - trans->transid,
> > > - trans->delayed_refs_bytes_reserved, 0);
> > > - btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
> > > - trans->delayed_refs_bytes_reserved, NULL);
> > > - trans->delayed_refs_bytes_reserved = 0;
> > > + if (trans->delayed_refs_bytes_reserved) {
> > > + trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
> > > + trans->transid,
> > > + trans->delayed_refs_bytes_reserved, 0);
> > > + btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
> > > + trans->delayed_refs_bytes_reserved, NULL);
> > > + trans->delayed_refs_bytes_reserved = 0;
> > > + }
> > > }
> > >
> > > static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
> > > --
> > > 2.53.0
> > >
> > >
* Re: [PATCH v3 1/4] btrfs: reserve space for delayed_refs in delalloc
2026-04-10 15:27 ` Filipe Manana
@ 2026-04-10 23:06 ` Boris Burkov
0 siblings, 0 replies; 11+ messages in thread
From: Boris Burkov @ 2026-04-10 23:06 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs, kernel-team
On Fri, Apr 10, 2026 at 04:27:28PM +0100, Filipe Manana wrote:
> On Wed, Apr 8, 2026 at 6:34 PM Boris Burkov <boris@bur.io> wrote:
> >
> > On Wed, Apr 08, 2026 at 03:56:14PM +0100, Filipe Manana wrote:
> > > On Tue, Apr 7, 2026 at 8:31 PM Boris Burkov <boris@bur.io> wrote:
> > > >
> > > > delalloc uses a per-inode block_rsv to perform metadata reservations for
> > > > the cow operations it anticipates based on the number of outstanding
> > > > extents. This calculation is done based on inode->outstanding_extents in
> > > > btrfs_calculate_inode_block_rsv_size(). The reservation is *not*
> > > > meticulously tracked as each ordered_extent is actually created in
> > > > writeback, but rather delalloc attempts to over-estimate and the
> > > > writeback and ordered_extent finish portions are responsible to release
> > > > all the reservation.
> > > >
> > > > However, there is a notable gap in this reservation: it reserves no
> > > > space for the resulting delayed_refs. If you compare to how
> > > > btrfs_start_transaction() reservations work, this is a notable
> > > > difference.
> > > >
> > > > As writeback actually occurs, and we trigger btrfs_finish_one_ordered(),
> > > > that function will start generating delayed refs, which will draw from
> > > > the trans_handle's delayed_refs_rsv via btrfs_update_delayed_refs_rsv():
> > > >
> > > > For example, we can trace the primary data delayed ref:
> > > >
> > > > btrfs_finish_one_ordered()
> > > > insert_ordered_extent_file_extent()
> > > > insert_reserved_file_extent()
> > > > btrfs_alloc_reserved_file_extent()
> > > > btrfs_add_delayed_data_ref()
> > > > add_delayed_ref()
> > > > btrfs_update_delayed_refs_rsv();
> > > >
> > > > This trans_handle was created in finish_one_ordered() with
> > > > btrfs_join_transaction() which calls start_transaction with
> > > > num_items=0 and BTRFS_RESERVE_NO_FLUSH. As a result, this trans_handle
> > > > has nothing reserved in h->delayed_rsv, as neither the num_items reservation
> > > > nor the btrfs_delayed_refs_rsv_refill() reservation is run.
> > > >
> > > > Thus, when btrfs_update_delayed_refs_rsv() runs, reserved_bytes is 0 and
> > > > fs_info->delayed_rsv->size grows but not fs_info->delayed_rsv->reserved.
> > > >
> > > > If a large amount of writeback happens all at once (perhaps due to
> > > > dirty_ratio being tuned too high), this results in, among other things,
> > > > erroneous assessments of the amount of delayed_refs reserved in the
> > > > metadata space reclaim logic, like need_preemptive_reclaim() which
> > > > relies on fs_info->delayed_rsv->reserved and even worse, poor decision
> > > > making in btrfs_preempt_reclaim_metadata_space() which counts
> > > > delalloc_bytes like so:
> > > >
> > > > block_rsv_size = global_rsv_size +
> > > > btrfs_block_rsv_reserved(delayed_block_rsv) +
> > > > btrfs_block_rsv_reserved(delayed_refs_rsv) +
> > > > btrfs_block_rsv_reserved(trans_rsv);
> > > > delalloc_size = bytes_may_use - block_rsv_size;
> > > >
> > > > So all that lost delayed refs usage gets accounted as delalloc_size and
> > > > leads to preemptive reclaim continuously choosing FLUSH_DELALLOC, which
> > > > further exacerbates the problem.
> > > >
> > > > With enough writeback around, we can run enough delalloc that we get
> > > > into async reclaim which starts blocking start_transaction() and
> > > > eventually hits FLUSH_DELALLOC_WAIT/FLUSH_DELALLOC_FULL at which point
> > > > the filesystem gets heavily blocked on metadata space in reserve_space(),
> > > > blocking all new transaction work until all the ordered_extents finish.
> > > >
> > > > If we had an accurate view of the reservation for delayed refs, then we
> > > > could mostly break this feedback loop in preemptive reclaim, and
> > > > generally would be able to make more accurate decisions with regards to
> > > > metadata space reclamation.
> > > >
> > > > This patch adds extra metadata reservation to the inode's block_rsv to
> > > > account for the delayed refs. When the ordered_extent finishes and we
> > > > are about to do work in the transaction that uses delayed refs, we
> > > > migrate enough for 1 extent. Since this is not necessarily perfect, we
> > > > have to be careful and do a "soft" migrate which succeeds even if there
> > > > is not enough reservation. This is strictly better than what we have and
> > > > also matches how the delayed ref rsv gets used in the transaction at
> > > > btrfs_update_delayed_refs_rsv().
> > > >
> > > > Aside from this data delayed_ref, there are also some metadata
> > > > delayed_refs to consider. These are:
> > > > - subvolume tree for the file extent item
> > > > - csum tree for data csums
> > > > - raid stripe tree if enabled
> > > > - free space tree if enabled
> > > >
> > > > So account for those delayed_refs in the reservation as well. This
> > > > greatly increases the size of the reservation as each metadata cow
> > > > results in two delayed refs: one add for the new block in
> > > > btrfs_alloc_tree_block() and one drop for the old in
> > > > btrfs_free_tree_block(). As a result, to be completely conservative,
> > > > we need to reserve 2 delayed refs worth of space for each cow.
> > > >
> > > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > > ---
> > > > fs/btrfs/delalloc-space.c | 56 +++++++++++++++++++++++++++++++++++++++
> > > > fs/btrfs/delalloc-space.h | 3 +++
> > > > fs/btrfs/inode.c | 2 ++
> > > > fs/btrfs/transaction.c | 36 +++++++++++--------------
> > > > 4 files changed, 76 insertions(+), 21 deletions(-)
> > > >
> > > > diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> > > > index 0970799d0aa4..c99f516ce8af 100644
> > > > --- a/fs/btrfs/delalloc-space.c
> > > > +++ b/fs/btrfs/delalloc-space.c
> > > > @@ -1,13 +1,16 @@
> > > > // SPDX-License-Identifier: GPL-2.0
> > > >
> > > > +#include "linux/btrfs_tree.h"
> > >
> > > Not needed.
> > >
> > > > #include "messages.h"
> > > > #include "ctree.h"
> > > > #include "delalloc-space.h"
> > > > +#include "delayed-ref.h"
> > > > #include "block-rsv.h"
> > > > #include "btrfs_inode.h"
> > > > #include "space-info.h"
> > > > #include "qgroup.h"
> > > > #include "fs.h"
> > > > +#include "transaction.h"
> > > >
> > > > /*
> > > > * HOW DOES THIS WORK
> > > > @@ -247,6 +250,38 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
> > > > qgroup_to_release);
> > > > }
> > > >
> > > > +/*
> > > > + * Each delalloc extent could become an ordered_extent and end up inserting a
> > > > + * new data extent and modifying a number of btrees. Each of those is associated with
> > > > + * adding delayed refs which need a corresponding delayed refs reservation.
> > > > + *
> > > > + * Each metadata cow operation results in an add and a drop delayed ref, both of
> > > > + * which call add_delayed_ref() and ultimately btrfs_update_delayed_refs_rsv(),
> > > > + * so each must account for 2 delayed refs.
> > > > + */
> > > > +static u64 delalloc_calc_delayed_refs_rsv(const struct btrfs_inode *inode, u64 nr_extents)
> > > > +{
> > > > + const struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > > > +
> > > > + /*
> > > > + * Factor for how many delayed refs updates we will generate per extent.
> > > > + * Non-optional:
> > > > + * extent tree: 1 data delayed ref.
> > >
> > > Running the delayed data ref to insert the extent item in the extent
> > > tree requires COWing an extent tree leaf, which creates a delayed tree
> > > ref for deletion too.
> > > So, 2.
> > >
> > > Otherwise it looks good, thanks.
> > >
> >
> > I think I might be getting confused now :)
> >
> > I'll try to work it out "out loud" and hopefully I can reach a good
> > conclusion.
> >
> > I have been trying to reserve for the calls to add_delayed_ref in the
> > transaction handle context. And tracing each cow in that context yields
> > 2 delayed refs each while tracing the data extent insertion only creates
> > one. The reason I care particularly about this context is the
> > trans_handle delayed_refs rsv accounting in
> > btrfs_update_delayed_refs_rsv(), as described in the patch. Ultimately,
> > all the reservations get sunk into the global reserve from
> > trans->delayed_rsv at btrfs_end_transaction(). Does this all sound right
> > to you so far?
>
> Yes.
>
> >
> > Cows that we clearly do from finish_one_ordered() are the ones to:
> > raid stripe tree, subvol tree, and csum tree. Notably, the free space
> > tree *does not* happen in this context, it happens when we run the
> > delayed ref, so I think I was not consistent to include that. In that
> > context we cow the extent and free space tree for the data extent.
>
> Yes.
>
> But the issue is that if we get a lot of delalloc and run it, we may
> generate many delayed refs - those directly created by ordered extent
> completion operations as you mention.
>
> Then when we run them, either during the early phase of a transaction
> commit or from the space flushing code, we will generate more delayed
> refs to update the extent and free space trees (for COWing).
> During that time we are running them, we will try to use space from
> the global (fs_info->delayed_refs_rsv) delayed refs reserve and if
> that fails, we try the global reserve.
>
> Our calculations are pessimistic so even if we don't account for all
> delayed refs, it's ok most of the time since the calculations assume
> we have trees with a height of 8 and that we get node splits at every
> level of a tree.
> So that ends up compensating unless we have a really large amount of
> delayed refs.
>
> Granted, once we COW a leaf from the extent or free space tree, we
> will not need to COW it again for every subsequent modification.
> However, memory pressure or a call to btrfs_btree_balance_dirty() from
> some other task can flush every COWed leaf/node, requiring another COW
> operation.
>
> So, in extreme cases, we could eventually run out of space. We
> sporadically see reports of transaction aborts with -ENOSPC when
> running delayed refs, both upstream and downstream. This isn't new;
> it's been like this forever.
>
All makes sense, and agreed.
> >
> > But the point of reserving for a delayed ref is to account for the cows
> > that it will ultimately cause and reserve ahead for them. So I think it
> > does make sense to reserve for the fst and extent tree. On the other
> > hand, I think we actively *don't* want to account for the potential
> > infinite regress where each metadata delayed_ref run "unluckily" results
> > in cow-ing the extent tree again and again since it doesn't actually
> > happen in practice (especially with Leo's writeback inhibition and
> > hopefully eb overwriting soon). That series diverges so it would require
> > an infinite reservation anyway.
>
> The writeback inhibition thing doesn't prevent what I mentioned
> before, it only applies when a tree modification must restart.
> I'm talking about subsequent tree modifications that require COWing
> the same nodes/leaves because writeback was triggered for the btree
> inode in between.
>
+1
> >
> > In conclusion, I think I am agreeing and proposing that I shift the goal
> > from "count add_delayed_ref calls under the trans handle" to "predict
> > metadata reservation due to delayed refs" and in that world, it should
> > be 2 for the data extent in the extent tree.
> >
> > The other principled option is to count 1 for the data delayed ref now
> > and not account for the fst or the extent tree cows that happen at
> > delayed_refs time, but I think that makes less sense.
>
> And how would you do that?
> Because running delayed refs happens normally at transaction commit
> time, both before and after the critical section, or when flushing
> space.
> Those are places where I don't see how we can reserve space.
>
> >
> > Sorry if that was a lot of text to just agree with you, but I want to be
> > careful since it's easy to go astray and "get away with it" cause of the
> > reservations being conservative and getting sloshed into the global
> > reserves.
>
> No worries.
>
> >
> > One final note / question:
> > the "normal" start_transaction() reservation which computes the space
> > for the cow itself and then the delayed_refs cows does *not* seem to
> > count 2 per cow for delayed refs in btrfs_calc_delayed_ref_bytes().
> > But in theory the same reasoning applies that we will touch two metadata
> > items for each cow. I assume this is OK in practice because the depth
> > is generally not 8 anyway, but do you think that is something I should
> > clean up as well?
>
> Yes, we have a lot of places where we don't account for everything.
> Either because someone forgot about it or because we don't know in
> advance how much we will need.
>
> For example, assuming an entire delalloc range up to 128M results in a
> single extent is very optimistic.
> We assume this, but we don't know until we actually run writeback and
> try to reserve space - a 32M write might only be satisfied by
> allocating 8192 4K extents due to fragmentation - yet we reserve space
> accounting for a single extent.
>
> I've seen cases where this resulted in a transaction abort, with
> -ENOSPC, during ordered extent completion.
>
> One thing to pay special attention to, is that if we start reserving a
> lot more space, some tests that currently pass may start failing with
> -ENOSPC or they may take significantly longer due to more space
> flushing.
> It's a very tricky area. I've experienced this pain several times in the past.
Yes, I haven't concretely felt this ENOSPC or excessive flushing pain
enough yet, so I feel a bit guilty for just cranking the rsv up and up
without any visible consequence. Like, why not bump the depth guess from
8 to 100 and be even "safer"? (Obviously a bad idea, just rhetorical.) I
also wouldn't propose reserving space hyper-pessimistically in every
dimension, like assuming minimum-sized extents instead of maximum-sized
ones.
I feel we have an inherent tension between optimistic reservation (e.g.
assuming delalloc produces large extents) and pessimistic reservation
(e.g. assuming we have depth 8 and cow every level with a split).
What do you think about moving towards making on-average accurate
reservations? What I mean by that is something like:
- change the start_transaction delayed refs reservation to account for
  the cow "correctly"
- detect / track tree depth (per root? a global max?)
- track splitting "risk"
- expose a better sense of the fragmentation to the rsv code (we
  already have the size-indexed free space rbtrees)
- etc...
And instead of a mix of over-guessing and under-guessing, we generally
try to right-size the guesses and bake in some intentional conservatism.
Put another way, it feels quite precarious that making an "improvement"
like dropping the impossible depth reservations could just ENOSPC abort
everybody horribly, because we were relying on that excess making it
into the global reserves. Or perhaps this change of mine will cause
horrible flushing problems elsewhere despite helping this heavy
writeback workload.
Working on this would of course require investing in some more workloads
and/or reproducers to cover the various tradeoffs you brought up.
>
> So it's probably better if this patch accounts only for delayed refs
> generated during extent completion.
> I noticed you already sent a new patch version before I could reply, I
> will look at it now.
>
> Thanks.
>
Thank you for the in-depth discussion, by the way, it was really helpful
for me.
Boris
> >
> > Thanks again,
> > Boris
> >
> > > > + * subvolume tree: 1 cow => 2 metadata delayed refs.
> > > > + */
> > > > + int factor = 3;
> > > > +
> > > > + /* The remaining trees are only written to conditionally. */
> > > > + if (!(inode->flags & BTRFS_INODE_NODATASUM))
> > > > + factor += 2;
> > > > + if (btrfs_test_opt(fs_info, FREE_SPACE_TREE))
> > > > + factor += 2;
> > > > + if (btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE))
> > > > + factor += 2;
> > > > +
> > > > + return btrfs_calc_insert_metadata_size(fs_info, nr_extents) * factor;
> > > > +}
> > > > +
> > > > static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> > > > struct btrfs_inode *inode)
> > > > {
> > > > @@ -266,6 +301,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> > > > reserve_size = btrfs_calc_insert_metadata_size(fs_info,
> > > > outstanding_extents);
> > > > reserve_size += btrfs_calc_metadata_size(fs_info, 1);
> > > > + reserve_size += delalloc_calc_delayed_refs_rsv(inode, outstanding_extents);
> > > > }
> > > > if (!(inode->flags & BTRFS_INODE_NODATASUM)) {
> > > > u64 csum_leaves;
> > > > @@ -309,9 +345,29 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
> > > > * for an inode update.
> > > > */
> > > > *meta_reserve += inode_update;
> > > > +
> > > > + *meta_reserve += delalloc_calc_delayed_refs_rsv(inode, nr_extents);
> > > > +
> > > > *qgroup_reserve = nr_extents * fs_info->nodesize;
> > > > }
> > > >
> > > > +void btrfs_delalloc_migrate_delayed_refs_rsv(struct btrfs_trans_handle *trans,
> > > > + struct btrfs_inode *inode)
> > > > +{
> > > > + struct btrfs_block_rsv *inode_rsv = &inode->block_rsv;
> > > > + struct btrfs_block_rsv *trans_rsv = &trans->delayed_rsv;
> > > > + u64 num_bytes = delalloc_calc_delayed_refs_rsv(inode, 1);
> > > > +
> > > > + spin_lock(&inode_rsv->lock);
> > > > + num_bytes = min(num_bytes, inode_rsv->reserved);
> > > > + inode_rsv->reserved -= num_bytes;
> > > > + inode_rsv->full = (inode_rsv->reserved >= inode_rsv->size);
> > > > + spin_unlock(&inode_rsv->lock);
> > > > +
> > > > + btrfs_block_rsv_add_bytes(trans_rsv, num_bytes, true);
> > > > + trans->delayed_refs_bytes_reserved += num_bytes;
> > > > +}
> > > > +
> > > > int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> > > > u64 disk_num_bytes, bool noflush)
> > > > {
> > > > diff --git a/fs/btrfs/delalloc-space.h b/fs/btrfs/delalloc-space.h
> > > > index 6119c0d3f883..bd7041166987 100644
> > > > --- a/fs/btrfs/delalloc-space.h
> > > > +++ b/fs/btrfs/delalloc-space.h
> > > > @@ -8,6 +8,7 @@
> > > > struct extent_changeset;
> > > > struct btrfs_inode;
> > > > struct btrfs_fs_info;
> > > > +struct btrfs_trans_handle;
> > > >
> > > > int btrfs_alloc_data_chunk_ondemand(const struct btrfs_inode *inode, u64 bytes);
> > > > int btrfs_check_data_free_space(struct btrfs_inode *inode,
> > > > @@ -27,5 +28,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> > > > u64 disk_num_bytes, bool noflush);
> > > > void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
> > > > void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len);
> > > > +void btrfs_delalloc_migrate_delayed_refs_rsv(struct btrfs_trans_handle *trans,
> > > > + struct btrfs_inode *inode);
> > > >
> > > > #endif /* BTRFS_DELALLOC_SPACE_H */
> > > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > > index 40474014c03f..15945744a304 100644
> > > > --- a/fs/btrfs/inode.c
> > > > +++ b/fs/btrfs/inode.c
> > > > @@ -653,6 +653,7 @@ static noinline int __cow_file_range_inline(struct btrfs_inode *inode,
> > > > goto out;
> > > > }
> > > > trans->block_rsv = &inode->block_rsv;
> > > > + btrfs_delalloc_migrate_delayed_refs_rsv(trans, inode);
> > > >
> > > > drop_args.path = path;
> > > > drop_args.start = 0;
> > > > @@ -3259,6 +3260,7 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
> > > > }
> > > >
> > > > trans->block_rsv = &inode->block_rsv;
> > > > + btrfs_delalloc_migrate_delayed_refs_rsv(trans, inode);
> > > >
> > > > ret = btrfs_insert_raid_extent(trans, ordered_extent);
> > > > if (unlikely(ret)) {
> > > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > > > index 248adb785051..55791bb100a2 100644
> > > > --- a/fs/btrfs/transaction.c
> > > > +++ b/fs/btrfs/transaction.c
> > > > @@ -1047,29 +1047,23 @@ static void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans)
> > > > return;
> > > > }
> > > >
> > > > - if (!trans->bytes_reserved) {
> > > > - ASSERT(trans->delayed_refs_bytes_reserved == 0,
> > > > - "trans->delayed_refs_bytes_reserved=%llu",
> > > > - trans->delayed_refs_bytes_reserved);
> > > > - return;
> > > > + if (trans->bytes_reserved) {
> > > > + ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
> > > > + trace_btrfs_space_reservation(fs_info, "transaction",
> > > > + trans->transid, trans->bytes_reserved, 0);
> > > > + btrfs_block_rsv_release(fs_info, trans->block_rsv,
> > > > + trans->bytes_reserved, NULL);
> > > > + trans->bytes_reserved = 0;
> > > > }
> > > >
> > > > - ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
> > > > - trace_btrfs_space_reservation(fs_info, "transaction",
> > > > - trans->transid, trans->bytes_reserved, 0);
> > > > - btrfs_block_rsv_release(fs_info, trans->block_rsv,
> > > > - trans->bytes_reserved, NULL);
> > > > - trans->bytes_reserved = 0;
> > > > -
> > > > - if (!trans->delayed_refs_bytes_reserved)
> > > > - return;
> > > > -
> > > > - trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
> > > > - trans->transid,
> > > > - trans->delayed_refs_bytes_reserved, 0);
> > > > - btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
> > > > - trans->delayed_refs_bytes_reserved, NULL);
> > > > - trans->delayed_refs_bytes_reserved = 0;
> > > > + if (trans->delayed_refs_bytes_reserved) {
> > > > + trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
> > > > + trans->transid,
> > > > + trans->delayed_refs_bytes_reserved, 0);
> > > > + btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
> > > > + trans->delayed_refs_bytes_reserved, NULL);
> > > > + trans->delayed_refs_bytes_reserved = 0;
> > > > + }
> > > > }
> > > >
> > > > static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
> > > > --
> > > > 2.53.0
> > > >
> > > >
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v3 2/4] btrfs: account for compression in delalloc extent reservation
2026-04-07 19:30 [PATCH v3 0/4] btrfs: improve stalls under sudden writeback Boris Burkov
2026-04-07 19:30 ` [PATCH v3 1/4] btrfs: reserve space for delayed_refs in delalloc Boris Burkov
@ 2026-04-07 19:30 ` Boris Burkov
2026-04-08 14:58 ` Filipe Manana
2026-04-07 19:30 ` [PATCH v3 3/4] btrfs: make inode->outstanding_extents a u64 Boris Burkov
2026-04-07 19:30 ` [PATCH v3 4/4] btrfs: cap shrink_delalloc iterations to 128M Boris Burkov
3 siblings, 1 reply; 11+ messages in thread
From: Boris Burkov @ 2026-04-07 19:30 UTC (permalink / raw)
To: linux-btrfs, kernel-team
The btrfs maximum uncompressed extent size is 128MiB. The maximum
compressed extent size in file extent space is 128KiB. Therefore, the
estimate for outstanding_extents is off by 3 orders of magnitude when
COMPRESS_FORCE is set or the inode is set to always compress.
Because we use re-calculation when necessary, rather than super detailed
extent tracking, we don't grow this reservation as the true number of
extents is revealed. We don't want to be too clever with it, however, as
we don't want the calculation to change for a given inode between
reservation and release, so we only rely on the forcing type flags.
With this change, we no longer under-reserve delayed refs reservations
for delalloc writes, even with compress-force.
Because this would turn count_max_extents() into a named shim for
div_u64(size + max_extent_size - 1, max_extent_size);
we can just get rid of it.
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/btrfs_inode.h | 2 +
fs/btrfs/delalloc-space.c | 13 ++++---
fs/btrfs/fs.h | 13 -------
fs/btrfs/inode.c | 78 ++++++++++++++++++++++++++++++++-------
4 files changed, 73 insertions(+), 33 deletions(-)
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 55c272fe5d92..5368ef87b41a 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -510,6 +510,8 @@ static inline bool btrfs_inode_can_compress(const struct btrfs_inode *inode)
return true;
}
+u64 btrfs_inode_max_extents(const struct btrfs_inode *inode, u64 size);
+
static inline void btrfs_assert_inode_locked(struct btrfs_inode *inode)
{
/* Immediately trigger a crash if the inode is not locked. */
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index c99f516ce8af..9ff02a7da05b 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -7,6 +7,7 @@
#include "delayed-ref.h"
#include "block-rsv.h"
#include "btrfs_inode.h"
+#include "compression.h"
#include "space-info.h"
#include "qgroup.h"
#include "fs.h"
@@ -66,7 +67,7 @@
* This is the number of file extent items we'll need to handle all of the
* outstanding DELALLOC space we have in this inode. We limit the maximum
* size of an extent, so a large contiguous dirty area may require more than
- * one outstanding_extent, which is why count_max_extents() is used to
+ * one outstanding_extent, which is why we use the max extent size to
* determine how many outstanding_extents get added.
*
* ->csum_bytes
@@ -328,7 +329,7 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
u64 *meta_reserve, u64 *qgroup_reserve)
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
- u64 nr_extents = count_max_extents(fs_info, num_bytes);
+ u64 nr_extents = btrfs_inode_max_extents(inode, num_bytes);
u64 csum_leaves;
u64 inode_update = btrfs_calc_metadata_size(fs_info, 1);
@@ -427,7 +428,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
* racing with an ordered completion or some such that would think it
* needs to free the reservation we just made.
*/
- nr_extents = count_max_extents(fs_info, num_bytes);
+ nr_extents = btrfs_inode_max_extents(inode, num_bytes);
spin_lock(&inode->lock);
btrfs_mod_outstanding_extents(inode, nr_extents);
if (!(inode->flags & BTRFS_INODE_NODATASUM))
@@ -494,7 +495,7 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
unsigned num_extents;
spin_lock(&inode->lock);
- num_extents = count_max_extents(fs_info, num_bytes);
+ num_extents = btrfs_inode_max_extents(inode, num_bytes);
btrfs_mod_outstanding_extents(inode, -num_extents);
btrfs_calculate_inode_block_rsv_size(fs_info, inode);
spin_unlock(&inode->lock);
@@ -509,8 +510,8 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len)
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
- const u32 reserved_num_extents = count_max_extents(fs_info, reserved_len);
- const u32 new_num_extents = count_max_extents(fs_info, new_len);
+ const u32 reserved_num_extents = btrfs_inode_max_extents(inode, reserved_len);
+ const u32 new_num_extents = btrfs_inode_max_extents(inode, new_len);
const int diff_num_extents = new_num_extents - reserved_num_extents;
ASSERT(new_len <= reserved_len);
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index a4758d94b32e..2c1626155645 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -1051,19 +1051,6 @@ static inline bool btrfs_is_zoned(const struct btrfs_fs_info *fs_info)
return IS_ENABLED(CONFIG_BLK_DEV_ZONED) && fs_info->zone_size > 0;
}
-/*
- * Count how many fs_info->max_extent_size cover the @size
- */
-static inline u32 count_max_extents(const struct btrfs_fs_info *fs_info, u64 size)
-{
-#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
- if (!fs_info)
- return div_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, BTRFS_MAX_EXTENT_SIZE);
-#endif
-
- return div_u64(size + fs_info->max_extent_size - 1, fs_info->max_extent_size);
-}
-
static inline unsigned int btrfs_blocks_per_folio(const struct btrfs_fs_info *fs_info,
const struct folio *folio)
{
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 15945744a304..55255f3794c6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -747,6 +747,56 @@ static int add_async_extent(struct async_chunk *cow, u64 start, u64 ram_size,
return 0;
}
+/*
+ * Check if compression will definitely be attempted for this inode based on
+ * mount options and inode properties. Unlike inode_need_compress(), this does
+ * NOT run the compression heuristic or check range-specific conditions, so it
+ * is safe to call under locks (e.g. io_tree lock) and for reservation sizing.
+ *
+ * Only returns true for cases where BTRFS_INODE_NOCOMPRESS cannot be set at
+ * runtime (FORCE_COMPRESS and prop_compress), ensuring that the effective max
+ * extent size is stable across paired set/clear delalloc operations.
+ */
+static inline bool inode_may_compress(const struct btrfs_inode *inode)
+{
+ if (!btrfs_inode_can_compress(inode))
+ return false;
+
+ /* Force compress always attempts compression. */
+ if (btrfs_test_opt(inode->root->fs_info, FORCE_COMPRESS))
+ return true;
+
+ /* Per-inode property: NOCOMPRESS cannot override this. */
+ if (inode->prop_compress)
+ return true;
+
+ return false;
+}
+
+/*
+ * Return the effective maximum extent size for reservation accounting.
+ *
+ * When compression is guaranteed to be attempted (FORCE_COMPRESS or
+ * prop_compress), the compression path splits ranges into
+ * BTRFS_MAX_UNCOMPRESSED chunks, each producing an independent ordered
+ * extent. Use that as the divisor instead of fs_info->max_extent_size
+ * to avoid severely undercounting outstanding extents.
+ */
+static u64 btrfs_inode_max_extent_size(const struct btrfs_inode *inode)
+{
+ if (inode_may_compress(inode))
+ return BTRFS_MAX_UNCOMPRESSED;
+
+ return inode->root->fs_info->max_extent_size;
+}
+
+u64 btrfs_inode_max_extents(const struct btrfs_inode *inode, u64 size)
+{
+ u64 max_extent_size = btrfs_inode_max_extent_size(inode);
+
+ return div_u64(size + max_extent_size - 1, max_extent_size);
+}
+
/*
* Check if the inode needs to be submitted to compression, based on mount
* options, defragmentation, properties or heuristics.
@@ -2459,8 +2509,8 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct folio *locked_fol
void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
struct extent_state *orig, u64 split)
{
- struct btrfs_fs_info *fs_info = inode->root->fs_info;
u64 size;
+ u64 max_extent_size = btrfs_inode_max_extent_size(inode);
lockdep_assert_held(&inode->io_tree.lock);
@@ -2469,8 +2519,8 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
return;
size = orig->end - orig->start + 1;
- if (size > fs_info->max_extent_size) {
- u32 num_extents;
+ if (size > max_extent_size) {
+ u64 num_extents;
u64 new_size;
/*
@@ -2478,10 +2528,10 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
* applies here, just in reverse.
*/
new_size = orig->end - split + 1;
- num_extents = count_max_extents(fs_info, new_size);
+ num_extents = btrfs_inode_max_extents(inode, new_size);
new_size = split - orig->start;
- num_extents += count_max_extents(fs_info, new_size);
- if (count_max_extents(fs_info, size) >= num_extents)
+ num_extents += btrfs_inode_max_extents(inode, new_size);
+ if (btrfs_inode_max_extents(inode, size) >= num_extents)
return;
}
@@ -2498,9 +2548,9 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state *new,
struct extent_state *other)
{
- struct btrfs_fs_info *fs_info = inode->root->fs_info;
u64 new_size, old_size;
- u32 num_extents;
+ u64 max_extent_size = btrfs_inode_max_extent_size(inode);
+ u64 num_extents;
lockdep_assert_held(&inode->io_tree.lock);
@@ -2514,7 +2564,7 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
new_size = other->end - new->start + 1;
/* we're not bigger than the max, unreserve the space and go */
- if (new_size <= fs_info->max_extent_size) {
+ if (new_size <= max_extent_size) {
spin_lock(&inode->lock);
btrfs_mod_outstanding_extents(inode, -1);
spin_unlock(&inode->lock);
@@ -2540,10 +2590,10 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
* this case.
*/
old_size = other->end - other->start + 1;
- num_extents = count_max_extents(fs_info, old_size);
+ num_extents = btrfs_inode_max_extents(inode, old_size);
old_size = new->end - new->start + 1;
- num_extents += count_max_extents(fs_info, old_size);
- if (count_max_extents(fs_info, new_size) >= num_extents)
+ num_extents += btrfs_inode_max_extents(inode, old_size);
+ if (btrfs_inode_max_extents(inode, new_size) >= num_extents)
return;
spin_lock(&inode->lock);
@@ -2616,7 +2666,7 @@ void btrfs_set_delalloc_extent(struct btrfs_inode *inode, struct extent_state *s
if (!(state->state & EXTENT_DELALLOC) && (bits & EXTENT_DELALLOC)) {
u64 len = state->end + 1 - state->start;
u64 prev_delalloc_bytes;
- u32 num_extents = count_max_extents(fs_info, len);
+ u32 num_extents = btrfs_inode_max_extents(inode, len);
spin_lock(&inode->lock);
btrfs_mod_outstanding_extents(inode, num_extents);
@@ -2662,7 +2712,7 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
u64 len = state->end + 1 - state->start;
- u32 num_extents = count_max_extents(fs_info, len);
+ u32 num_extents = btrfs_inode_max_extents(inode, len);
lockdep_assert_held(&inode->io_tree.lock);
--
2.53.0
* Re: [PATCH v3 2/4] btrfs: account for compression in delalloc extent reservation
2026-04-07 19:30 ` [PATCH v3 2/4] btrfs: account for compression in delalloc extent reservation Boris Burkov
@ 2026-04-08 14:58 ` Filipe Manana
0 siblings, 0 replies; 11+ messages in thread
From: Filipe Manana @ 2026-04-08 14:58 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs, kernel-team
On Tue, Apr 7, 2026 at 8:31 PM Boris Burkov <boris@bur.io> wrote:
>
> The btrfs maximum uncompressed extent size is 128MiB. The maximum
> compressed extent size in file extent space is 128KiB. Therefore, the
> estimate for outstanding_extents is off by 3 orders of magnitude when
> COMPRESS_FORCE is set or the inode is set to always compress.
>
> Because we use re-calculation when necessary, rather than super detailed
> extent tracking, we don't grow this reservation as the true number of
> extents is revealed. We don't want to be too clever with it, however, as
> we don't want the calculation to change for a given inode between
> reservation and release, so we only rely on the forcing type flags.
>
> With this change, we no longer under-reserve delayed refs reservations
> for delalloc writes, even with compress-force.
>
> Because this would turn count_max_extents() into a named shim for
> div_u64(size + max_extent_size - 1, max_extent_size);
> we can just get rid of it.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
> fs/btrfs/btrfs_inode.h | 2 +
> fs/btrfs/delalloc-space.c | 13 ++++---
> fs/btrfs/fs.h | 13 -------
> fs/btrfs/inode.c | 78 ++++++++++++++++++++++++++++++++-------
> 4 files changed, 73 insertions(+), 33 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 55c272fe5d92..5368ef87b41a 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -510,6 +510,8 @@ static inline bool btrfs_inode_can_compress(const struct btrfs_inode *inode)
> return true;
> }
>
> +u64 btrfs_inode_max_extents(const struct btrfs_inode *inode, u64 size);
> +
> static inline void btrfs_assert_inode_locked(struct btrfs_inode *inode)
> {
> /* Immediately trigger a crash if the inode is not locked. */
> diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> index c99f516ce8af..9ff02a7da05b 100644
> --- a/fs/btrfs/delalloc-space.c
> +++ b/fs/btrfs/delalloc-space.c
> @@ -7,6 +7,7 @@
> #include "delayed-ref.h"
> #include "block-rsv.h"
> #include "btrfs_inode.h"
> +#include "compression.h"
No need for this include. We're not using anything from compression.{h,c}.
With that removed:
Reviewed-by: Filipe Manana <fdmanana@suse.com>
> #include "space-info.h"
> #include "qgroup.h"
> #include "fs.h"
> @@ -66,7 +67,7 @@
> * This is the number of file extent items we'll need to handle all of the
> * outstanding DELALLOC space we have in this inode. We limit the maximum
> * size of an extent, so a large contiguous dirty area may require more than
> - * one outstanding_extent, which is why count_max_extents() is used to
> + * one outstanding_extent, which is why we use the max extent size to
> * determine how many outstanding_extents get added.
> *
> * ->csum_bytes
> @@ -328,7 +329,7 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
> u64 *meta_reserve, u64 *qgroup_reserve)
> {
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
> - u64 nr_extents = count_max_extents(fs_info, num_bytes);
> + u64 nr_extents = btrfs_inode_max_extents(inode, num_bytes);
> u64 csum_leaves;
> u64 inode_update = btrfs_calc_metadata_size(fs_info, 1);
>
> @@ -427,7 +428,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> * racing with an ordered completion or some such that would think it
> * needs to free the reservation we just made.
> */
> - nr_extents = count_max_extents(fs_info, num_bytes);
> + nr_extents = btrfs_inode_max_extents(inode, num_bytes);
> spin_lock(&inode->lock);
> btrfs_mod_outstanding_extents(inode, nr_extents);
> if (!(inode->flags & BTRFS_INODE_NODATASUM))
> @@ -494,7 +495,7 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
> unsigned num_extents;
>
> spin_lock(&inode->lock);
> - num_extents = count_max_extents(fs_info, num_bytes);
> + num_extents = btrfs_inode_max_extents(inode, num_bytes);
> btrfs_mod_outstanding_extents(inode, -num_extents);
> btrfs_calculate_inode_block_rsv_size(fs_info, inode);
> spin_unlock(&inode->lock);
> @@ -509,8 +510,8 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
> void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len)
> {
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
> - const u32 reserved_num_extents = count_max_extents(fs_info, reserved_len);
> - const u32 new_num_extents = count_max_extents(fs_info, new_len);
> + const u32 reserved_num_extents = btrfs_inode_max_extents(inode, reserved_len);
> + const u32 new_num_extents = btrfs_inode_max_extents(inode, new_len);
> const int diff_num_extents = new_num_extents - reserved_num_extents;
>
> ASSERT(new_len <= reserved_len);
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index a4758d94b32e..2c1626155645 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -1051,19 +1051,6 @@ static inline bool btrfs_is_zoned(const struct btrfs_fs_info *fs_info)
> return IS_ENABLED(CONFIG_BLK_DEV_ZONED) && fs_info->zone_size > 0;
> }
>
> -/*
> - * Count how many fs_info->max_extent_size cover the @size
> - */
> -static inline u32 count_max_extents(const struct btrfs_fs_info *fs_info, u64 size)
> -{
> -#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> - if (!fs_info)
> - return div_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, BTRFS_MAX_EXTENT_SIZE);
> -#endif
> -
> - return div_u64(size + fs_info->max_extent_size - 1, fs_info->max_extent_size);
> -}
> -
> static inline unsigned int btrfs_blocks_per_folio(const struct btrfs_fs_info *fs_info,
> const struct folio *folio)
> {
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 15945744a304..55255f3794c6 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -747,6 +747,56 @@ static int add_async_extent(struct async_chunk *cow, u64 start, u64 ram_size,
> return 0;
> }
>
> +/*
> + * Check if compression will definitely be attempted for this inode based on
> + * mount options and inode properties. Unlike inode_need_compress(), this does
> + * NOT run the compression heuristic or check range-specific conditions, so it
> + * is safe to call under locks (e.g. io_tree lock) and for reservation sizing.
> + *
> + * Only returns true for cases where BTRFS_INODE_NOCOMPRESS cannot be set at
> + * runtime (FORCE_COMPRESS and prop_compress), ensuring that the effective max
> + * extent size is stable across paired set/clear delalloc operations.
> + */
> +static inline bool inode_may_compress(const struct btrfs_inode *inode)
> +{
> + if (!btrfs_inode_can_compress(inode))
> + return false;
> +
> + /* Force compress always attempts compression. */
> + if (btrfs_test_opt(inode->root->fs_info, FORCE_COMPRESS))
> + return true;
> +
> + /* Per-inode property: NOCOMPRESS cannot override this. */
> + if (inode->prop_compress)
> + return true;
> +
> + return false;
> +}
> +
> +/*
> + * Return the effective maximum extent size for reservation accounting.
> + *
> + * When compression is guaranteed to be attempted (FORCE_COMPRESS or
> + * prop_compress), the compression path splits ranges into
> + * BTRFS_MAX_UNCOMPRESSED chunks, each producing an independent ordered
> + * extent. Use that as the divisor instead of fs_info->max_extent_size
> + * to avoid severely undercounting outstanding extents.
> + */
> +static u64 btrfs_inode_max_extent_size(const struct btrfs_inode *inode)
> +{
> + if (inode_may_compress(inode))
> + return BTRFS_MAX_UNCOMPRESSED;
> +
> + return inode->root->fs_info->max_extent_size;
> +}
> +
> +u64 btrfs_inode_max_extents(const struct btrfs_inode *inode, u64 size)
> +{
> + u64 max_extent_size = btrfs_inode_max_extent_size(inode);
> +
> + return div_u64(size + max_extent_size - 1, max_extent_size);
> +}
> +
> /*
> * Check if the inode needs to be submitted to compression, based on mount
> * options, defragmentation, properties or heuristics.
> @@ -2459,8 +2509,8 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct folio *locked_fol
> void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
> struct extent_state *orig, u64 split)
> {
> - struct btrfs_fs_info *fs_info = inode->root->fs_info;
> u64 size;
> + u64 max_extent_size = btrfs_inode_max_extent_size(inode);
>
> lockdep_assert_held(&inode->io_tree.lock);
>
> @@ -2469,8 +2519,8 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
> return;
>
> size = orig->end - orig->start + 1;
> - if (size > fs_info->max_extent_size) {
> - u32 num_extents;
> + if (size > max_extent_size) {
> + u64 num_extents;
> u64 new_size;
>
> /*
> @@ -2478,10 +2528,10 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
> * applies here, just in reverse.
> */
> new_size = orig->end - split + 1;
> - num_extents = count_max_extents(fs_info, new_size);
> + num_extents = btrfs_inode_max_extents(inode, new_size);
> new_size = split - orig->start;
> - num_extents += count_max_extents(fs_info, new_size);
> - if (count_max_extents(fs_info, size) >= num_extents)
> + num_extents += btrfs_inode_max_extents(inode, new_size);
> + if (btrfs_inode_max_extents(inode, size) >= num_extents)
> return;
> }
>
> @@ -2498,9 +2548,9 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
> void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state *new,
> struct extent_state *other)
> {
> - struct btrfs_fs_info *fs_info = inode->root->fs_info;
> u64 new_size, old_size;
> - u32 num_extents;
> + u64 max_extent_size = btrfs_inode_max_extent_size(inode);
> + u64 num_extents;
>
> lockdep_assert_held(&inode->io_tree.lock);
>
> @@ -2514,7 +2564,7 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
> new_size = other->end - new->start + 1;
>
> /* we're not bigger than the max, unreserve the space and go */
> - if (new_size <= fs_info->max_extent_size) {
> + if (new_size <= max_extent_size) {
> spin_lock(&inode->lock);
> btrfs_mod_outstanding_extents(inode, -1);
> spin_unlock(&inode->lock);
> @@ -2540,10 +2590,10 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
> * this case.
> */
> old_size = other->end - other->start + 1;
> - num_extents = count_max_extents(fs_info, old_size);
> + num_extents = btrfs_inode_max_extents(inode, old_size);
> old_size = new->end - new->start + 1;
> - num_extents += count_max_extents(fs_info, old_size);
> - if (count_max_extents(fs_info, new_size) >= num_extents)
> + num_extents += btrfs_inode_max_extents(inode, old_size);
> + if (btrfs_inode_max_extents(inode, new_size) >= num_extents)
> return;
>
> spin_lock(&inode->lock);
> @@ -2616,7 +2666,7 @@ void btrfs_set_delalloc_extent(struct btrfs_inode *inode, struct extent_state *s
> if (!(state->state & EXTENT_DELALLOC) && (bits & EXTENT_DELALLOC)) {
> u64 len = state->end + 1 - state->start;
> u64 prev_delalloc_bytes;
> - u32 num_extents = count_max_extents(fs_info, len);
> + u32 num_extents = btrfs_inode_max_extents(inode, len);
>
> spin_lock(&inode->lock);
> btrfs_mod_outstanding_extents(inode, num_extents);
> @@ -2662,7 +2712,7 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
> {
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
> u64 len = state->end + 1 - state->start;
> - u32 num_extents = count_max_extents(fs_info, len);
> + u32 num_extents = btrfs_inode_max_extents(inode, len);
>
> lockdep_assert_held(&inode->io_tree.lock);
>
> --
> 2.53.0
>
>
* [PATCH v3 3/4] btrfs: make inode->outstanding_extents a u64
2026-04-07 19:30 [PATCH v3 0/4] btrfs: improve stalls under sudden writeback Boris Burkov
2026-04-07 19:30 ` [PATCH v3 1/4] btrfs: reserve space for delayed_refs in delalloc Boris Burkov
2026-04-07 19:30 ` [PATCH v3 2/4] btrfs: account for compression in delalloc extent reservation Boris Burkov
@ 2026-04-07 19:30 ` Boris Burkov
2026-04-08 15:04 ` Filipe Manana
2026-04-07 19:30 ` [PATCH v3 4/4] btrfs: cap shrink_delalloc iterations to 128M Boris Burkov
3 siblings, 1 reply; 11+ messages in thread
From: Boris Burkov @ 2026-04-07 19:30 UTC (permalink / raw)
To: linux-btrfs, kernel-team
The maximum file size is MAX_LFS_FILESIZE = (loff_t)LLONG_MAX.
As a result, the maximum extent count computation in btrfs has always been
bounded above by LLONG_MAX / 128MiB, which is ~2^63 / 2^27 = 2^36. This has
never fit in a u32. With the recent changes to also divide by 128KiB in
compressed cases, that bound is even higher. Whether or not an overflow is
likely to happen, I think it is nice to try to capture the intent in the
types, so change outstanding_extents to u64, and make
mod_outstanding_extents try to capture some expectations around the size of
its inputs.
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/btrfs_inode.h | 18 ++++++++++++++----
fs/btrfs/delalloc-space.c | 19 +++++++++----------
fs/btrfs/inode.c | 14 +++++++-------
fs/btrfs/ordered-data.c | 4 ++--
fs/btrfs/tests/inode-tests.c | 18 +++++++++---------
include/trace/events/btrfs.h | 8 ++++----
6 files changed, 45 insertions(+), 36 deletions(-)
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 5368ef87b41a..04825b28f384 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -180,7 +180,7 @@ struct btrfs_inode {
* items we think we'll end up using, and reserved_extents is the number
* of extent items we've reserved metadata for. Protected by 'lock'.
*/
- unsigned outstanding_extents;
+ u64 outstanding_extents;
/* used to order data wrt metadata */
spinlock_t ordered_tree_lock;
@@ -429,14 +429,24 @@ static inline bool is_data_inode(const struct btrfs_inode *inode)
}
static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode,
- int mod)
+ int mod, u64 nr_extents)
{
+ s64 delta = mod * (s64)nr_extents;
+
lockdep_assert_held(&inode->lock);
- inode->outstanding_extents += mod;
+ ASSERT(mod == 1 || mod == -1, "mod=%d", mod);
+ ASSERT(nr_extents <= S64_MAX, "nr_extents=%llu", nr_extents);
+ ASSERT(mod == -1 || inode->outstanding_extents <= U64_MAX - nr_extents,
+ "nr_extents=%llu, inode->outstanding_extents=%llu",
+ nr_extents, inode->outstanding_extents);
+ ASSERT(mod == 1 || inode->outstanding_extents >= nr_extents,
+ "mod=%d, nr_extents=%llu, inode->outstanding_extents=%llu",
+ mod, nr_extents, inode->outstanding_extents);
+ inode->outstanding_extents += delta;
if (btrfs_is_free_space_inode(inode))
return;
trace_btrfs_inode_mod_outstanding_extents(inode->root, btrfs_ino(inode),
- mod, inode->outstanding_extents);
+ delta, inode->outstanding_extents);
}
/*
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index 9ff02a7da05b..d50081c8089a 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -289,7 +289,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
u64 reserve_size = 0;
u64 qgroup_rsv_size = 0;
- unsigned outstanding_extents;
+ u64 outstanding_extents;
lockdep_assert_held(&inode->lock);
outstanding_extents = inode->outstanding_extents;
@@ -316,7 +316,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
*
* This is overestimating in most cases.
*/
- qgroup_rsv_size = (u64)outstanding_extents * fs_info->nodesize;
+ qgroup_rsv_size = outstanding_extents * fs_info->nodesize;
spin_lock(&block_rsv->lock);
block_rsv->size = reserve_size;
@@ -376,7 +376,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
struct btrfs_fs_info *fs_info = root->fs_info;
struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
u64 meta_reserve, qgroup_reserve;
- unsigned nr_extents;
+ u64 nr_extents;
enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
int ret = 0;
@@ -430,7 +430,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
*/
nr_extents = btrfs_inode_max_extents(inode, num_bytes);
spin_lock(&inode->lock);
- btrfs_mod_outstanding_extents(inode, nr_extents);
+ btrfs_mod_outstanding_extents(inode, 1, nr_extents);
if (!(inode->flags & BTRFS_INODE_NODATASUM))
inode->csum_bytes += disk_num_bytes;
btrfs_calculate_inode_block_rsv_size(fs_info, inode);
@@ -492,11 +492,11 @@ void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes,
void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
- unsigned num_extents;
+ u64 num_extents;
spin_lock(&inode->lock);
num_extents = btrfs_inode_max_extents(inode, num_bytes);
- btrfs_mod_outstanding_extents(inode, -num_extents);
+ btrfs_mod_outstanding_extents(inode, -1, num_extents);
btrfs_calculate_inode_block_rsv_size(fs_info, inode);
spin_unlock(&inode->lock);
@@ -510,16 +510,15 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len)
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
- const u32 reserved_num_extents = btrfs_inode_max_extents(inode, reserved_len);
- const u32 new_num_extents = btrfs_inode_max_extents(inode, new_len);
- const int diff_num_extents = new_num_extents - reserved_num_extents;
+ const u64 reserved_num_extents = btrfs_inode_max_extents(inode, reserved_len);
+ const u64 new_num_extents = btrfs_inode_max_extents(inode, new_len);
ASSERT(new_len <= reserved_len);
if (new_num_extents == reserved_num_extents)
return;
spin_lock(&inode->lock);
- btrfs_mod_outstanding_extents(inode, diff_num_extents);
+ btrfs_mod_outstanding_extents(inode, -1, reserved_num_extents - new_num_extents);
btrfs_calculate_inode_block_rsv_size(fs_info, inode);
spin_unlock(&inode->lock);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 55255f3794c6..b45a92cfe94e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2536,7 +2536,7 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
}
spin_lock(&inode->lock);
- btrfs_mod_outstanding_extents(inode, 1);
+ btrfs_mod_outstanding_extents(inode, 1, 1);
spin_unlock(&inode->lock);
}
@@ -2566,7 +2566,7 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
/* we're not bigger than the max, unreserve the space and go */
if (new_size <= max_extent_size) {
spin_lock(&inode->lock);
- btrfs_mod_outstanding_extents(inode, -1);
+ btrfs_mod_outstanding_extents(inode, -1, 1);
spin_unlock(&inode->lock);
return;
}
@@ -2597,7 +2597,7 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
return;
spin_lock(&inode->lock);
- btrfs_mod_outstanding_extents(inode, -1);
+ btrfs_mod_outstanding_extents(inode, -1, 1);
spin_unlock(&inode->lock);
}
@@ -2666,10 +2666,10 @@ void btrfs_set_delalloc_extent(struct btrfs_inode *inode, struct extent_state *s
if (!(state->state & EXTENT_DELALLOC) && (bits & EXTENT_DELALLOC)) {
u64 len = state->end + 1 - state->start;
u64 prev_delalloc_bytes;
- u32 num_extents = btrfs_inode_max_extents(inode, len);
+ u64 num_extents = btrfs_inode_max_extents(inode, len);
spin_lock(&inode->lock);
- btrfs_mod_outstanding_extents(inode, num_extents);
+ btrfs_mod_outstanding_extents(inode, 1, num_extents);
spin_unlock(&inode->lock);
/* For sanity tests */
@@ -2712,7 +2712,7 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
u64 len = state->end + 1 - state->start;
- u32 num_extents = btrfs_inode_max_extents(inode, len);
+ u64 num_extents = btrfs_inode_max_extents(inode, len);
lockdep_assert_held(&inode->io_tree.lock);
@@ -2732,7 +2732,7 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
u64 new_delalloc_bytes;
spin_lock(&inode->lock);
- btrfs_mod_outstanding_extents(inode, -num_extents);
+ btrfs_mod_outstanding_extents(inode, -1, num_extents);
spin_unlock(&inode->lock);
/*
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index e5a24b3ff95e..96ee8ebfdb92 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -223,7 +223,7 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
* smallest the extent is going to get.
*/
spin_lock(&inode->lock);
- btrfs_mod_outstanding_extents(inode, 1);
+ btrfs_mod_outstanding_extents(inode, 1, 1);
spin_unlock(&inode->lock);
out:
@@ -655,7 +655,7 @@ void btrfs_remove_ordered_extent(struct btrfs_ordered_extent *entry)
btrfs_lockdep_acquire(fs_info, btrfs_trans_pending_ordered);
/* This is paired with alloc_ordered_extent(). */
spin_lock(&btrfs_inode->lock);
- btrfs_mod_outstanding_extents(btrfs_inode, -1);
+ btrfs_mod_outstanding_extents(btrfs_inode, -1, 1);
spin_unlock(&btrfs_inode->lock);
if (root != fs_info->tree_root) {
u64 release;
diff --git a/fs/btrfs/tests/inode-tests.c b/fs/btrfs/tests/inode-tests.c
index b04fbcaf0a1d..e63afbb9be2b 100644
--- a/fs/btrfs/tests/inode-tests.c
+++ b/fs/btrfs/tests/inode-tests.c
@@ -931,7 +931,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
}
if (BTRFS_I(inode)->outstanding_extents != 1) {
ret = -EINVAL;
- test_err("miscount, wanted 1, got %u",
+ test_err("miscount, wanted 1, got %llu",
BTRFS_I(inode)->outstanding_extents);
goto out;
}
@@ -946,7 +946,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
}
if (BTRFS_I(inode)->outstanding_extents != 2) {
ret = -EINVAL;
- test_err("miscount, wanted 2, got %u",
+ test_err("miscount, wanted 2, got %llu",
BTRFS_I(inode)->outstanding_extents);
goto out;
}
@@ -962,7 +962,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
}
if (BTRFS_I(inode)->outstanding_extents != 2) {
ret = -EINVAL;
- test_err("miscount, wanted 2, got %u",
+ test_err("miscount, wanted 2, got %llu",
BTRFS_I(inode)->outstanding_extents);
goto out;
}
@@ -978,7 +978,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
}
if (BTRFS_I(inode)->outstanding_extents != 2) {
ret = -EINVAL;
- test_err("miscount, wanted 2, got %u",
+ test_err("miscount, wanted 2, got %llu",
BTRFS_I(inode)->outstanding_extents);
goto out;
}
@@ -996,7 +996,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
}
if (BTRFS_I(inode)->outstanding_extents != 4) {
ret = -EINVAL;
- test_err("miscount, wanted 4, got %u",
+ test_err("miscount, wanted 4, got %llu",
BTRFS_I(inode)->outstanding_extents);
goto out;
}
@@ -1013,7 +1013,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
}
if (BTRFS_I(inode)->outstanding_extents != 3) {
ret = -EINVAL;
- test_err("miscount, wanted 3, got %u",
+ test_err("miscount, wanted 3, got %llu",
BTRFS_I(inode)->outstanding_extents);
goto out;
}
@@ -1029,7 +1029,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
}
if (BTRFS_I(inode)->outstanding_extents != 4) {
ret = -EINVAL;
- test_err("miscount, wanted 4, got %u",
+ test_err("miscount, wanted 4, got %llu",
BTRFS_I(inode)->outstanding_extents);
goto out;
}
@@ -1047,7 +1047,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
}
if (BTRFS_I(inode)->outstanding_extents != 3) {
ret = -EINVAL;
- test_err("miscount, wanted 3, got %u",
+ test_err("miscount, wanted 3, got %llu",
BTRFS_I(inode)->outstanding_extents);
goto out;
}
@@ -1061,7 +1061,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
}
if (BTRFS_I(inode)->outstanding_extents) {
ret = -EINVAL;
- test_err("miscount, wanted 0, got %u",
+ test_err("miscount, wanted 0, got %llu",
BTRFS_I(inode)->outstanding_extents);
goto out;
}
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 8ad7a2d76c1d..caabdc8d9eed 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2003,15 +2003,15 @@ DEFINE_EVENT(btrfs__prelim_ref, btrfs_prelim_ref_insert,
);
TRACE_EVENT(btrfs_inode_mod_outstanding_extents,
- TP_PROTO(const struct btrfs_root *root, u64 ino, int mod, unsigned outstanding),
+ TP_PROTO(const struct btrfs_root *root, u64 ino, s64 mod, u64 outstanding),
TP_ARGS(root, ino, mod, outstanding),
TP_STRUCT__entry_btrfs(
__field( u64, root_objectid )
__field( u64, ino )
- __field( int, mod )
- __field( unsigned, outstanding )
+ __field( s64, mod )
+ __field( u64, outstanding )
),
TP_fast_assign_btrfs(root->fs_info,
@@ -2021,7 +2021,7 @@ TRACE_EVENT(btrfs_inode_mod_outstanding_extents,
__entry->outstanding = outstanding;
),
- TP_printk_btrfs("root=%llu(%s) ino=%llu mod=%d outstanding=%u",
+ TP_printk_btrfs("root=%llu(%s) ino=%llu mod=%lld outstanding=%llu",
show_root_type(__entry->root_objectid),
__entry->ino, __entry->mod, __entry->outstanding)
);
--
2.53.0
* Re: [PATCH v3 3/4] btrfs: make inode->outstanding_extents a u64
2026-04-07 19:30 ` [PATCH v3 3/4] btrfs: make inode->outstanding_extents a u64 Boris Burkov
@ 2026-04-08 15:04 ` Filipe Manana
0 siblings, 0 replies; 11+ messages in thread
From: Filipe Manana @ 2026-04-08 15:04 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs, kernel-team
On Tue, Apr 7, 2026 at 8:31 PM Boris Burkov <boris@bur.io> wrote:
>
> The maximum file size is MAX_LFS_FILESIZE = (loff_t)LLONG_MAX.
>
> As a result, the maximum extent count computation in btrfs has always been
> bounded above by LLONG_MAX / 128MiB, which is ~2^63 / 2^27 = 2^36. This has
> never fit in a u32. With the recent changes to also divide by 128KiB in
> compressed cases, that bound is even higher. Whether or not an overflow is
> likely to happen, I think it is nice to try to capture the intent in the
> types, so change outstanding_extents to u64, and make
> mod_outstanding_extents try to capture some expectations around the size of
> its inputs.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
> fs/btrfs/btrfs_inode.h | 18 ++++++++++++++----
> fs/btrfs/delalloc-space.c | 19 +++++++++----------
> fs/btrfs/inode.c | 14 +++++++-------
> fs/btrfs/ordered-data.c | 4 ++--
> fs/btrfs/tests/inode-tests.c | 18 +++++++++---------
> include/trace/events/btrfs.h | 8 ++++----
> 6 files changed, 45 insertions(+), 36 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 5368ef87b41a..04825b28f384 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -180,7 +180,7 @@ struct btrfs_inode {
> * items we think we'll end up using, and reserved_extents is the number
> * of extent items we've reserved metadata for. Protected by 'lock'.
> */
> - unsigned outstanding_extents;
> + u64 outstanding_extents;
>
> /* used to order data wrt metadata */
> spinlock_t ordered_tree_lock;
> @@ -429,14 +429,24 @@ static inline bool is_data_inode(const struct btrfs_inode *inode)
> }
>
> static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode,
> - int mod)
> + int mod, u64 nr_extents)
> {
> + s64 delta = mod * (s64)nr_extents;
> +
> lockdep_assert_held(&inode->lock);
> - inode->outstanding_extents += mod;
> + ASSERT(mod == 1 || mod == -1, "mod=%d", mod);
> + ASSERT(nr_extents <= S64_MAX, "nr_extents=%llu", nr_extents);
> + ASSERT(mod == -1 || inode->outstanding_extents <= U64_MAX - nr_extents,
> + "nr_extents=%llu, inode->outstanding_extents=%llu",
> + nr_extents, inode->outstanding_extents);
> + ASSERT(mod == 1 || inode->outstanding_extents >= nr_extents,
> + "mod=%d, nr_extents=%llu, inode->outstanding_extents=%llu",
I don't think there is a need to print mod here; if the assertion
fails, we know it must be -1.
In the previous assertion we don't print mod either, which makes sense
since it must be 1 if that one fails.
With that change:
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Thanks.
> + mod, nr_extents, inode->outstanding_extents);
> + inode->outstanding_extents += delta;
> if (btrfs_is_free_space_inode(inode))
> return;
> trace_btrfs_inode_mod_outstanding_extents(inode->root, btrfs_ino(inode),
> - mod, inode->outstanding_extents);
> + delta, inode->outstanding_extents);
> }
>
> /*
> diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> index 9ff02a7da05b..d50081c8089a 100644
> --- a/fs/btrfs/delalloc-space.c
> +++ b/fs/btrfs/delalloc-space.c
> @@ -289,7 +289,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> u64 reserve_size = 0;
> u64 qgroup_rsv_size = 0;
> - unsigned outstanding_extents;
> + u64 outstanding_extents;
>
> lockdep_assert_held(&inode->lock);
> outstanding_extents = inode->outstanding_extents;
> @@ -316,7 +316,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> *
> * This is overestimating in most cases.
> */
> - qgroup_rsv_size = (u64)outstanding_extents * fs_info->nodesize;
> + qgroup_rsv_size = outstanding_extents * fs_info->nodesize;
>
> spin_lock(&block_rsv->lock);
> block_rsv->size = reserve_size;
> @@ -376,7 +376,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> struct btrfs_fs_info *fs_info = root->fs_info;
> struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> u64 meta_reserve, qgroup_reserve;
> - unsigned nr_extents;
> + u64 nr_extents;
> enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
> int ret = 0;
>
> @@ -430,7 +430,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> */
> nr_extents = btrfs_inode_max_extents(inode, num_bytes);
> spin_lock(&inode->lock);
> - btrfs_mod_outstanding_extents(inode, nr_extents);
> + btrfs_mod_outstanding_extents(inode, 1, nr_extents);
> if (!(inode->flags & BTRFS_INODE_NODATASUM))
> inode->csum_bytes += disk_num_bytes;
> btrfs_calculate_inode_block_rsv_size(fs_info, inode);
> @@ -492,11 +492,11 @@ void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes,
> void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
> {
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
> - unsigned num_extents;
> + u64 num_extents;
>
> spin_lock(&inode->lock);
> num_extents = btrfs_inode_max_extents(inode, num_bytes);
> - btrfs_mod_outstanding_extents(inode, -num_extents);
> + btrfs_mod_outstanding_extents(inode, -1, num_extents);
> btrfs_calculate_inode_block_rsv_size(fs_info, inode);
> spin_unlock(&inode->lock);
>
> @@ -510,16 +510,15 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
> void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len)
> {
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
> - const u32 reserved_num_extents = btrfs_inode_max_extents(inode, reserved_len);
> - const u32 new_num_extents = btrfs_inode_max_extents(inode, new_len);
> - const int diff_num_extents = new_num_extents - reserved_num_extents;
> + const u64 reserved_num_extents = btrfs_inode_max_extents(inode, reserved_len);
> + const u64 new_num_extents = btrfs_inode_max_extents(inode, new_len);
>
> ASSERT(new_len <= reserved_len);
> if (new_num_extents == reserved_num_extents)
> return;
>
> spin_lock(&inode->lock);
> - btrfs_mod_outstanding_extents(inode, diff_num_extents);
> + btrfs_mod_outstanding_extents(inode, -1, reserved_num_extents - new_num_extents);
> btrfs_calculate_inode_block_rsv_size(fs_info, inode);
> spin_unlock(&inode->lock);
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 55255f3794c6..b45a92cfe94e 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2536,7 +2536,7 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
> }
>
> spin_lock(&inode->lock);
> - btrfs_mod_outstanding_extents(inode, 1);
> + btrfs_mod_outstanding_extents(inode, 1, 1);
> spin_unlock(&inode->lock);
> }
>
> @@ -2566,7 +2566,7 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
> /* we're not bigger than the max, unreserve the space and go */
> if (new_size <= max_extent_size) {
> spin_lock(&inode->lock);
> - btrfs_mod_outstanding_extents(inode, -1);
> + btrfs_mod_outstanding_extents(inode, -1, 1);
> spin_unlock(&inode->lock);
> return;
> }
> @@ -2597,7 +2597,7 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
> return;
>
> spin_lock(&inode->lock);
> - btrfs_mod_outstanding_extents(inode, -1);
> + btrfs_mod_outstanding_extents(inode, -1, 1);
> spin_unlock(&inode->lock);
> }
>
> @@ -2666,10 +2666,10 @@ void btrfs_set_delalloc_extent(struct btrfs_inode *inode, struct extent_state *s
> if (!(state->state & EXTENT_DELALLOC) && (bits & EXTENT_DELALLOC)) {
> u64 len = state->end + 1 - state->start;
> u64 prev_delalloc_bytes;
> - u32 num_extents = btrfs_inode_max_extents(inode, len);
> + u64 num_extents = btrfs_inode_max_extents(inode, len);
>
> spin_lock(&inode->lock);
> - btrfs_mod_outstanding_extents(inode, num_extents);
> + btrfs_mod_outstanding_extents(inode, 1, num_extents);
> spin_unlock(&inode->lock);
>
> /* For sanity tests */
> @@ -2712,7 +2712,7 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
> {
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
> u64 len = state->end + 1 - state->start;
> - u32 num_extents = btrfs_inode_max_extents(inode, len);
> + u64 num_extents = btrfs_inode_max_extents(inode, len);
>
> lockdep_assert_held(&inode->io_tree.lock);
>
> @@ -2732,7 +2732,7 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
> u64 new_delalloc_bytes;
>
> spin_lock(&inode->lock);
> - btrfs_mod_outstanding_extents(inode, -num_extents);
> + btrfs_mod_outstanding_extents(inode, -1, num_extents);
> spin_unlock(&inode->lock);
>
> /*
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index e5a24b3ff95e..96ee8ebfdb92 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -223,7 +223,7 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> * smallest the extent is going to get.
> */
> spin_lock(&inode->lock);
> - btrfs_mod_outstanding_extents(inode, 1);
> + btrfs_mod_outstanding_extents(inode, 1, 1);
> spin_unlock(&inode->lock);
>
> out:
> @@ -655,7 +655,7 @@ void btrfs_remove_ordered_extent(struct btrfs_ordered_extent *entry)
> btrfs_lockdep_acquire(fs_info, btrfs_trans_pending_ordered);
> /* This is paired with alloc_ordered_extent(). */
> spin_lock(&btrfs_inode->lock);
> - btrfs_mod_outstanding_extents(btrfs_inode, -1);
> + btrfs_mod_outstanding_extents(btrfs_inode, -1, 1);
> spin_unlock(&btrfs_inode->lock);
> if (root != fs_info->tree_root) {
> u64 release;
> diff --git a/fs/btrfs/tests/inode-tests.c b/fs/btrfs/tests/inode-tests.c
> index b04fbcaf0a1d..e63afbb9be2b 100644
> --- a/fs/btrfs/tests/inode-tests.c
> +++ b/fs/btrfs/tests/inode-tests.c
> @@ -931,7 +931,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
> }
> if (BTRFS_I(inode)->outstanding_extents != 1) {
> ret = -EINVAL;
> - test_err("miscount, wanted 1, got %u",
> + test_err("miscount, wanted 1, got %llu",
> BTRFS_I(inode)->outstanding_extents);
> goto out;
> }
> @@ -946,7 +946,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
> }
> if (BTRFS_I(inode)->outstanding_extents != 2) {
> ret = -EINVAL;
> - test_err("miscount, wanted 2, got %u",
> + test_err("miscount, wanted 2, got %llu",
> BTRFS_I(inode)->outstanding_extents);
> goto out;
> }
> @@ -962,7 +962,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
> }
> if (BTRFS_I(inode)->outstanding_extents != 2) {
> ret = -EINVAL;
> - test_err("miscount, wanted 2, got %u",
> + test_err("miscount, wanted 2, got %llu",
> BTRFS_I(inode)->outstanding_extents);
> goto out;
> }
> @@ -978,7 +978,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
> }
> if (BTRFS_I(inode)->outstanding_extents != 2) {
> ret = -EINVAL;
> - test_err("miscount, wanted 2, got %u",
> + test_err("miscount, wanted 2, got %llu",
> BTRFS_I(inode)->outstanding_extents);
> goto out;
> }
> @@ -996,7 +996,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
> }
> if (BTRFS_I(inode)->outstanding_extents != 4) {
> ret = -EINVAL;
> - test_err("miscount, wanted 4, got %u",
> + test_err("miscount, wanted 4, got %llu",
> BTRFS_I(inode)->outstanding_extents);
> goto out;
> }
> @@ -1013,7 +1013,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
> }
> if (BTRFS_I(inode)->outstanding_extents != 3) {
> ret = -EINVAL;
> - test_err("miscount, wanted 3, got %u",
> + test_err("miscount, wanted 3, got %llu",
> BTRFS_I(inode)->outstanding_extents);
> goto out;
> }
> @@ -1029,7 +1029,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
> }
> if (BTRFS_I(inode)->outstanding_extents != 4) {
> ret = -EINVAL;
> - test_err("miscount, wanted 4, got %u",
> + test_err("miscount, wanted 4, got %llu",
> BTRFS_I(inode)->outstanding_extents);
> goto out;
> }
> @@ -1047,7 +1047,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
> }
> if (BTRFS_I(inode)->outstanding_extents != 3) {
> ret = -EINVAL;
> - test_err("miscount, wanted 3, got %u",
> + test_err("miscount, wanted 3, got %llu",
> BTRFS_I(inode)->outstanding_extents);
> goto out;
> }
> @@ -1061,7 +1061,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
> }
> if (BTRFS_I(inode)->outstanding_extents) {
> ret = -EINVAL;
> - test_err("miscount, wanted 0, got %u",
> + test_err("miscount, wanted 0, got %llu",
> BTRFS_I(inode)->outstanding_extents);
> goto out;
> }
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index 8ad7a2d76c1d..caabdc8d9eed 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -2003,15 +2003,15 @@ DEFINE_EVENT(btrfs__prelim_ref, btrfs_prelim_ref_insert,
> );
>
> TRACE_EVENT(btrfs_inode_mod_outstanding_extents,
> - TP_PROTO(const struct btrfs_root *root, u64 ino, int mod, unsigned outstanding),
> + TP_PROTO(const struct btrfs_root *root, u64 ino, s64 mod, u64 outstanding),
>
> TP_ARGS(root, ino, mod, outstanding),
>
> TP_STRUCT__entry_btrfs(
> __field( u64, root_objectid )
> __field( u64, ino )
> - __field( int, mod )
> - __field( unsigned, outstanding )
> + __field( s64, mod )
> + __field( u64, outstanding )
> ),
>
> TP_fast_assign_btrfs(root->fs_info,
> @@ -2021,7 +2021,7 @@ TRACE_EVENT(btrfs_inode_mod_outstanding_extents,
> __entry->outstanding = outstanding;
> ),
>
> - TP_printk_btrfs("root=%llu(%s) ino=%llu mod=%d outstanding=%u",
> + TP_printk_btrfs("root=%llu(%s) ino=%llu mod=%lld outstanding=%llu",
> show_root_type(__entry->root_objectid),
> __entry->ino, __entry->mod, __entry->outstanding)
> );
> --
> 2.53.0
>
>
* [PATCH v3 4/4] btrfs: cap shrink_delalloc iterations to 128M
2026-04-07 19:30 [PATCH v3 0/4] btrfs: improve stalls under sudden writeback Boris Burkov
` (2 preceding siblings ...)
2026-04-07 19:30 ` [PATCH v3 3/4] btrfs: make inode->outstanding_extents a u64 Boris Burkov
@ 2026-04-07 19:30 ` Boris Burkov
3 siblings, 0 replies; 11+ messages in thread
From: Boris Burkov @ 2026-04-07 19:30 UTC (permalink / raw)
To: linux-btrfs, kernel-team
Even with more accurate delayed_refs reservations, preemptive reclaim is
not perfect and we might still generate tickets, especially with a very
large flood of outstanding writeback.
Ultimately, if we do get into a situation with tickets pending and async
reclaim blocking the system, we want to try to make as much progress as
quickly as possible to unblock tasks. We want space reclaim to be
effective, and to have a good chance at making progress, but not to
block arbitrarily as this leads to untenable syscall latencies, long
commits, and even hung task warnings.
I traced such cases of hung tasks during heavy-writeback async reclaim
and observed that we were blocking for long periods of time in
shrink_delalloc(). This was particularly bad when doing writeback of
incompressible data with the compress-force mount option.
e.g.
dd if=/dev/urandom of=urandom.seed bs=1G count=1
dd if=urandom.seed of=urandom.big bs=1G count=300
shrink_delalloc() computes to_reclaim as delalloc_bytes >> 3. With
hundreds of GiB of delalloc (again, imagine a large dirty_ratio and lots
of RAM), this is still 10-20+ GiB. Particularly in the wait phases, this
can be quite slow, and it generates even more delayed refs as mentioned
in the previous patch, so it doesn't even help that much with the
immediate space shortfall.
We do satisfy some tickets, but we ultimately keep the system in
essentially the same state, with long stalling reclaim calls into
shrink_delalloc().
It would be much better to start some good chunk of I/O and also to work
through the new delayed_refs and keep things moving through the system
while releasing the conservative over-estimated metadata reservations.
To achieve this, tighten up the delalloc work to be in units of the
maximum extent size. If we issue 128MiB of delalloc, we don't leave too
much (any?) extent merging on the table, but don't ever block on
pathological 10GiB+ chunks of delalloc. If we do detect that we
satisfied a ticket, break out of shrink_delalloc() and run some of the
new delayed_refs as well before going again. This way we strike a nice
balance of making delalloc progress, but not at the cost of every other
sort of reservation, as they all feed into each other.
This means iterating over to_reclaim by 128MiB at a time until it is
drained or we satisfy a ticket, rather than trying 3 times to do the
whole thing.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/space-info.c | 31 ++++++++++++++++++++-----------
1 file changed, 20 insertions(+), 11 deletions(-)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index f0436eea1544..e931deb3d013 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -725,9 +725,8 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
struct btrfs_trans_handle *trans;
u64 delalloc_bytes;
u64 ordered_bytes;
- u64 items;
long time_left;
- int loops;
+ u64 orig_tickets_id;
delalloc_bytes = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);
@@ -735,9 +734,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
return;
/* Calc the number of the pages we need flush for space reservation */
- if (to_reclaim == U64_MAX) {
- items = U64_MAX;
- } else {
+ if (to_reclaim != U64_MAX) {
/*
* to_reclaim is set to however much metadata we need to
* reclaim, but reclaiming that much data doesn't really track
@@ -751,7 +748,6 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
* aggressive.
*/
to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
- items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
}
trans = current->journal_info;
@@ -764,10 +760,14 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
if (ordered_bytes > delalloc_bytes && !for_preempt)
wait_ordered = true;
- loops = 0;
- while ((delalloc_bytes || ordered_bytes) && loops < 3) {
- u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
- long nr_pages = min_t(u64, temp, LONG_MAX);
+ spin_lock(&space_info->lock);
+ orig_tickets_id = space_info->tickets_id;
+ spin_unlock(&space_info->lock);
+
+ while ((delalloc_bytes || ordered_bytes) && to_reclaim) {
+ u64 iter_reclaim = min_t(u64, to_reclaim, SZ_128M);
+ long nr_pages = min_t(u64, delalloc_bytes, iter_reclaim) >> PAGE_SHIFT;
+ u64 items = calc_reclaim_items_nr(fs_info, iter_reclaim) * 2;
int async_pages;
btrfs_start_delalloc_roots(fs_info, nr_pages, true);
@@ -811,7 +811,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
atomic_read(&fs_info->async_delalloc_pages) <=
async_pages);
skip_async:
- loops++;
+ to_reclaim -= iter_reclaim;
if (wait_ordered && !trans) {
btrfs_wait_ordered_roots(fs_info, items, NULL);
} else {
@@ -834,6 +834,15 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
spin_unlock(&space_info->lock);
break;
}
+ /*
+ * If a ticket was satisfied since we started, break out
+ * so the async reclaim state machine can process delayed
+ * refs before we flush more delalloc.
+ */
+ if (space_info->tickets_id != orig_tickets_id) {
+ spin_unlock(&space_info->lock);
+ break;
+ }
spin_unlock(&space_info->lock);
delalloc_bytes = percpu_counter_sum_positive(
--
2.53.0