public inbox for linux-btrfs@vger.kernel.org
* [PATCH 0/5] btrfs: improve stalls under sudden writeback
@ 2026-03-25  0:41 Boris Burkov
  2026-03-25  0:41 ` [PATCH 1/5] btrfs: reserve space for delayed_refs in delalloc Boris Burkov
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Boris Burkov @ 2026-03-25  0:41 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

If you have a system with very large memory (TiBs) and a normal
percentage based dirty_ratio/dirty_background_ratio like the defaults of
20%/10%, then we can theoretically rack up 100s of GiB of dirty pages
before doing any writeback. This is further exacerbated if we also see a
sudden drop in free memory due to a large allocation. If we also have a
large disk (relatively likely for a large RAM system), we are unlikely
to trigger much preemptive metadata reclaim either.

Once we do start doing writeback with such a large supply, the results
are somewhat ugly. The delalloc work generates a huge amount of delayed
refs without proper reservations, which sends the metadata space
reclaim system into a tailspin trying to run yet more delalloc to free
space. Ultimately, the system stalls waiting for huge amounts of
ordered extents and delayed refs, blocking all users in
start_transaction() on tickets in reserve_space().

This patch series aims to address these issues in a relatively targeted
way by improving our reservations for delalloc delayed refs and by doing
some very basic smoothing of the work in flush_space(). Further work
could be done to improve flush_space() heuristics and latency but this
is already a big help on my observed workloads.

I was able to reproduce stalls on a more "modest" system with 264GiB of
ram by using a somewhat silly 80% dirty_ratio.

I was unfortunately unable to reproduce any stalls on a yet smaller
system with only 32GiB of ram.

The first 3 patches do the delayed_ref rsv accounting on btrfs_inode,
mirroring inode->block_rsv.
The 4th patch is a cleanup of the types used for counting max extents.
The 5th patch reduces the size of the unit of work in shrink_delalloc()
to further reduce stalls.

Boris Burkov (5):
  btrfs: reserve space for delayed_refs in delalloc
  btrfs: account for csum delayed_refs in delalloc
  btrfs: account for compression in delalloc extent reservation
  btrfs: make inode->outstanding_extents a u64
  btrfs: cap shrink_delalloc iterations to 128M

 fs/btrfs/btrfs_inode.h       | 20 ++++++--
 fs/btrfs/delalloc-space.c    | 75 +++++++++++++++++++++-------
 fs/btrfs/delayed-ref.c       |  2 +-
 fs/btrfs/fs.h                | 13 -----
 fs/btrfs/inode.c             | 97 ++++++++++++++++++++++++++++--------
 fs/btrfs/ordered-data.c      |  4 +-
 fs/btrfs/space-info.c        | 31 ++++++++----
 fs/btrfs/tests/inode-tests.c | 18 +++----
 fs/btrfs/transaction.c       |  7 +--
 fs/btrfs/transaction.h       |  3 +-
 include/trace/events/btrfs.h |  8 +--
 11 files changed, 193 insertions(+), 85 deletions(-)

-- 
2.53.0



* [PATCH 1/5] btrfs: reserve space for delayed_refs in delalloc
  2026-03-25  0:41 [PATCH 0/5] btrfs: improve stalls under sudden writeback Boris Burkov
@ 2026-03-25  0:41 ` Boris Burkov
  2026-03-25 15:36   ` Filipe Manana
  2026-03-25  0:41 ` [PATCH 2/5] btrfs: account for csum " Boris Burkov
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Boris Burkov @ 2026-03-25  0:41 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

delalloc uses a per-inode block_rsv to perform metadata reservations for
the cow operations it anticipates based on the number of outstanding
extents. This calculation is done based on inode->outstanding_extents in
btrfs_calculate_inode_block_rsv_size(). The reservation is *not*
meticulously tracked as each ordered_extent is actually created in
writeback, but rather delalloc attempts to over-estimate and the
writeback and ordered_extent finish portions are responsible to release
all the reservation.

However, there is a significant gap in this reservation: it reserves no
space for the resulting delayed refs. This is a notable difference from
how btrfs_start_transaction() reservations work.

As writeback actually occurs, and we trigger btrfs_finish_one_ordered(),
that function will start generating delayed refs, which will draw from
the trans_handle's delayed_refs_rsv via btrfs_update_delayed_refs_rsv():

btrfs_finish_one_ordered()
  insert_ordered_extent_file_extent()
    insert_reserved_file_extent()
      btrfs_alloc_reserved_file_extent()
        btrfs_add_delayed_data_ref()
          add_delayed_ref()
            btrfs_update_delayed_refs_rsv();

This trans_handle was created in btrfs_finish_one_ordered() with
btrfs_join_transaction(), which calls start_transaction() with
num_items=0 and BTRFS_RESERVE_NO_FLUSH. As a result, this trans_handle
has nothing reserved in h->delayed_rsv, as neither the num_items
reservation nor the btrfs_delayed_refs_rsv_refill() reservation is run.

Thus, when btrfs_update_delayed_refs_rsv() runs, reserved_bytes is 0,
so fs_info->delayed_refs_rsv->size grows but
fs_info->delayed_refs_rsv->reserved does not.

If a large amount of writeback happens all at once (perhaps due to
dirty_ratio being tuned too high), this results in, among other things,
erroneous assessments of the amount of reserved delayed_refs space in
the metadata space reclaim logic. need_preemptive_reclaim() relies on
fs_info->delayed_refs_rsv->reserved, and even worse,
btrfs_preempt_reclaim_metadata_space() makes poor decisions because it
counts delalloc_bytes like so:

  block_rsv_size = global_rsv_size +
          btrfs_block_rsv_reserved(delayed_block_rsv) +
          btrfs_block_rsv_reserved(delayed_refs_rsv) +
          btrfs_block_rsv_reserved(trans_rsv);
  delalloc_size = bytes_may_use - block_rsv_size;

So all that lost delayed refs usage gets accounted as delalloc_size and
leads to preemptive reclaim continuously choosing FLUSH_DELALLOC, which
further exacerbates the problem.

With enough writeback around, we can run enough delalloc that we get
into async reclaim, which starts blocking start_transaction() and
eventually hits FLUSH_DELALLOC_WAIT/FLUSH_DELALLOC_FULL, at which point
the filesystem gets heavily blocked on metadata space in reserve_space(),
blocking all new transaction work until all the ordered_extents finish.

If we had an accurate view of the reservation for delayed refs, then we
could mostly break this feedback loop in preemptive reclaim, and
generally would be able to make more accurate decisions with regards to
metadata space reclamation.

This patch introduces a per-inode delayed_refs rsv, modeled closely
after the one in trans_handle. The delalloc reservation now also
reserves delayed refs, and btrfs_finish_one_ordered() transfers the
inode's delayed_refs rsv into the trans_handle's, just like
inode->block_rsv.

This is not a perfect fix for the most pathological cases, but is the
infrastructure needed to keep working on the problem.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/btrfs_inode.h    |  3 +++
 fs/btrfs/delalloc-space.c | 34 ++++++++++++++++++++++++++++++----
 fs/btrfs/delayed-ref.c    |  2 +-
 fs/btrfs/inode.c          |  9 ++++++++-
 fs/btrfs/transaction.c    |  7 ++++---
 fs/btrfs/transaction.h    |  3 ++-
 6 files changed, 48 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 55c272fe5d92..dca4f6df7e95 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -328,6 +328,9 @@ struct btrfs_inode {
 
 	struct btrfs_block_rsv block_rsv;
 
+	/* Reserve for delayed refs generated by ordered extent completion. */
+	struct btrfs_block_rsv delayed_rsv;
+
 	struct btrfs_delayed_node *delayed_node;
 
 	/* File creation time. */
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index 0970799d0aa4..e2944ff4fe47 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -3,6 +3,7 @@
 #include "messages.h"
 #include "ctree.h"
 #include "delalloc-space.h"
+#include "delayed-ref.h"
 #include "block-rsv.h"
 #include "btrfs_inode.h"
 #include "space-info.h"
@@ -240,6 +241,13 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
 	if (released > 0)
 		trace_btrfs_space_reservation(fs_info, "delalloc",
 					      btrfs_ino(inode), released, 0);
+
+	released = btrfs_block_rsv_release(fs_info, &inode->delayed_rsv,
+					   0, NULL);
+	if (released > 0)
+		trace_btrfs_space_reservation(fs_info, "delalloc_delayed_refs",
+					      btrfs_ino(inode), released, 0);
+
 	if (qgroup_free)
 		btrfs_qgroup_free_meta_prealloc(inode->root, qgroup_to_release);
 	else
@@ -251,7 +259,9 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 						 struct btrfs_inode *inode)
 {
 	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
+	struct btrfs_block_rsv *delayed_rsv = &inode->delayed_rsv;
 	u64 reserve_size = 0;
+	u64 delayed_refs_size = 0;
 	u64 qgroup_rsv_size = 0;
 	unsigned outstanding_extents;
 
@@ -266,6 +276,8 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 		reserve_size = btrfs_calc_insert_metadata_size(fs_info,
 						outstanding_extents);
 		reserve_size += btrfs_calc_metadata_size(fs_info, 1);
+		delayed_refs_size += btrfs_calc_delayed_ref_bytes(fs_info,
+						outstanding_extents);
 	}
 	if (!(inode->flags & BTRFS_INODE_NODATASUM)) {
 		u64 csum_leaves;
@@ -285,11 +297,17 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 	block_rsv->size = reserve_size;
 	block_rsv->qgroup_rsv_size = qgroup_rsv_size;
 	spin_unlock(&block_rsv->lock);
+
+	spin_lock(&delayed_rsv->lock);
+	delayed_rsv->size = delayed_refs_size;
+	spin_unlock(&delayed_rsv->lock);
 }
 
 static void calc_inode_reservations(struct btrfs_inode *inode,
 				    u64 num_bytes, u64 disk_num_bytes,
-				    u64 *meta_reserve, u64 *qgroup_reserve)
+				    u64 *meta_reserve,
+				    u64 *delayed_refs_reserve,
+				    u64 *qgroup_reserve)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	u64 nr_extents = count_max_extents(fs_info, num_bytes);
@@ -309,6 +327,10 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
 	 * for an inode update.
 	 */
 	*meta_reserve += inode_update;
+
+	*delayed_refs_reserve = btrfs_calc_delayed_ref_bytes(fs_info,
+							     nr_extents);
+
 	*qgroup_reserve = nr_extents * fs_info->nodesize;
 }
 
@@ -318,7 +340,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
-	u64 meta_reserve, qgroup_reserve;
+	u64 meta_reserve, delayed_refs_reserve, qgroup_reserve;
 	unsigned nr_extents;
 	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
 	int ret = 0;
@@ -353,12 +375,14 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 	 * over-reserve slightly, and clean up the mess when we are done.
 	 */
 	calc_inode_reservations(inode, num_bytes, disk_num_bytes,
-				&meta_reserve, &qgroup_reserve);
+				&meta_reserve, &delayed_refs_reserve,
+				&qgroup_reserve);
 	ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_reserve, true,
 						 noflush);
 	if (ret)
 		return ret;
-	ret = btrfs_reserve_metadata_bytes(block_rsv->space_info, meta_reserve,
+	ret = btrfs_reserve_metadata_bytes(block_rsv->space_info,
+					   meta_reserve + delayed_refs_reserve,
 					   flush);
 	if (ret) {
 		btrfs_qgroup_free_meta_prealloc(root, qgroup_reserve);
@@ -383,6 +407,8 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 	btrfs_block_rsv_add_bytes(block_rsv, meta_reserve, false);
 	trace_btrfs_space_reservation(root->fs_info, "delalloc",
 				      btrfs_ino(inode), meta_reserve, 1);
+	btrfs_block_rsv_add_bytes(&inode->delayed_rsv, delayed_refs_reserve,
+				  false);
 
 	spin_lock(&block_rsv->lock);
 	block_rsv->qgroup_rsv_reserved += qgroup_reserve;
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 605858c2d9a9..9fe9cec1bef3 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -89,7 +89,7 @@ void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_refs_rsv;
-	struct btrfs_block_rsv *local_rsv = &trans->delayed_rsv;
+	struct btrfs_block_rsv *local_rsv = trans->delayed_rsv;
 	u64 num_bytes;
 	u64 reserved_bytes;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1a4e6a9239ae..1f0f3282e4b8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -653,6 +653,7 @@ static noinline int __cow_file_range_inline(struct btrfs_inode *inode,
 		goto out;
 	}
 	trans->block_rsv = &inode->block_rsv;
+	trans->delayed_rsv = &inode->delayed_rsv;
 
 	drop_args.path = path;
 	drop_args.start = 0;
@@ -3256,6 +3257,7 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
 	}
 
 	trans->block_rsv = &inode->block_rsv;
+	trans->delayed_rsv = &inode->delayed_rsv;
 
 	ret = btrfs_insert_raid_extent(trans, ordered_extent);
 	if (unlikely(ret)) {
@@ -8074,9 +8076,12 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 
 	spin_lock_init(&ei->lock);
 	ei->outstanding_extents = 0;
-	if (sb->s_magic != BTRFS_TEST_MAGIC)
+	if (sb->s_magic != BTRFS_TEST_MAGIC) {
 		btrfs_init_metadata_block_rsv(fs_info, &ei->block_rsv,
 					      BTRFS_BLOCK_RSV_DELALLOC);
+		btrfs_init_metadata_block_rsv(fs_info, &ei->delayed_rsv,
+					      BTRFS_BLOCK_RSV_DELREFS);
+	}
 	ei->runtime_flags = 0;
 	ei->prop_compress = BTRFS_COMPRESS_NONE;
 	ei->defrag_compress = BTRFS_COMPRESS_NONE;
@@ -8132,6 +8137,8 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
 	WARN_ON(vfs_inode->i_data.nrpages);
 	WARN_ON(inode->block_rsv.reserved);
 	WARN_ON(inode->block_rsv.size);
+	WARN_ON(inode->delayed_rsv.reserved);
+	WARN_ON(inode->delayed_rsv.size);
 	WARN_ON(inode->outstanding_extents);
 	if (!S_ISDIR(vfs_inode->i_mode)) {
 		WARN_ON(inode->delalloc_bytes);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 4358f4b63057..a55f8996cd59 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -737,7 +737,8 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 
 	h->type = type;
 	INIT_LIST_HEAD(&h->new_bgs);
-	btrfs_init_metadata_block_rsv(fs_info, &h->delayed_rsv, BTRFS_BLOCK_RSV_DELREFS);
+	h->delayed_rsv = &h->_local_delayed_rsv;
+	btrfs_init_metadata_block_rsv(fs_info, h->delayed_rsv, BTRFS_BLOCK_RSV_DELREFS);
 
 	smp_mb();
 	if (cur_trans->state >= TRANS_STATE_COMMIT_START &&
@@ -758,7 +759,7 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 						      h->transid,
 						      delayed_refs_bytes, 1);
 			h->delayed_refs_bytes_reserved = delayed_refs_bytes;
-			btrfs_block_rsv_add_bytes(&h->delayed_rsv, delayed_refs_bytes, true);
+			btrfs_block_rsv_add_bytes(h->delayed_rsv, delayed_refs_bytes, true);
 			delayed_refs_bytes = 0;
 		}
 		h->reloc_reserved = reloc_reserved;
@@ -1067,7 +1068,7 @@ static void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans)
 	trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
 				      trans->transid,
 				      trans->delayed_refs_bytes_reserved, 0);
-	btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
+	btrfs_block_rsv_release(fs_info, trans->delayed_rsv,
 				trans->delayed_refs_bytes_reserved, NULL);
 	trans->delayed_refs_bytes_reserved = 0;
 }
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 7d70fe486758..268a415c4f32 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -162,7 +162,8 @@ struct btrfs_trans_handle {
 	bool in_fsync;
 	struct btrfs_fs_info *fs_info;
 	struct list_head new_bgs;
-	struct btrfs_block_rsv delayed_rsv;
+	struct btrfs_block_rsv *delayed_rsv;
+	struct btrfs_block_rsv _local_delayed_rsv;
 	/* Extent buffers with writeback inhibited by this handle. */
 	struct xarray writeback_inhibited_ebs;
 };
-- 
2.53.0



* [PATCH 2/5] btrfs: account for csum delayed_refs in delalloc
  2026-03-25  0:41 [PATCH 0/5] btrfs: improve stalls under sudden writeback Boris Burkov
  2026-03-25  0:41 ` [PATCH 1/5] btrfs: reserve space for delayed_refs in delalloc Boris Burkov
@ 2026-03-25  0:41 ` Boris Burkov
  2026-03-25  0:41 ` [PATCH 3/5] btrfs: account for compression in delalloc extent reservation Boris Burkov
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Boris Burkov @ 2026-03-25  0:41 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

As ordered_extents complete, they not only produce direct delayed refs,
they also add csums via add_pending_csums(). This produces some number
of metadata delayed_refs as the csum tree is COWed. These refs are
counted against the trans_handle delayed_rsv, thanks to
trans->adding_csums.

As a result, just like we account for the extent tree and free space
tree when reserving delayed_refs for delalloc, we must also reserve for
the csum tree. This is mirrored by the non-delayed-ref metadata
reservation already accounting for csums.

This serves to ensure we have a proper worst-case estimate for
delayed_rsv from delalloc.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/delalloc-space.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index e2944ff4fe47..2eeafada96ec 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -255,6 +255,20 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
 						   qgroup_to_release);
 }
 
+/*
+ * ordered_extent completion will generate metadata delayed refs for
+ * the extent, free_space, and csum trees. btrfs_calc_delayed_ref_bytes()
+ * accounts for the former two, and we explicitly reserve for the latter.
+ * This ensures that we reserve enough delayed_ref space for each
+ * ordered_extent.
+ */
+static u64 delalloc_calc_delayed_refs_rsv(const struct btrfs_fs_info *fs_info,
+					  unsigned int nr_extents)
+{
+	return btrfs_calc_delayed_ref_bytes(fs_info, nr_extents) +
+		btrfs_calc_insert_metadata_size(fs_info, nr_extents);
+}
+
 static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 						 struct btrfs_inode *inode)
 {
@@ -276,8 +290,8 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 		reserve_size = btrfs_calc_insert_metadata_size(fs_info,
 						outstanding_extents);
 		reserve_size += btrfs_calc_metadata_size(fs_info, 1);
-		delayed_refs_size += btrfs_calc_delayed_ref_bytes(fs_info,
-						outstanding_extents);
+		delayed_refs_size +=
+			delalloc_calc_delayed_refs_rsv(fs_info, outstanding_extents);
 	}
 	if (!(inode->flags & BTRFS_INODE_NODATASUM)) {
 		u64 csum_leaves;
@@ -328,8 +342,7 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
 	 */
 	*meta_reserve += inode_update;
 
-	*delayed_refs_reserve = btrfs_calc_delayed_ref_bytes(fs_info,
-							     nr_extents);
+	*delayed_refs_reserve = delalloc_calc_delayed_refs_rsv(fs_info, nr_extents);
 
 	*qgroup_reserve = nr_extents * fs_info->nodesize;
 }
-- 
2.53.0



* [PATCH 3/5] btrfs: account for compression in delalloc extent reservation
  2026-03-25  0:41 [PATCH 0/5] btrfs: improve stalls under sudden writeback Boris Burkov
  2026-03-25  0:41 ` [PATCH 1/5] btrfs: reserve space for delayed_refs in delalloc Boris Burkov
  2026-03-25  0:41 ` [PATCH 2/5] btrfs: account for csum " Boris Burkov
@ 2026-03-25  0:41 ` Boris Burkov
  2026-03-25  0:41 ` [PATCH 4/5] btrfs: make inode->outstanding_extents a u64 Boris Burkov
  2026-03-25  0:41 ` [PATCH 5/5] btrfs: cap shrink_delalloc iterations to 128M Boris Burkov
  4 siblings, 0 replies; 10+ messages in thread
From: Boris Burkov @ 2026-03-25  0:41 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

The btrfs maximum uncompressed extent size is 128MiB. The maximum
compressed extent size in file extent space is 128KiB. Therefore, the
estimate for outstanding_extents is off by 3 orders of magnitude when
COMPRESS_FORCE is set or the inode is set to always compress.

Because we recalculate when necessary, rather than tracking extents in
detail, we don't grow this reservation as the true number of extents is
revealed. We don't want to be too clever here, however: the calculation
must not change for a given inode between reservation and release, so
we rely only on the forcing type flags.

With this change, we no longer under-reserve delayed refs reservations
for delalloc writes, even with compress-force.

Because this would turn count_max_extents() into a named shim for
div_u64(size + max_extent_size - 1, max_extent_size);
we can just get rid of it.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/btrfs_inode.h    |  3 ++
 fs/btrfs/delalloc-space.c | 13 ++++---
 fs/btrfs/fs.h             | 13 -------
 fs/btrfs/inode.c          | 78 ++++++++++++++++++++++++++++++++-------
 4 files changed, 74 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index dca4f6df7e95..cfeda43b01d7 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -513,6 +513,9 @@ static inline bool btrfs_inode_can_compress(const struct btrfs_inode *inode)
 	return true;
 }
 
+u64 btrfs_inode_max_extent_size(const struct btrfs_inode *inode);
+u64 btrfs_inode_max_extents(const struct btrfs_inode *inode, u64 size);
+
 static inline void btrfs_assert_inode_locked(struct btrfs_inode *inode)
 {
 	/* Immediately trigger a crash if the inode is not locked. */
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index 2eeafada96ec..2ceae1065f2c 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -6,6 +6,7 @@
 #include "delayed-ref.h"
 #include "block-rsv.h"
 #include "btrfs_inode.h"
+#include "compression.h"
 #include "space-info.h"
 #include "qgroup.h"
 #include "fs.h"
@@ -64,7 +65,7 @@
  *     This is the number of file extent items we'll need to handle all of the
  *     outstanding DELALLOC space we have in this inode.  We limit the maximum
  *     size of an extent, so a large contiguous dirty area may require more than
- *     one outstanding_extent, which is why count_max_extents() is used to
+ *     one outstanding_extent, which is why we use the max extent size to
  *     determine how many outstanding_extents get added.
  *
  *   ->csum_bytes
@@ -324,7 +325,7 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
 				    u64 *qgroup_reserve)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	u64 nr_extents = count_max_extents(fs_info, num_bytes);
+	u64 nr_extents = btrfs_inode_max_extents(inode, num_bytes);
 	u64 csum_leaves;
 	u64 inode_update = btrfs_calc_metadata_size(fs_info, 1);
 
@@ -408,7 +409,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 	 * racing with an ordered completion or some such that would think it
 	 * needs to free the reservation we just made.
 	 */
-	nr_extents = count_max_extents(fs_info, num_bytes);
+	nr_extents = btrfs_inode_max_extents(inode, num_bytes);
 	spin_lock(&inode->lock);
 	btrfs_mod_outstanding_extents(inode, nr_extents);
 	if (!(inode->flags & BTRFS_INODE_NODATASUM))
@@ -477,7 +478,7 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
 	unsigned num_extents;
 
 	spin_lock(&inode->lock);
-	num_extents = count_max_extents(fs_info, num_bytes);
+	num_extents = btrfs_inode_max_extents(inode, num_bytes);
 	btrfs_mod_outstanding_extents(inode, -num_extents);
 	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
 	spin_unlock(&inode->lock);
@@ -492,8 +493,8 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
 void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	const u32 reserved_num_extents = count_max_extents(fs_info, reserved_len);
-	const u32 new_num_extents = count_max_extents(fs_info, new_len);
+	const u32 reserved_num_extents = btrfs_inode_max_extents(inode, reserved_len);
+	const u32 new_num_extents = btrfs_inode_max_extents(inode, new_len);
 	const int diff_num_extents = new_num_extents - reserved_num_extents;
 
 	ASSERT(new_len <= reserved_len);
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index a4758d94b32e..2c1626155645 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -1051,19 +1051,6 @@ static inline bool btrfs_is_zoned(const struct btrfs_fs_info *fs_info)
 	return IS_ENABLED(CONFIG_BLK_DEV_ZONED) && fs_info->zone_size > 0;
 }
 
-/*
- * Count how many fs_info->max_extent_size cover the @size
- */
-static inline u32 count_max_extents(const struct btrfs_fs_info *fs_info, u64 size)
-{
-#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
-	if (!fs_info)
-		return div_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, BTRFS_MAX_EXTENT_SIZE);
-#endif
-
-	return div_u64(size + fs_info->max_extent_size - 1, fs_info->max_extent_size);
-}
-
 static inline unsigned int btrfs_blocks_per_folio(const struct btrfs_fs_info *fs_info,
 						  const struct folio *folio)
 {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1f0f3282e4b8..e567b23efe39 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -747,6 +747,56 @@ static int add_async_extent(struct async_chunk *cow, u64 start, u64 ram_size,
 	return 0;
 }
 
+/*
+ * Check if compression will definitely be attempted for this inode based on
+ * mount options and inode properties.  Unlike inode_need_compress(), this does
+ * NOT run the compression heuristic or check range-specific conditions, so it
+ * is safe to call under locks (e.g. io_tree lock) and for reservation sizing.
+ *
+ * Only returns true for cases where BTRFS_INODE_NOCOMPRESS cannot be set at
+ * runtime (FORCE_COMPRESS and prop_compress), ensuring that the effective max
+ * extent size is stable across paired set/clear delalloc operations.
+ */
+static inline bool inode_may_compress(const struct btrfs_inode *inode)
+{
+	if (!btrfs_inode_can_compress(inode))
+		return false;
+
+	/* force compress always attempts compression */
+	if (btrfs_test_opt(inode->root->fs_info, FORCE_COMPRESS))
+		return true;
+
+	/* per-inode property: NOCOMPRESS cannot override this */
+	if (inode->prop_compress)
+		return true;
+
+	return false;
+}
+
+/*
+ * Return the effective maximum extent size for reservation accounting.
+ *
+ * When compression is guaranteed to be attempted (FORCE_COMPRESS or
+ * prop_compress), the compression path splits ranges into
+ * BTRFS_MAX_UNCOMPRESSED chunks, each producing an independent ordered
+ * extent.  Use that as the divisor instead of fs_info->max_extent_size
+ * to avoid severely undercounting outstanding extents.
+ */
+u64 btrfs_inode_max_extent_size(const struct btrfs_inode *inode)
+{
+	if (inode_may_compress(inode))
+		return BTRFS_MAX_UNCOMPRESSED;
+
+	return inode->root->fs_info->max_extent_size;
+}
+
+u64 btrfs_inode_max_extents(const struct btrfs_inode *inode, u64 size)
+{
+	u64 max_extent_size = btrfs_inode_max_extent_size(inode);
+
+	return div_u64(size + max_extent_size - 1, max_extent_size);
+}
+
 /*
  * Check if the inode needs to be submitted to compression, based on mount
  * options, defragmentation, properties or heuristics.
@@ -2459,8 +2509,8 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct folio *locked_fol
 void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
 				 struct extent_state *orig, u64 split)
 {
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	u64 size;
+	u64 max_extent_size = btrfs_inode_max_extent_size(inode);
 
 	lockdep_assert_held(&inode->io_tree.lock);
 
@@ -2469,8 +2519,8 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
 		return;
 
 	size = orig->end - orig->start + 1;
-	if (size > fs_info->max_extent_size) {
-		u32 num_extents;
+	if (size > max_extent_size) {
+		u64 num_extents;
 		u64 new_size;
 
 		/*
@@ -2478,10 +2528,10 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
 		 * applies here, just in reverse.
 		 */
 		new_size = orig->end - split + 1;
-		num_extents = count_max_extents(fs_info, new_size);
+		num_extents = btrfs_inode_max_extents(inode, new_size);
 		new_size = split - orig->start;
-		num_extents += count_max_extents(fs_info, new_size);
-		if (count_max_extents(fs_info, size) >= num_extents)
+		num_extents += btrfs_inode_max_extents(inode, new_size);
+		if (btrfs_inode_max_extents(inode, size) >= num_extents)
 			return;
 	}
 
@@ -2498,9 +2548,9 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
 void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state *new,
 				 struct extent_state *other)
 {
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	u64 new_size, old_size;
-	u32 num_extents;
+	u64 max_extent_size = btrfs_inode_max_extent_size(inode);
+	u64 num_extents;
 
 	lockdep_assert_held(&inode->io_tree.lock);
 
@@ -2514,7 +2564,7 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
 		new_size = other->end - new->start + 1;
 
 	/* we're not bigger than the max, unreserve the space and go */
-	if (new_size <= fs_info->max_extent_size) {
+	if (new_size <= max_extent_size) {
 		spin_lock(&inode->lock);
 		btrfs_mod_outstanding_extents(inode, -1);
 		spin_unlock(&inode->lock);
@@ -2540,10 +2590,10 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
 	 * this case.
 	 */
 	old_size = other->end - other->start + 1;
-	num_extents = count_max_extents(fs_info, old_size);
+	num_extents = btrfs_inode_max_extents(inode, old_size);
 	old_size = new->end - new->start + 1;
-	num_extents += count_max_extents(fs_info, old_size);
-	if (count_max_extents(fs_info, new_size) >= num_extents)
+	num_extents += btrfs_inode_max_extents(inode, old_size);
+	if (btrfs_inode_max_extents(inode, new_size) >= num_extents)
 		return;
 
 	spin_lock(&inode->lock);
@@ -2616,7 +2666,7 @@ void btrfs_set_delalloc_extent(struct btrfs_inode *inode, struct extent_state *s
 	if (!(state->state & EXTENT_DELALLOC) && (bits & EXTENT_DELALLOC)) {
 		u64 len = state->end + 1 - state->start;
 		u64 prev_delalloc_bytes;
-		u32 num_extents = count_max_extents(fs_info, len);
+		u32 num_extents = btrfs_inode_max_extents(inode, len);
 
 		spin_lock(&inode->lock);
 		btrfs_mod_outstanding_extents(inode, num_extents);
@@ -2662,7 +2712,7 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	u64 len = state->end + 1 - state->start;
-	u32 num_extents = count_max_extents(fs_info, len);
+	u32 num_extents = btrfs_inode_max_extents(inode, len);
 
 	lockdep_assert_held(&inode->io_tree.lock);
 
-- 
2.53.0



* [PATCH 4/5] btrfs: make inode->outstanding_extents a u64
  2026-03-25  0:41 [PATCH 0/5] btrfs: improve stalls under sudden writeback Boris Burkov
                   ` (2 preceding siblings ...)
  2026-03-25  0:41 ` [PATCH 3/5] btrfs: account for compression in delalloc extent reservation Boris Burkov
@ 2026-03-25  0:41 ` Boris Burkov
  2026-03-25  0:41 ` [PATCH 5/5] btrfs: cap shrink_delalloc iterations to 128M Boris Burkov
  4 siblings, 0 replies; 10+ messages in thread
From: Boris Burkov @ 2026-03-25  0:41 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

The maximum file size is MAX_LFS_FILESIZE = (loff_t)LLONG_MAX.

As a result, the maximum extent count in btrfs has always been bounded
above by LLONG_MAX / 128MiB, which is ~2^63 / 2^27 = 2^36. This has
never fit in a u32. With the recent changes to also divide by 128KiB in
compressed cases, that bound is even higher. Whether or not this is
likely to happen in practice, it is nice to capture the intent in the
types, so change outstanding_extents to u64 and make
btrfs_mod_outstanding_extents() capture some expectations about the
size of its inputs.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/btrfs_inode.h       | 14 ++++++++++----
 fs/btrfs/delalloc-space.c    | 21 ++++++++++-----------
 fs/btrfs/inode.c             | 14 +++++++-------
 fs/btrfs/ordered-data.c      |  4 ++--
 fs/btrfs/tests/inode-tests.c | 18 +++++++++---------
 include/trace/events/btrfs.h |  8 ++++----
 6 files changed, 42 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index cfeda43b01d7..af7d7244a94b 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -180,7 +180,7 @@ struct btrfs_inode {
 	 * items we think we'll end up using, and reserved_extents is the number
 	 * of extent items we've reserved metadata for. Protected by 'lock'.
 	 */
-	unsigned outstanding_extents;
+	u64 outstanding_extents;
 
 	/* used to order data wrt metadata */
 	spinlock_t ordered_tree_lock;
@@ -432,14 +432,20 @@ static inline bool is_data_inode(const struct btrfs_inode *inode)
 }
 
 static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode,
-						 int mod)
+						 int mod, u64 nr_extents)
 {
+	s64 delta = mod * (s64)nr_extents;
+
 	lockdep_assert_held(&inode->lock);
-	inode->outstanding_extents += mod;
+	ASSERT(mod == 1 || mod == -1);
+	ASSERT(nr_extents <= S64_MAX);
+	ASSERT(mod == -1 || inode->outstanding_extents <= U64_MAX - nr_extents);
+	ASSERT(mod == 1 || inode->outstanding_extents >= nr_extents);
+	inode->outstanding_extents += delta;
 	if (btrfs_is_free_space_inode(inode))
 		return;
 	trace_btrfs_inode_mod_outstanding_extents(inode->root, btrfs_ino(inode),
-						  mod, inode->outstanding_extents);
+						  delta, inode->outstanding_extents);
 }
 
 /*
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index 2ceae1065f2c..55d0d18b5117 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -264,7 +264,7 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
  * ordered_extent.
  */
 static u64 delalloc_calc_delayed_refs_rsv(const struct btrfs_fs_info *fs_info,
-					  unsigned int nr_extents)
+					  u64 nr_extents)
 {
 	return btrfs_calc_delayed_ref_bytes(fs_info, nr_extents) +
 		btrfs_calc_insert_metadata_size(fs_info, nr_extents);
@@ -278,7 +278,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 	u64 reserve_size = 0;
 	u64 delayed_refs_size = 0;
 	u64 qgroup_rsv_size = 0;
-	unsigned outstanding_extents;
+	u64 outstanding_extents;
 
 	lockdep_assert_held(&inode->lock);
 	outstanding_extents = inode->outstanding_extents;
@@ -306,7 +306,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 	 *
 	 * This is overestimating in most cases.
 	 */
-	qgroup_rsv_size = (u64)outstanding_extents * fs_info->nodesize;
+	qgroup_rsv_size = outstanding_extents * fs_info->nodesize;
 
 	spin_lock(&block_rsv->lock);
 	block_rsv->size = reserve_size;
@@ -355,7 +355,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
 	u64 meta_reserve, delayed_refs_reserve, qgroup_reserve;
-	unsigned nr_extents;
+	u64 nr_extents;
 	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
 	int ret = 0;
 
@@ -411,7 +411,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 	 */
 	nr_extents = btrfs_inode_max_extents(inode, num_bytes);
 	spin_lock(&inode->lock);
-	btrfs_mod_outstanding_extents(inode, nr_extents);
+	btrfs_mod_outstanding_extents(inode, 1, nr_extents);
 	if (!(inode->flags & BTRFS_INODE_NODATASUM))
 		inode->csum_bytes += disk_num_bytes;
 	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
@@ -475,11 +475,11 @@ void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes,
 void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	unsigned num_extents;
+	u64 num_extents;
 
 	spin_lock(&inode->lock);
 	num_extents = btrfs_inode_max_extents(inode, num_bytes);
-	btrfs_mod_outstanding_extents(inode, -num_extents);
+	btrfs_mod_outstanding_extents(inode, -1, num_extents);
 	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
 	spin_unlock(&inode->lock);
 
@@ -493,16 +493,15 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
 void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	const u32 reserved_num_extents = btrfs_inode_max_extents(inode, reserved_len);
-	const u32 new_num_extents = btrfs_inode_max_extents(inode, new_len);
-	const int diff_num_extents = new_num_extents - reserved_num_extents;
+	const u64 reserved_num_extents = btrfs_inode_max_extents(inode, reserved_len);
+	const u64 new_num_extents = btrfs_inode_max_extents(inode, new_len);
 
 	ASSERT(new_len <= reserved_len);
 	if (new_num_extents == reserved_num_extents)
 		return;
 
 	spin_lock(&inode->lock);
-	btrfs_mod_outstanding_extents(inode, diff_num_extents);
+	btrfs_mod_outstanding_extents(inode, -1, reserved_num_extents - new_num_extents);
 	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
 	spin_unlock(&inode->lock);
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e567b23efe39..887f1a5dba9f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2536,7 +2536,7 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
 	}
 
 	spin_lock(&inode->lock);
-	btrfs_mod_outstanding_extents(inode, 1);
+	btrfs_mod_outstanding_extents(inode, 1, 1);
 	spin_unlock(&inode->lock);
 }
 
@@ -2566,7 +2566,7 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
 	/* we're not bigger than the max, unreserve the space and go */
 	if (new_size <= max_extent_size) {
 		spin_lock(&inode->lock);
-		btrfs_mod_outstanding_extents(inode, -1);
+		btrfs_mod_outstanding_extents(inode, -1, 1);
 		spin_unlock(&inode->lock);
 		return;
 	}
@@ -2597,7 +2597,7 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
 		return;
 
 	spin_lock(&inode->lock);
-	btrfs_mod_outstanding_extents(inode, -1);
+	btrfs_mod_outstanding_extents(inode, -1, 1);
 	spin_unlock(&inode->lock);
 }
 
@@ -2666,10 +2666,10 @@ void btrfs_set_delalloc_extent(struct btrfs_inode *inode, struct extent_state *s
 	if (!(state->state & EXTENT_DELALLOC) && (bits & EXTENT_DELALLOC)) {
 		u64 len = state->end + 1 - state->start;
 		u64 prev_delalloc_bytes;
-		u32 num_extents = btrfs_inode_max_extents(inode, len);
+		u64 num_extents = btrfs_inode_max_extents(inode, len);
 
 		spin_lock(&inode->lock);
-		btrfs_mod_outstanding_extents(inode, num_extents);
+		btrfs_mod_outstanding_extents(inode, 1, num_extents);
 		spin_unlock(&inode->lock);
 
 		/* For sanity tests */
@@ -2712,7 +2712,7 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	u64 len = state->end + 1 - state->start;
-	u32 num_extents = btrfs_inode_max_extents(inode, len);
+	u64 num_extents = btrfs_inode_max_extents(inode, len);
 
 	lockdep_assert_held(&inode->io_tree.lock);
 
@@ -2732,7 +2732,7 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
 		u64 new_delalloc_bytes;
 
 		spin_lock(&inode->lock);
-		btrfs_mod_outstanding_extents(inode, -num_extents);
+		btrfs_mod_outstanding_extents(inode, -1, num_extents);
 		spin_unlock(&inode->lock);
 
 		/*
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index d39f1c49d1cf..14b49cb33bb0 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -223,7 +223,7 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
 	 * smallest the extent is going to get.
 	 */
 	spin_lock(&inode->lock);
-	btrfs_mod_outstanding_extents(inode, 1);
+	btrfs_mod_outstanding_extents(inode, 1, 1);
 	spin_unlock(&inode->lock);
 
 out:
@@ -655,7 +655,7 @@ void btrfs_remove_ordered_extent(struct btrfs_ordered_extent *entry)
 	btrfs_lockdep_acquire(fs_info, btrfs_trans_pending_ordered);
 	/* This is paired with alloc_ordered_extent(). */
 	spin_lock(&btrfs_inode->lock);
-	btrfs_mod_outstanding_extents(btrfs_inode, -1);
+	btrfs_mod_outstanding_extents(btrfs_inode, -1, 1);
 	spin_unlock(&btrfs_inode->lock);
 	if (root != fs_info->tree_root) {
 		u64 release;
diff --git a/fs/btrfs/tests/inode-tests.c b/fs/btrfs/tests/inode-tests.c
index b04fbcaf0a1d..e63afbb9be2b 100644
--- a/fs/btrfs/tests/inode-tests.c
+++ b/fs/btrfs/tests/inode-tests.c
@@ -931,7 +931,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 	if (BTRFS_I(inode)->outstanding_extents != 1) {
 		ret = -EINVAL;
-		test_err("miscount, wanted 1, got %u",
+		test_err("miscount, wanted 1, got %llu",
 			 BTRFS_I(inode)->outstanding_extents);
 		goto out;
 	}
@@ -946,7 +946,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 	if (BTRFS_I(inode)->outstanding_extents != 2) {
 		ret = -EINVAL;
-		test_err("miscount, wanted 2, got %u",
+		test_err("miscount, wanted 2, got %llu",
 			 BTRFS_I(inode)->outstanding_extents);
 		goto out;
 	}
@@ -962,7 +962,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 	if (BTRFS_I(inode)->outstanding_extents != 2) {
 		ret = -EINVAL;
-		test_err("miscount, wanted 2, got %u",
+		test_err("miscount, wanted 2, got %llu",
 			 BTRFS_I(inode)->outstanding_extents);
 		goto out;
 	}
@@ -978,7 +978,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 	if (BTRFS_I(inode)->outstanding_extents != 2) {
 		ret = -EINVAL;
-		test_err("miscount, wanted 2, got %u",
+		test_err("miscount, wanted 2, got %llu",
 			 BTRFS_I(inode)->outstanding_extents);
 		goto out;
 	}
@@ -996,7 +996,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 	if (BTRFS_I(inode)->outstanding_extents != 4) {
 		ret = -EINVAL;
-		test_err("miscount, wanted 4, got %u",
+		test_err("miscount, wanted 4, got %llu",
 			 BTRFS_I(inode)->outstanding_extents);
 		goto out;
 	}
@@ -1013,7 +1013,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 	if (BTRFS_I(inode)->outstanding_extents != 3) {
 		ret = -EINVAL;
-		test_err("miscount, wanted 3, got %u",
+		test_err("miscount, wanted 3, got %llu",
 			 BTRFS_I(inode)->outstanding_extents);
 		goto out;
 	}
@@ -1029,7 +1029,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 	if (BTRFS_I(inode)->outstanding_extents != 4) {
 		ret = -EINVAL;
-		test_err("miscount, wanted 4, got %u",
+		test_err("miscount, wanted 4, got %llu",
 			 BTRFS_I(inode)->outstanding_extents);
 		goto out;
 	}
@@ -1047,7 +1047,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 	if (BTRFS_I(inode)->outstanding_extents != 3) {
 		ret = -EINVAL;
-		test_err("miscount, wanted 3, got %u",
+		test_err("miscount, wanted 3, got %llu",
 			 BTRFS_I(inode)->outstanding_extents);
 		goto out;
 	}
@@ -1061,7 +1061,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 	if (BTRFS_I(inode)->outstanding_extents) {
 		ret = -EINVAL;
-		test_err("miscount, wanted 0, got %u",
+		test_err("miscount, wanted 0, got %llu",
 			 BTRFS_I(inode)->outstanding_extents);
 		goto out;
 	}
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 8ad7a2d76c1d..caabdc8d9eed 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2003,15 +2003,15 @@ DEFINE_EVENT(btrfs__prelim_ref, btrfs_prelim_ref_insert,
 );
 
 TRACE_EVENT(btrfs_inode_mod_outstanding_extents,
-	TP_PROTO(const struct btrfs_root *root, u64 ino, int mod, unsigned outstanding),
+	TP_PROTO(const struct btrfs_root *root, u64 ino, s64 mod, u64 outstanding),
 
 	TP_ARGS(root, ino, mod, outstanding),
 
 	TP_STRUCT__entry_btrfs(
 		__field(	u64, root_objectid	)
 		__field(	u64, ino		)
-		__field(	int, mod		)
-		__field(	unsigned, outstanding	)
+		__field(	s64, mod		)
+		__field(	u64, outstanding	)
 	),
 
 	TP_fast_assign_btrfs(root->fs_info,
@@ -2021,7 +2021,7 @@ TRACE_EVENT(btrfs_inode_mod_outstanding_extents,
 		__entry->outstanding    = outstanding;
 	),
 
-	TP_printk_btrfs("root=%llu(%s) ino=%llu mod=%d outstanding=%u",
+	TP_printk_btrfs("root=%llu(%s) ino=%llu mod=%lld outstanding=%llu",
 			show_root_type(__entry->root_objectid),
 			__entry->ino, __entry->mod, __entry->outstanding)
 );
-- 
2.53.0


* [PATCH 5/5] btrfs: cap shrink_delalloc iterations to 128M
  2026-03-25  0:41 [PATCH 0/5] btrfs: improve stalls under sudden writeback Boris Burkov
                   ` (3 preceding siblings ...)
  2026-03-25  0:41 ` [PATCH 4/5] btrfs: make inode->outstanding_extents a u64 Boris Burkov
@ 2026-03-25  0:41 ` Boris Burkov
  4 siblings, 0 replies; 10+ messages in thread
From: Boris Burkov @ 2026-03-25  0:41 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Even with more accurate delayed_refs reservations, preemptive reclaim is
not perfect and we might generate tickets, especially in cases with a
very large flood of writeback outstanding.

Ultimately, if we do get into a situation with tickets pending and async
reclaim blocking the system, we want to make as much progress as
possible, as quickly as possible, to unblock tasks. We want space reclaim to be
effective, and to have a good chance at making progress, but not to
block arbitrarily as this leads to untenable syscall latencies, long
commits, and even hung task warnings.

I traced such cases of heavy writeback async reclaim hung tasks and
observed that we were blocking for long periods of time in
shrink_delalloc(). This was particularly bad when doing writeback of
incompressible data with the compress-force mount option.

e.g.
dd if=/dev/urandom of=urandom.seed bs=1G count=1
dd if=urandom.seed of=urandom.big bs=1G count=300

shrink_delalloc() computes to_reclaim as delalloc_bytes >> 3. With
hundreds of gigs of delalloc (again imagine a large dirty_ratio and lots
of ram), this is still 10-20+ GiB. Particularly in the wait phases, this
can be quite slow, and generates even more delayed-refs as mentioned in
the previous patch, so it doesn't even help that much with the immediate
space shortfall.

We do satisfy some tickets, but we ultimately keep the system in
essentially the same state, with long stalling reclaim calls into
shrink_delalloc().

It would be much better to start some good chunk of I/O and also to work
through the new delayed_refs and keep things moving through the system
while releasing the conservative over-estimated metadata reservations.

To achieve this, tighten up the delalloc work to be in units of the
maximum extent size. If we issue 128MiB of delalloc, we don't leave too
much (any?) extent merging on the table, but don't ever block on
pathological 10GiB+ chunks of delalloc. If we do detect that we
satisfied a ticket, break out of shrink_delalloc() and run some of the
new delayed_refs as well before going again. This way we strike a nice
balance of making delalloc progress, but not at the cost of every other
sort of reservation, as they all feed into each other.

This means iterating over to_reclaim by 128MiB at a time until it is
drained or we satisfy a ticket, rather than trying 3 times to do the
whole thing.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/space-info.c | 31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index e017bb182c8c..42f7d63e2464 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -729,7 +729,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
 	u64 ordered_bytes;
 	u64 items;
 	long time_left;
-	int loops;
+	u64 orig_tickets_id;
 
 	delalloc_bytes = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
 	ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);
@@ -737,9 +737,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
 		return;
 
 	/* Calc the number of the pages we need flush for space reservation */
-	if (to_reclaim == U64_MAX) {
-		items = U64_MAX;
-	} else {
+	if (to_reclaim != U64_MAX) {
 		/*
 		 * to_reclaim is set to however much metadata we need to
 		 * reclaim, but reclaiming that much data doesn't really track
@@ -753,7 +751,6 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
 		 * aggressive.
 		 */
 		to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
-		items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
 	}
 
 	trans = current->journal_info;
@@ -766,12 +763,17 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
 	if (ordered_bytes > delalloc_bytes && !for_preempt)
 		wait_ordered = true;
 
-	loops = 0;
-	while ((delalloc_bytes || ordered_bytes) && loops < 3) {
-		u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
-		long nr_pages = min_t(u64, temp, LONG_MAX);
+	spin_lock(&space_info->lock);
+	orig_tickets_id = space_info->tickets_id;
+	spin_unlock(&space_info->lock);
+
+	while ((delalloc_bytes || ordered_bytes) && to_reclaim) {
+		u64 iter_reclaim = min_t(u64, to_reclaim, SZ_128M);
+		long nr_pages = min_t(u64, delalloc_bytes, iter_reclaim) >> PAGE_SHIFT;
 		int async_pages;
 
+		items = calc_reclaim_items_nr(fs_info, iter_reclaim) * 2;
+
 		btrfs_start_delalloc_roots(fs_info, nr_pages, true);
 
 		/*
@@ -813,7 +815,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
 			   atomic_read(&fs_info->async_delalloc_pages) <=
 			   async_pages);
 skip_async:
-		loops++;
+		to_reclaim -= iter_reclaim;
 		if (wait_ordered && !trans) {
 			btrfs_wait_ordered_roots(fs_info, items, NULL);
 		} else {
@@ -836,6 +838,15 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
 			spin_unlock(&space_info->lock);
 			break;
 		}
+		/*
+		 * If a ticket was satisfied since we started, break out
+		 * so the async reclaim state machine can process delayed
+		 * refs before we flush more delalloc.
+		 */
+		if (space_info->tickets_id != orig_tickets_id) {
+			spin_unlock(&space_info->lock);
+			break;
+		}
 		spin_unlock(&space_info->lock);
 
 		delalloc_bytes = percpu_counter_sum_positive(
-- 
2.53.0


* Re: [PATCH 1/5] btrfs: reserve space for delayed_refs in delalloc
  2026-03-25  0:41 ` [PATCH 1/5] btrfs: reserve space for delayed_refs in delalloc Boris Burkov
@ 2026-03-25 15:36   ` Filipe Manana
  2026-03-25 18:39     ` Boris Burkov
  0 siblings, 1 reply; 10+ messages in thread
From: Filipe Manana @ 2026-03-25 15:36 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Wed, Mar 25, 2026 at 12:45 AM Boris Burkov <boris@bur.io> wrote:
>
> delalloc uses a per-inode block_rsv to perform metadata reservations for
> the cow operations it anticipates based on the number of outstanding
> extents. This calculation is done based on inode->outstanding_extents in
> btrfs_calculate_inode_block_rsv_size(). The reservation is *not*
> meticulously tracked as each ordered_extent is actually created in
> writeback, but rather delalloc attempts to over-estimate and the
> writeback and ordered_extent finish portions are responsible to release
> all the reservation.
>
> However, there is a notable gap in this reservation: it reserves no
> space for the resulting delayed_refs. If you compare to how
> btrfs_start_transaction() reservations work, this is a notable
> difference.
>
> As writeback actually occurs, and we trigger btrfs_finish_one_ordered(),
> that function will start generating delayed refs, which will draw from
> the trans_handle's delayed_refs_rsv via btrfs_update_delayed_refs_rsv():
>
> btrfs_finish_one_ordered()
>   insert_ordered_extent_file_extent()
>     insert_reserved_file_extent()
>       btrfs_alloc_reserved_file_extent()
>         btrfs_add_delayed_data_ref()
>           add_delayed_ref()
>             btrfs_update_delayed_refs_rsv();
>
> This trans_handle was created in finish_one_ordered() with
> btrfs_join_transaction() which calls start_transaction with
> num_items=0 and BTRFS_RESERVE_NO_FLUSH. As a result, this trans_handle
> has nothing reserved in h->delayed_rsv, as neither the num_items reservation
> nor the btrfs_delayed_refs_rsv_refill() reservation is run.
>
> Thus, when btrfs_update_delayed_refs_rsv() runs, reserved_bytes is 0 and
> fs_info->delayed_rsv->size grows but not fs_info->delayed_rsv->reserved.
>
> If a large amount of writeback happens all at once (perhaps due to
> dirty_ratio being tuned too high), this results in, among other things,
> erroneous assessments of the amount of delayed_refs reserved in the
> metadata space reclaim logic, like need_preemptive_reclaim() which
> relies on fs_info->delayed_rsv->reserved and even worse, poor decision
> making in btrfs_preempt_reclaim_metadata_space() which counts
> delalloc_bytes like so:
>
>   block_rsv_size = global_rsv_size +
>           btrfs_block_rsv_reserved(delayed_block_rsv) +
>           btrfs_block_rsv_reserved(delayed_refs_rsv) +
>           btrfs_block_rsv_reserved(trans_rsv);
>   delalloc_size = bytes_may_use - block_rsv_size;
>
> So all that lost delayed refs usage gets accounted as delalloc_size and
> leads to preemptive reclaim continuously choosing FLUSH_DELALLOC, which
> further exacerbates the problem.
>
> With enough writeback around, we can run enough delalloc that we get
> into async reclaim which starts blocking start_transaction() and
> eventually hits FLUSH_DELALLOC_WAIT/FLUSH_DELALLOC_FULL at which point
> the filesystem gets heavily blocked on metadata space in reserve_space(),
> blocking all new transaction work until all the ordered_extents finish.
>
> If we had an accurate view of the reservation for delayed refs, then we
> could mostly break this feedback loop in preemptive reclaim, and
> generally would be able to make more accurate decisions with regards to
> metadata space reclamation.
>
> This patch introduces the mechanism of a per-inode delayed_refs rsv
> which is modeled closely after the same in trans_handle. The delalloc
> reservation also reserves delayed refs and then finish_one_ordered
> transfers the inode delayed_refs rsv into the trans_handle one, just
> like inode->block_rsv.
>
> This is not a perfect fix for the most pathological cases, but is the
> infrastructure needed to keep working on the problem.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/btrfs_inode.h    |  3 +++
>  fs/btrfs/delalloc-space.c | 34 ++++++++++++++++++++++++++++++----
>  fs/btrfs/delayed-ref.c    |  2 +-
>  fs/btrfs/inode.c          |  9 ++++++++-
>  fs/btrfs/transaction.c    |  7 ++++---
>  fs/btrfs/transaction.h    |  3 ++-
>  6 files changed, 48 insertions(+), 10 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 55c272fe5d92..dca4f6df7e95 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -328,6 +328,9 @@ struct btrfs_inode {
>
>         struct btrfs_block_rsv block_rsv;
>
> +       /* Reserve for delayed refs generated by ordered extent completion. */
> +       struct btrfs_block_rsv delayed_rsv;

Not that long ago we had an effort to decrease the btrfs_inode
structure size down to less than 1024 bytes, so that we could have 4
inodes per 4K page instead of 3, and this change now makes the
structure larger than 1024 bytes again.

Instead of adding another block reserve to the inode we could:

1) Add the reservations for delayed refs in the existing block reserve
(inode->block_rsv);

2) When finishing the ordered extent, after joining the transaction
and setting trans->block to inode->block_rsv, we could migrate the
space reserved for delayed refs from the inode->block_rsv into
trans->delayed_rsv.

This would not require increasing the btrfs_inode structure, nor adding
a _local_delayed_rsv field to the transaction handle (and btw, we don't
use the _ prefix for any structure fields anywhere in btrfs).

Thanks.


> +
>         struct btrfs_delayed_node *delayed_node;
>
>         /* File creation time. */
> diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> index 0970799d0aa4..e2944ff4fe47 100644
> --- a/fs/btrfs/delalloc-space.c
> +++ b/fs/btrfs/delalloc-space.c
> @@ -3,6 +3,7 @@
>  #include "messages.h"
>  #include "ctree.h"
>  #include "delalloc-space.h"
> +#include "delayed-ref.h"
>  #include "block-rsv.h"
>  #include "btrfs_inode.h"
>  #include "space-info.h"
> @@ -240,6 +241,13 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
>         if (released > 0)
>                 trace_btrfs_space_reservation(fs_info, "delalloc",
>                                               btrfs_ino(inode), released, 0);
> +
> +       released = btrfs_block_rsv_release(fs_info, &inode->delayed_rsv,
> +                                          0, NULL);
> +       if (released > 0)
> +               trace_btrfs_space_reservation(fs_info, "delalloc_delayed_refs",
> +                                             btrfs_ino(inode), released, 0);
> +
>         if (qgroup_free)
>                 btrfs_qgroup_free_meta_prealloc(inode->root, qgroup_to_release);
>         else
> @@ -251,7 +259,9 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
>                                                  struct btrfs_inode *inode)
>  {
>         struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> +       struct btrfs_block_rsv *delayed_rsv = &inode->delayed_rsv;
>         u64 reserve_size = 0;
> +       u64 delayed_refs_size = 0;
>         u64 qgroup_rsv_size = 0;
>         unsigned outstanding_extents;
>
> @@ -266,6 +276,8 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
>                 reserve_size = btrfs_calc_insert_metadata_size(fs_info,
>                                                 outstanding_extents);
>                 reserve_size += btrfs_calc_metadata_size(fs_info, 1);
> +               delayed_refs_size += btrfs_calc_delayed_ref_bytes(fs_info,
> +                                               outstanding_extents);
>         }
>         if (!(inode->flags & BTRFS_INODE_NODATASUM)) {
>                 u64 csum_leaves;
> @@ -285,11 +297,17 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
>         block_rsv->size = reserve_size;
>         block_rsv->qgroup_rsv_size = qgroup_rsv_size;
>         spin_unlock(&block_rsv->lock);
> +
> +       spin_lock(&delayed_rsv->lock);
> +       delayed_rsv->size = delayed_refs_size;
> +       spin_unlock(&delayed_rsv->lock);
>  }
>
>  static void calc_inode_reservations(struct btrfs_inode *inode,
>                                     u64 num_bytes, u64 disk_num_bytes,
> -                                   u64 *meta_reserve, u64 *qgroup_reserve)
> +                                   u64 *meta_reserve,
> +                                   u64 *delayed_refs_reserve,
> +                                   u64 *qgroup_reserve)
>  {
>         struct btrfs_fs_info *fs_info = inode->root->fs_info;
>         u64 nr_extents = count_max_extents(fs_info, num_bytes);
> @@ -309,6 +327,10 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
>          * for an inode update.
>          */
>         *meta_reserve += inode_update;
> +
> +       *delayed_refs_reserve = btrfs_calc_delayed_ref_bytes(fs_info,
> +                                                            nr_extents);
> +
>         *qgroup_reserve = nr_extents * fs_info->nodesize;
>  }
>
> @@ -318,7 +340,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
>         struct btrfs_root *root = inode->root;
>         struct btrfs_fs_info *fs_info = root->fs_info;
>         struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> -       u64 meta_reserve, qgroup_reserve;
> +       u64 meta_reserve, delayed_refs_reserve, qgroup_reserve;
>         unsigned nr_extents;
>         enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
>         int ret = 0;
> @@ -353,12 +375,14 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
>          * over-reserve slightly, and clean up the mess when we are done.
>          */
>         calc_inode_reservations(inode, num_bytes, disk_num_bytes,
> -                               &meta_reserve, &qgroup_reserve);
> +                               &meta_reserve, &delayed_refs_reserve,
> +                               &qgroup_reserve);
>         ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_reserve, true,
>                                                  noflush);
>         if (ret)
>                 return ret;
> -       ret = btrfs_reserve_metadata_bytes(block_rsv->space_info, meta_reserve,
> +       ret = btrfs_reserve_metadata_bytes(block_rsv->space_info,
> +                                          meta_reserve + delayed_refs_reserve,
>                                            flush);
>         if (ret) {
>                 btrfs_qgroup_free_meta_prealloc(root, qgroup_reserve);
> @@ -383,6 +407,8 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
>         btrfs_block_rsv_add_bytes(block_rsv, meta_reserve, false);
>         trace_btrfs_space_reservation(root->fs_info, "delalloc",
>                                       btrfs_ino(inode), meta_reserve, 1);
> +       btrfs_block_rsv_add_bytes(&inode->delayed_rsv, delayed_refs_reserve,
> +                                 false);
>
>         spin_lock(&block_rsv->lock);
>         block_rsv->qgroup_rsv_reserved += qgroup_reserve;
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 605858c2d9a9..9fe9cec1bef3 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -89,7 +89,7 @@ void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans)
>  {
>         struct btrfs_fs_info *fs_info = trans->fs_info;
>         struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_refs_rsv;
> -       struct btrfs_block_rsv *local_rsv = &trans->delayed_rsv;
> +       struct btrfs_block_rsv *local_rsv = trans->delayed_rsv;
>         u64 num_bytes;
>         u64 reserved_bytes;
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 1a4e6a9239ae..1f0f3282e4b8 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -653,6 +653,7 @@ static noinline int __cow_file_range_inline(struct btrfs_inode *inode,
>                 goto out;
>         }
>         trans->block_rsv = &inode->block_rsv;
> +       trans->delayed_rsv = &inode->delayed_rsv;
>
>         drop_args.path = path;
>         drop_args.start = 0;
> @@ -3256,6 +3257,7 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
>         }
>
>         trans->block_rsv = &inode->block_rsv;
> +       trans->delayed_rsv = &inode->delayed_rsv;
>
>         ret = btrfs_insert_raid_extent(trans, ordered_extent);
>         if (unlikely(ret)) {
> @@ -8074,9 +8076,12 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
>
>         spin_lock_init(&ei->lock);
>         ei->outstanding_extents = 0;
> -       if (sb->s_magic != BTRFS_TEST_MAGIC)
> +       if (sb->s_magic != BTRFS_TEST_MAGIC) {
>                 btrfs_init_metadata_block_rsv(fs_info, &ei->block_rsv,
>                                               BTRFS_BLOCK_RSV_DELALLOC);
> +               btrfs_init_metadata_block_rsv(fs_info, &ei->delayed_rsv,
> +                                             BTRFS_BLOCK_RSV_DELREFS);
> +       }
>         ei->runtime_flags = 0;
>         ei->prop_compress = BTRFS_COMPRESS_NONE;
>         ei->defrag_compress = BTRFS_COMPRESS_NONE;
> @@ -8132,6 +8137,8 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
>         WARN_ON(vfs_inode->i_data.nrpages);
>         WARN_ON(inode->block_rsv.reserved);
>         WARN_ON(inode->block_rsv.size);
> +       WARN_ON(inode->delayed_rsv.reserved);
> +       WARN_ON(inode->delayed_rsv.size);
>         WARN_ON(inode->outstanding_extents);
>         if (!S_ISDIR(vfs_inode->i_mode)) {
>                 WARN_ON(inode->delalloc_bytes);
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 4358f4b63057..a55f8996cd59 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -737,7 +737,8 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
>
>         h->type = type;
>         INIT_LIST_HEAD(&h->new_bgs);
> -       btrfs_init_metadata_block_rsv(fs_info, &h->delayed_rsv, BTRFS_BLOCK_RSV_DELREFS);
> +       h->delayed_rsv = &h->_local_delayed_rsv;
> +       btrfs_init_metadata_block_rsv(fs_info, h->delayed_rsv, BTRFS_BLOCK_RSV_DELREFS);
>
>         smp_mb();
>         if (cur_trans->state >= TRANS_STATE_COMMIT_START &&
> @@ -758,7 +759,7 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
>                                                       h->transid,
>                                                       delayed_refs_bytes, 1);
>                         h->delayed_refs_bytes_reserved = delayed_refs_bytes;
> -                       btrfs_block_rsv_add_bytes(&h->delayed_rsv, delayed_refs_bytes, true);
> +                       btrfs_block_rsv_add_bytes(h->delayed_rsv, delayed_refs_bytes, true);
>                         delayed_refs_bytes = 0;
>                 }
>                 h->reloc_reserved = reloc_reserved;
> @@ -1067,7 +1068,7 @@ static void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans)
>         trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
>                                       trans->transid,
>                                       trans->delayed_refs_bytes_reserved, 0);
> -       btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
> +       btrfs_block_rsv_release(fs_info, trans->delayed_rsv,
>                                 trans->delayed_refs_bytes_reserved, NULL);
>         trans->delayed_refs_bytes_reserved = 0;
>  }
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 7d70fe486758..268a415c4f32 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -162,7 +162,8 @@ struct btrfs_trans_handle {
>         bool in_fsync;
>         struct btrfs_fs_info *fs_info;
>         struct list_head new_bgs;
> -       struct btrfs_block_rsv delayed_rsv;
> +       struct btrfs_block_rsv *delayed_rsv;
> +       struct btrfs_block_rsv _local_delayed_rsv;
>         /* Extent buffers with writeback inhibited by this handle. */
>         struct xarray writeback_inhibited_ebs;
>  };
> --
> 2.53.0
>
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/5] btrfs: reserve space for delayed_refs in delalloc
  2026-03-25 15:36   ` Filipe Manana
@ 2026-03-25 18:39     ` Boris Burkov
  2026-03-25 18:55       ` Filipe Manana
  0 siblings, 1 reply; 10+ messages in thread
From: Boris Burkov @ 2026-03-25 18:39 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, kernel-team

On Wed, Mar 25, 2026 at 03:36:21PM +0000, Filipe Manana wrote:
> On Wed, Mar 25, 2026 at 12:45 AM Boris Burkov <boris@bur.io> wrote:
> >
> > delalloc uses a per-inode block_rsv to perform metadata reservations for
> > the cow operations it anticipates based on the number of outstanding
> > extents. This calculation is done based on inode->outstanding_extents in
> > btrfs_calculate_inode_block_rsv_size(). The reservation is *not*
> > meticulously tracked as each ordered_extent is actually created in
> > writeback, but rather delalloc attempts to over-estimate and the
> > writeback and ordered_extent finish portions are responsible for
> > releasing all the reservation.
> >
> > However, there is a notable gap in this reservation: it reserves no
> > space for the resulting delayed_refs. Compared to how
> > btrfs_start_transaction() reservations work, this is a notable
> > difference.
> >
> > As writeback actually occurs, and we trigger btrfs_finish_one_ordered(),
> > that function will start generating delayed refs, which will draw from
> > the trans_handle's delayed_refs_rsv via btrfs_update_delayed_refs_rsv():
> >
> > btrfs_finish_one_ordered()
> >   insert_ordered_extent_file_extent()
> >     insert_reserved_file_extent()
> >       btrfs_alloc_reserved_file_extent()
> >         btrfs_add_delayed_data_ref()
> >           add_delayed_ref()
> >             btrfs_update_delayed_refs_rsv();
> >
> > This trans_handle was created in finish_one_ordered() with
> > btrfs_join_transaction() which calls start_transaction with
> > num_items=0 and BTRFS_RESERVE_NO_FLUSH. As a result, this trans_handle
> > has nothing reserved in h->delayed_rsv, as neither the num_items reservation
> > nor the btrfs_delayed_refs_rsv_refill() reservation is run.
> >
> > Thus, when btrfs_update_delayed_refs_rsv() runs, reserved_bytes is 0 and
> > fs_info->delayed_rsv->size grows but not fs_info->delayed_rsv->reserved.
> >
> > If a large amount of writeback happens all at once (perhaps due to
> > dirty_ratio being tuned too high), this results in, among other things,
> > erroneous assessments of the amount of delayed_refs reserved in the
> > metadata space reclaim logic, like need_preemptive_reclaim() which
> > relies on fs_info->delayed_rsv->reserved and even worse, poor decision
> > making in btrfs_preempt_reclaim_metadata_space() which counts
> > delalloc_bytes like so:
> >
> >   block_rsv_size = global_rsv_size +
> >           btrfs_block_rsv_reserved(delayed_block_rsv) +
> >           btrfs_block_rsv_reserved(delayed_refs_rsv) +
> >           btrfs_block_rsv_reserved(trans_rsv);
> >   delalloc_size = bytes_may_use - block_rsv_size;
> >
> > So all that lost delayed refs usage gets accounted as delalloc_size and
> > leads to preemptive reclaim continuously choosing FLUSH_DELALLOC, which
> > further exacerbates the problem.
> >
> > With enough writeback around, we can run enough delalloc that we get
> > into async reclaim which starts blocking start_transaction() and
> > eventually hits FLUSH_DELALLOC_WAIT/FLUSH_DELALLOC_FULL at which point
> > the filesystem gets heavily blocked on metadata space in reserve_space(),
> > blocking all new transaction work until all the ordered_extents finish.
> >
> > If we had an accurate view of the reservation for delayed refs, then we
> > could mostly break this feedback loop in preemptive reclaim, and
> > generally would be able to make more accurate decisions with regards to
> > metadata space reclamation.
> >
> > This patch introduces the mechanism of a per-inode delayed_refs rsv
> > which is closely modeled on the one in trans_handle. The delalloc
> > reservation also reserves delayed refs and then finish_one_ordered
> > transfers the inode delayed_refs rsv into the trans_handle one, just
> > like inode->block_rsv.
> >
> > This is not a perfect fix for the most pathological cases, but is the
> > infrastructure needed to keep working on the problem.
> >
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/btrfs_inode.h    |  3 +++
> >  fs/btrfs/delalloc-space.c | 34 ++++++++++++++++++++++++++++++----
> >  fs/btrfs/delayed-ref.c    |  2 +-
> >  fs/btrfs/inode.c          |  9 ++++++++-
> >  fs/btrfs/transaction.c    |  7 ++++---
> >  fs/btrfs/transaction.h    |  3 ++-
> >  6 files changed, 48 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> > index 55c272fe5d92..dca4f6df7e95 100644
> > --- a/fs/btrfs/btrfs_inode.h
> > +++ b/fs/btrfs/btrfs_inode.h
> > @@ -328,6 +328,9 @@ struct btrfs_inode {
> >
> >         struct btrfs_block_rsv block_rsv;
> >
> > +       /* Reserve for delayed refs generated by ordered extent completion. */
> > +       struct btrfs_block_rsv delayed_rsv;
> 
> Not that long ago we had an effort to decrease the btrfs_inode
> structure size down to less than 1024 bytes, so that we could have 4
> inodes per 4K page instead of 3, and this change now makes the
> structure larger than 1024 bytes again.

Good catch, thanks. And sorry for missing it.

> 
> Instead of adding another block reserve to the inode we could:
> 
> 1) Add the reservations for delayed refs in the existing block reserve
> (inode->block_rsv);
> 
> 2) When finishing the ordered extent, after joining the transaction
> and setting trans->block_rsv to inode->block_rsv, we could migrate the
> space reserved for delayed refs from the inode->block_rsv into
> trans->delayed_rsv.

At first blush, without trying very hard yet, I don't love this because I
think it means tracking or re-computing the delayed_refs portion of
inode->block_rsv for the migration. However, I'm sure I can make it work
and it's certainly worth doing to not regress the size of struct
btrfs_inode.

What do you think of just allocating inode->delayed_rsv indirectly with
btrfs_alloc_block_rsv? We can save another ~100 bytes on struct
btrfs_inode. And we could do it for inode->block_rsv to save yet more
space. Do you know from experience whether it's really important for
inode->block_rsv to be embedded in struct btrfs_inode?

If that's a no-go, I will work on using just inode->block_rsv and doing
the migration.

> 
> This would not require increasing the btrfs_inode structure and
> neither add a _local_delayed_rsv field to the transaction handle (and
> btw, we don't use the _ prefix for any structure fields anywhere in
> btrfs).

Noted, thanks.

> 
> Thanks.
> 
> 
> > +
> >         struct btrfs_delayed_node *delayed_node;
> >
> >         /* File creation time. */
> > diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> > index 0970799d0aa4..e2944ff4fe47 100644
> > --- a/fs/btrfs/delalloc-space.c
> > +++ b/fs/btrfs/delalloc-space.c
> > @@ -3,6 +3,7 @@
> >  #include "messages.h"
> >  #include "ctree.h"
> >  #include "delalloc-space.h"
> > +#include "delayed-ref.h"
> >  #include "block-rsv.h"
> >  #include "btrfs_inode.h"
> >  #include "space-info.h"
> > @@ -240,6 +241,13 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
> >         if (released > 0)
> >                 trace_btrfs_space_reservation(fs_info, "delalloc",
> >                                               btrfs_ino(inode), released, 0);
> > +
> > +       released = btrfs_block_rsv_release(fs_info, &inode->delayed_rsv,
> > +                                          0, NULL);
> > +       if (released > 0)
> > +               trace_btrfs_space_reservation(fs_info, "delalloc_delayed_refs",
> > +                                             btrfs_ino(inode), released, 0);
> > +
> >         if (qgroup_free)
> >                 btrfs_qgroup_free_meta_prealloc(inode->root, qgroup_to_release);
> >         else
> > @@ -251,7 +259,9 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> >                                                  struct btrfs_inode *inode)
> >  {
> >         struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> > +       struct btrfs_block_rsv *delayed_rsv = &inode->delayed_rsv;
> >         u64 reserve_size = 0;
> > +       u64 delayed_refs_size = 0;
> >         u64 qgroup_rsv_size = 0;
> >         unsigned outstanding_extents;
> >
> > @@ -266,6 +276,8 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> >                 reserve_size = btrfs_calc_insert_metadata_size(fs_info,
> >                                                 outstanding_extents);
> >                 reserve_size += btrfs_calc_metadata_size(fs_info, 1);
> > +               delayed_refs_size += btrfs_calc_delayed_ref_bytes(fs_info,
> > +                                               outstanding_extents);
> >         }
> >         if (!(inode->flags & BTRFS_INODE_NODATASUM)) {
> >                 u64 csum_leaves;
> > @@ -285,11 +297,17 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> >         block_rsv->size = reserve_size;
> >         block_rsv->qgroup_rsv_size = qgroup_rsv_size;
> >         spin_unlock(&block_rsv->lock);
> > +
> > +       spin_lock(&delayed_rsv->lock);
> > +       delayed_rsv->size = delayed_refs_size;
> > +       spin_unlock(&delayed_rsv->lock);
> >  }
> >
> >  static void calc_inode_reservations(struct btrfs_inode *inode,
> >                                     u64 num_bytes, u64 disk_num_bytes,
> > -                                   u64 *meta_reserve, u64 *qgroup_reserve)
> > +                                   u64 *meta_reserve,
> > +                                   u64 *delayed_refs_reserve,
> > +                                   u64 *qgroup_reserve)
> >  {
> >         struct btrfs_fs_info *fs_info = inode->root->fs_info;
> >         u64 nr_extents = count_max_extents(fs_info, num_bytes);
> > @@ -309,6 +327,10 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
> >          * for an inode update.
> >          */
> >         *meta_reserve += inode_update;
> > +
> > +       *delayed_refs_reserve = btrfs_calc_delayed_ref_bytes(fs_info,
> > +                                                            nr_extents);
> > +
> >         *qgroup_reserve = nr_extents * fs_info->nodesize;
> >  }
> >
> > @@ -318,7 +340,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> >         struct btrfs_root *root = inode->root;
> >         struct btrfs_fs_info *fs_info = root->fs_info;
> >         struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> > -       u64 meta_reserve, qgroup_reserve;
> > +       u64 meta_reserve, delayed_refs_reserve, qgroup_reserve;
> >         unsigned nr_extents;
> >         enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
> >         int ret = 0;
> > @@ -353,12 +375,14 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> >          * over-reserve slightly, and clean up the mess when we are done.
> >          */
> >         calc_inode_reservations(inode, num_bytes, disk_num_bytes,
> > -                               &meta_reserve, &qgroup_reserve);
> > +                               &meta_reserve, &delayed_refs_reserve,
> > +                               &qgroup_reserve);
> >         ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_reserve, true,
> >                                                  noflush);
> >         if (ret)
> >                 return ret;
> > -       ret = btrfs_reserve_metadata_bytes(block_rsv->space_info, meta_reserve,
> > +       ret = btrfs_reserve_metadata_bytes(block_rsv->space_info,
> > +                                          meta_reserve + delayed_refs_reserve,
> >                                            flush);
> >         if (ret) {
> >                 btrfs_qgroup_free_meta_prealloc(root, qgroup_reserve);
> > @@ -383,6 +407,8 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> >         btrfs_block_rsv_add_bytes(block_rsv, meta_reserve, false);
> >         trace_btrfs_space_reservation(root->fs_info, "delalloc",
> >                                       btrfs_ino(inode), meta_reserve, 1);
> > +       btrfs_block_rsv_add_bytes(&inode->delayed_rsv, delayed_refs_reserve,
> > +                                 false);
> >
> >         spin_lock(&block_rsv->lock);
> >         block_rsv->qgroup_rsv_reserved += qgroup_reserve;
> > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> > index 605858c2d9a9..9fe9cec1bef3 100644
> > --- a/fs/btrfs/delayed-ref.c
> > +++ b/fs/btrfs/delayed-ref.c
> > @@ -89,7 +89,7 @@ void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans)
> >  {
> >         struct btrfs_fs_info *fs_info = trans->fs_info;
> >         struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_refs_rsv;
> > -       struct btrfs_block_rsv *local_rsv = &trans->delayed_rsv;
> > +       struct btrfs_block_rsv *local_rsv = trans->delayed_rsv;
> >         u64 num_bytes;
> >         u64 reserved_bytes;
> >
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 1a4e6a9239ae..1f0f3282e4b8 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -653,6 +653,7 @@ static noinline int __cow_file_range_inline(struct btrfs_inode *inode,
> >                 goto out;
> >         }
> >         trans->block_rsv = &inode->block_rsv;
> > +       trans->delayed_rsv = &inode->delayed_rsv;
> >
> >         drop_args.path = path;
> >         drop_args.start = 0;
> > @@ -3256,6 +3257,7 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
> >         }
> >
> >         trans->block_rsv = &inode->block_rsv;
> > +       trans->delayed_rsv = &inode->delayed_rsv;
> >
> >         ret = btrfs_insert_raid_extent(trans, ordered_extent);
> >         if (unlikely(ret)) {
> > @@ -8074,9 +8076,12 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
> >
> >         spin_lock_init(&ei->lock);
> >         ei->outstanding_extents = 0;
> > -       if (sb->s_magic != BTRFS_TEST_MAGIC)
> > +       if (sb->s_magic != BTRFS_TEST_MAGIC) {
> >                 btrfs_init_metadata_block_rsv(fs_info, &ei->block_rsv,
> >                                               BTRFS_BLOCK_RSV_DELALLOC);
> > +               btrfs_init_metadata_block_rsv(fs_info, &ei->delayed_rsv,
> > +                                             BTRFS_BLOCK_RSV_DELREFS);
> > +       }
> >         ei->runtime_flags = 0;
> >         ei->prop_compress = BTRFS_COMPRESS_NONE;
> >         ei->defrag_compress = BTRFS_COMPRESS_NONE;
> > @@ -8132,6 +8137,8 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
> >         WARN_ON(vfs_inode->i_data.nrpages);
> >         WARN_ON(inode->block_rsv.reserved);
> >         WARN_ON(inode->block_rsv.size);
> > +       WARN_ON(inode->delayed_rsv.reserved);
> > +       WARN_ON(inode->delayed_rsv.size);
> >         WARN_ON(inode->outstanding_extents);
> >         if (!S_ISDIR(vfs_inode->i_mode)) {
> >                 WARN_ON(inode->delalloc_bytes);
> > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > index 4358f4b63057..a55f8996cd59 100644
> > --- a/fs/btrfs/transaction.c
> > +++ b/fs/btrfs/transaction.c
> > @@ -737,7 +737,8 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
> >
> >         h->type = type;
> >         INIT_LIST_HEAD(&h->new_bgs);
> > -       btrfs_init_metadata_block_rsv(fs_info, &h->delayed_rsv, BTRFS_BLOCK_RSV_DELREFS);
> > +       h->delayed_rsv = &h->_local_delayed_rsv;
> > +       btrfs_init_metadata_block_rsv(fs_info, h->delayed_rsv, BTRFS_BLOCK_RSV_DELREFS);
> >
> >         smp_mb();
> >         if (cur_trans->state >= TRANS_STATE_COMMIT_START &&
> > @@ -758,7 +759,7 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
> >                                                       h->transid,
> >                                                       delayed_refs_bytes, 1);
> >                         h->delayed_refs_bytes_reserved = delayed_refs_bytes;
> > -                       btrfs_block_rsv_add_bytes(&h->delayed_rsv, delayed_refs_bytes, true);
> > +                       btrfs_block_rsv_add_bytes(h->delayed_rsv, delayed_refs_bytes, true);
> >                         delayed_refs_bytes = 0;
> >                 }
> >                 h->reloc_reserved = reloc_reserved;
> > @@ -1067,7 +1068,7 @@ static void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans)
> >         trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
> >                                       trans->transid,
> >                                       trans->delayed_refs_bytes_reserved, 0);
> > -       btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
> > +       btrfs_block_rsv_release(fs_info, trans->delayed_rsv,
> >                                 trans->delayed_refs_bytes_reserved, NULL);
> >         trans->delayed_refs_bytes_reserved = 0;
> >  }
> > diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> > index 7d70fe486758..268a415c4f32 100644
> > --- a/fs/btrfs/transaction.h
> > +++ b/fs/btrfs/transaction.h
> > @@ -162,7 +162,8 @@ struct btrfs_trans_handle {
> >         bool in_fsync;
> >         struct btrfs_fs_info *fs_info;
> >         struct list_head new_bgs;
> > -       struct btrfs_block_rsv delayed_rsv;
> > +       struct btrfs_block_rsv *delayed_rsv;
> > +       struct btrfs_block_rsv _local_delayed_rsv;
> >         /* Extent buffers with writeback inhibited by this handle. */
> >         struct xarray writeback_inhibited_ebs;
> >  };
> > --
> > 2.53.0
> >
> >


* Re: [PATCH 1/5] btrfs: reserve space for delayed_refs in delalloc
  2026-03-25 18:39     ` Boris Burkov
@ 2026-03-25 18:55       ` Filipe Manana
  2026-03-25 22:24         ` Boris Burkov
  0 siblings, 1 reply; 10+ messages in thread
From: Filipe Manana @ 2026-03-25 18:55 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Wed, Mar 25, 2026 at 6:39 PM Boris Burkov <boris@bur.io> wrote:
>
> On Wed, Mar 25, 2026 at 03:36:21PM +0000, Filipe Manana wrote:
> > On Wed, Mar 25, 2026 at 12:45 AM Boris Burkov <boris@bur.io> wrote:
> > >
> > > delalloc uses a per-inode block_rsv to perform metadata reservations for
> > > the cow operations it anticipates based on the number of outstanding
> > > extents. This calculation is done based on inode->outstanding_extents in
> > > btrfs_calculate_inode_block_rsv_size(). The reservation is *not*
> > > meticulously tracked as each ordered_extent is actually created in
> > > writeback, but rather delalloc attempts to over-estimate and the
> > > writeback and ordered_extent finish portions are responsible to release
> > > all the reservation.
> > >
> > > However, there is a notable gap in this reservation: it reserves no
> > > space for the resulting delayed_refs. Compared to how
> > > btrfs_start_transaction() reservations work, this is a notable
> > > difference.
> > >
> > > As writeback actually occurs, and we trigger btrfs_finish_one_ordered(),
> > > that function will start generating delayed refs, which will draw from
> > > the trans_handle's delayed_refs_rsv via btrfs_update_delayed_refs_rsv():
> > >
> > > btrfs_finish_one_ordered()
> > >   insert_ordered_extent_file_extent()
> > >     insert_reserved_file_extent()
> > >       btrfs_alloc_reserved_file_extent()
> > >         btrfs_add_delayed_data_ref()
> > >           add_delayed_ref()
> > >             btrfs_update_delayed_refs_rsv();
> > >
> > > This trans_handle was created in finish_one_ordered() with
> > > btrfs_join_transaction() which calls start_transaction with
> > > num_items=0 and BTRFS_RESERVE_NO_FLUSH. As a result, this trans_handle
> > > has nothing reserved in h->delayed_rsv, as neither the num_items reservation
> > > nor the btrfs_delayed_refs_rsv_refill() reservation is run.
> > >
> > > Thus, when btrfs_update_delayed_refs_rsv() runs, reserved_bytes is 0 and
> > > fs_info->delayed_rsv->size grows but not fs_info->delayed_rsv->reserved.
> > >
> > > If a large amount of writeback happens all at once (perhaps due to
> > > dirty_ratio being tuned too high), this results in, among other things,
> > > erroneous assessments of the amount of delayed_refs reserved in the
> > > metadata space reclaim logic, like need_preemptive_reclaim() which
> > > relies on fs_info->delayed_rsv->reserved and even worse, poor decision
> > > making in btrfs_preempt_reclaim_metadata_space() which counts
> > > delalloc_bytes like so:
> > >
> > >   block_rsv_size = global_rsv_size +
> > >           btrfs_block_rsv_reserved(delayed_block_rsv) +
> > >           btrfs_block_rsv_reserved(delayed_refs_rsv) +
> > >           btrfs_block_rsv_reserved(trans_rsv);
> > >   delalloc_size = bytes_may_use - block_rsv_size;
> > >
> > > So all that lost delayed refs usage gets accounted as delalloc_size and
> > > leads to preemptive reclaim continuously choosing FLUSH_DELALLOC, which
> > > further exacerbates the problem.
> > >
> > > With enough writeback around, we can run enough delalloc that we get
> > > into async reclaim which starts blocking start_transaction() and
> > > eventually hits FLUSH_DELALLOC_WAIT/FLUSH_DELALLOC_FULL at which point
> > > the filesystem gets heavily blocked on metadata space in reserve_space(),
> > > blocking all new transaction work until all the ordered_extents finish.
> > >
> > > If we had an accurate view of the reservation for delayed refs, then we
> > > could mostly break this feedback loop in preemptive reclaim, and
> > > generally would be able to make more accurate decisions with regards to
> > > metadata space reclamation.
> > >
> > > This patch introduces the mechanism of a per-inode delayed_refs rsv
> > > which is closely modeled on the one in trans_handle. The delalloc
> > > reservation also reserves delayed refs and then finish_one_ordered
> > > transfers the inode delayed_refs rsv into the trans_handle one, just
> > > like inode->block_rsv.
> > >
> > > This is not a perfect fix for the most pathological cases, but is the
> > > infrastructure needed to keep working on the problem.
> > >
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > ---
> > >  fs/btrfs/btrfs_inode.h    |  3 +++
> > >  fs/btrfs/delalloc-space.c | 34 ++++++++++++++++++++++++++++++----
> > >  fs/btrfs/delayed-ref.c    |  2 +-
> > >  fs/btrfs/inode.c          |  9 ++++++++-
> > >  fs/btrfs/transaction.c    |  7 ++++---
> > >  fs/btrfs/transaction.h    |  3 ++-
> > >  6 files changed, 48 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> > > index 55c272fe5d92..dca4f6df7e95 100644
> > > --- a/fs/btrfs/btrfs_inode.h
> > > +++ b/fs/btrfs/btrfs_inode.h
> > > @@ -328,6 +328,9 @@ struct btrfs_inode {
> > >
> > >         struct btrfs_block_rsv block_rsv;
> > >
> > > +       /* Reserve for delayed refs generated by ordered extent completion. */
> > > +       struct btrfs_block_rsv delayed_rsv;
> >
> > Not that long ago we had an effort to decrease the btrfs_inode
> > structure size down to less than 1024 bytes, so that we could have 4
> > inodes per 4K page instead of 3, and this change now makes the
> > structure larger than 1024 bytes again.
>
> Good catch, thanks. And sorry for missing it.
>
> >
> > Instead of adding another block reserve to the inode we could:
> >
> > 1) Add the reservations for delayed refs in the existing block reserve
> > (inode->block_rsv);
> >
> > 2) When finishing the ordered extent, after joining the transaction
> > and setting trans->block_rsv to inode->block_rsv, we could migrate the
> > space reserved for delayed refs from the inode->block_rsv into
> > trans->delayed_rsv.
>
> At first blush, without trying very hard yet, I don't love this because I
> think it means tracking or re-computing the delayed_refs portion of
> inode->block_rsv for the migration.

Re-computing (calling btrfs_calc_delayed_ref_bytes()) is not
expensive, and it's done in a workqueue when finishing ordered extents.
I wouldn't worry about that. But tracking it somewhere implies
increasing the structure size - if it stays under or at 1024 bytes,
it's fine.

> However, I'm sure I can make it work
> and it's certainly worth doing to not regress the size of struct
> btrfs_inode.
>
> What do you think of just allocating inode->delayed_rsv indirectly with
> btrfs_alloc_block_rsv? We can save another ~100 bytes on struct
> btrfs_inode.

Adding an allocation adds potential for -ENOMEM failures.
And for 4K pages we will still be able to allocate 4 inodes per page,
so no gains here.
For a 64K page we can get a few more inodes (7 or 8).

>  And we could do it for inode->block_rsv to save yet more
> space. Do you know from experience whether it's really important for
> inode->block_rsv to be embedded in struct btrfs_inode?

The reason we have it embedded is that it saves a memory allocation,
and it may also be faster thanks to better cache line spatial
locality.
In the end we won't get any benefit unless the page size is > 4K, so
that we can get more inodes per page.

>
> If that's a no-go, I will work on using just inode->block_rsv and doing
> the migration.
>
> >
> > This would not require increasing the btrfs_inode structure and
> > neither add a _local_delayed_rsv field to the transaction handle (and
> > btw, we don't use the _ prefix for any structure fields anywhere in
> > btrfs).
>
> Noted, thanks.
>
> >
> > Thanks.
> >
> >
> > > +
> > >         struct btrfs_delayed_node *delayed_node;
> > >
> > >         /* File creation time. */
> > > diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> > > index 0970799d0aa4..e2944ff4fe47 100644
> > > --- a/fs/btrfs/delalloc-space.c
> > > +++ b/fs/btrfs/delalloc-space.c
> > > @@ -3,6 +3,7 @@
> > >  #include "messages.h"
> > >  #include "ctree.h"
> > >  #include "delalloc-space.h"
> > > +#include "delayed-ref.h"
> > >  #include "block-rsv.h"
> > >  #include "btrfs_inode.h"
> > >  #include "space-info.h"
> > > @@ -240,6 +241,13 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
> > >         if (released > 0)
> > >                 trace_btrfs_space_reservation(fs_info, "delalloc",
> > >                                               btrfs_ino(inode), released, 0);
> > > +
> > > +       released = btrfs_block_rsv_release(fs_info, &inode->delayed_rsv,
> > > +                                          0, NULL);
> > > +       if (released > 0)
> > > +               trace_btrfs_space_reservation(fs_info, "delalloc_delayed_refs",
> > > +                                             btrfs_ino(inode), released, 0);
> > > +
> > >         if (qgroup_free)
> > >                 btrfs_qgroup_free_meta_prealloc(inode->root, qgroup_to_release);
> > >         else
> > > @@ -251,7 +259,9 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> > >                                                  struct btrfs_inode *inode)
> > >  {
> > >         struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> > > +       struct btrfs_block_rsv *delayed_rsv = &inode->delayed_rsv;
> > >         u64 reserve_size = 0;
> > > +       u64 delayed_refs_size = 0;
> > >         u64 qgroup_rsv_size = 0;
> > >         unsigned outstanding_extents;
> > >
> > > @@ -266,6 +276,8 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> > >                 reserve_size = btrfs_calc_insert_metadata_size(fs_info,
> > >                                                 outstanding_extents);
> > >                 reserve_size += btrfs_calc_metadata_size(fs_info, 1);
> > > +               delayed_refs_size += btrfs_calc_delayed_ref_bytes(fs_info,
> > > +                                               outstanding_extents);
> > >         }
> > >         if (!(inode->flags & BTRFS_INODE_NODATASUM)) {
> > >                 u64 csum_leaves;
> > > @@ -285,11 +297,17 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> > >         block_rsv->size = reserve_size;
> > >         block_rsv->qgroup_rsv_size = qgroup_rsv_size;
> > >         spin_unlock(&block_rsv->lock);
> > > +
> > > +       spin_lock(&delayed_rsv->lock);
> > > +       delayed_rsv->size = delayed_refs_size;
> > > +       spin_unlock(&delayed_rsv->lock);
> > >  }
> > >
> > >  static void calc_inode_reservations(struct btrfs_inode *inode,
> > >                                     u64 num_bytes, u64 disk_num_bytes,
> > > -                                   u64 *meta_reserve, u64 *qgroup_reserve)
> > > +                                   u64 *meta_reserve,
> > > +                                   u64 *delayed_refs_reserve,
> > > +                                   u64 *qgroup_reserve)
> > >  {
> > >         struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > >         u64 nr_extents = count_max_extents(fs_info, num_bytes);
> > > @@ -309,6 +327,10 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
> > >          * for an inode update.
> > >          */
> > >         *meta_reserve += inode_update;
> > > +
> > > +       *delayed_refs_reserve = btrfs_calc_delayed_ref_bytes(fs_info,
> > > +                                                            nr_extents);
> > > +
> > >         *qgroup_reserve = nr_extents * fs_info->nodesize;
> > >  }
> > >
> > > @@ -318,7 +340,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> > >         struct btrfs_root *root = inode->root;
> > >         struct btrfs_fs_info *fs_info = root->fs_info;
> > >         struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> > > -       u64 meta_reserve, qgroup_reserve;
> > > +       u64 meta_reserve, delayed_refs_reserve, qgroup_reserve;
> > >         unsigned nr_extents;
> > >         enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
> > >         int ret = 0;
> > > @@ -353,12 +375,14 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> > >          * over-reserve slightly, and clean up the mess when we are done.
> > >          */
> > >         calc_inode_reservations(inode, num_bytes, disk_num_bytes,
> > > -                               &meta_reserve, &qgroup_reserve);
> > > +                               &meta_reserve, &delayed_refs_reserve,
> > > +                               &qgroup_reserve);
> > >         ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_reserve, true,
> > >                                                  noflush);
> > >         if (ret)
> > >                 return ret;
> > > -       ret = btrfs_reserve_metadata_bytes(block_rsv->space_info, meta_reserve,
> > > +       ret = btrfs_reserve_metadata_bytes(block_rsv->space_info,
> > > +                                          meta_reserve + delayed_refs_reserve,
> > >                                            flush);
> > >         if (ret) {
> > >                 btrfs_qgroup_free_meta_prealloc(root, qgroup_reserve);
> > > @@ -383,6 +407,8 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> > >         btrfs_block_rsv_add_bytes(block_rsv, meta_reserve, false);
> > >         trace_btrfs_space_reservation(root->fs_info, "delalloc",
> > >                                       btrfs_ino(inode), meta_reserve, 1);
> > > +       btrfs_block_rsv_add_bytes(&inode->delayed_rsv, delayed_refs_reserve,
> > > +                                 false);
> > >
> > >         spin_lock(&block_rsv->lock);
> > >         block_rsv->qgroup_rsv_reserved += qgroup_reserve;
> > > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> > > index 605858c2d9a9..9fe9cec1bef3 100644
> > > --- a/fs/btrfs/delayed-ref.c
> > > +++ b/fs/btrfs/delayed-ref.c
> > > @@ -89,7 +89,7 @@ void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans)
> > >  {
> > >         struct btrfs_fs_info *fs_info = trans->fs_info;
> > >         struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_refs_rsv;
> > > -       struct btrfs_block_rsv *local_rsv = &trans->delayed_rsv;
> > > +       struct btrfs_block_rsv *local_rsv = trans->delayed_rsv;
> > >         u64 num_bytes;
> > >         u64 reserved_bytes;
> > >
> > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > index 1a4e6a9239ae..1f0f3282e4b8 100644
> > > --- a/fs/btrfs/inode.c
> > > +++ b/fs/btrfs/inode.c
> > > @@ -653,6 +653,7 @@ static noinline int __cow_file_range_inline(struct btrfs_inode *inode,
> > >                 goto out;
> > >         }
> > >         trans->block_rsv = &inode->block_rsv;
> > > +       trans->delayed_rsv = &inode->delayed_rsv;
> > >
> > >         drop_args.path = path;
> > >         drop_args.start = 0;
> > > @@ -3256,6 +3257,7 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
> > >         }
> > >
> > >         trans->block_rsv = &inode->block_rsv;
> > > +       trans->delayed_rsv = &inode->delayed_rsv;
> > >
> > >         ret = btrfs_insert_raid_extent(trans, ordered_extent);
> > >         if (unlikely(ret)) {
> > > @@ -8074,9 +8076,12 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
> > >
> > >         spin_lock_init(&ei->lock);
> > >         ei->outstanding_extents = 0;
> > > -       if (sb->s_magic != BTRFS_TEST_MAGIC)
> > > +       if (sb->s_magic != BTRFS_TEST_MAGIC) {
> > >                 btrfs_init_metadata_block_rsv(fs_info, &ei->block_rsv,
> > >                                               BTRFS_BLOCK_RSV_DELALLOC);
> > > +               btrfs_init_metadata_block_rsv(fs_info, &ei->delayed_rsv,
> > > +                                             BTRFS_BLOCK_RSV_DELREFS);
> > > +       }
> > >         ei->runtime_flags = 0;
> > >         ei->prop_compress = BTRFS_COMPRESS_NONE;
> > >         ei->defrag_compress = BTRFS_COMPRESS_NONE;
> > > @@ -8132,6 +8137,8 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
> > >         WARN_ON(vfs_inode->i_data.nrpages);
> > >         WARN_ON(inode->block_rsv.reserved);
> > >         WARN_ON(inode->block_rsv.size);
> > > +       WARN_ON(inode->delayed_rsv.reserved);
> > > +       WARN_ON(inode->delayed_rsv.size);
> > >         WARN_ON(inode->outstanding_extents);
> > >         if (!S_ISDIR(vfs_inode->i_mode)) {
> > >                 WARN_ON(inode->delalloc_bytes);
> > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > > index 4358f4b63057..a55f8996cd59 100644
> > > --- a/fs/btrfs/transaction.c
> > > +++ b/fs/btrfs/transaction.c
> > > @@ -737,7 +737,8 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
> > >
> > >         h->type = type;
> > >         INIT_LIST_HEAD(&h->new_bgs);
> > > -       btrfs_init_metadata_block_rsv(fs_info, &h->delayed_rsv, BTRFS_BLOCK_RSV_DELREFS);
> > > +       h->delayed_rsv = &h->_local_delayed_rsv;
> > > +       btrfs_init_metadata_block_rsv(fs_info, h->delayed_rsv, BTRFS_BLOCK_RSV_DELREFS);
> > >
> > >         smp_mb();
> > >         if (cur_trans->state >= TRANS_STATE_COMMIT_START &&
> > > @@ -758,7 +759,7 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
> > >                                                       h->transid,
> > >                                                       delayed_refs_bytes, 1);
> > >                         h->delayed_refs_bytes_reserved = delayed_refs_bytes;
> > > -                       btrfs_block_rsv_add_bytes(&h->delayed_rsv, delayed_refs_bytes, true);
> > > +                       btrfs_block_rsv_add_bytes(h->delayed_rsv, delayed_refs_bytes, true);
> > >                         delayed_refs_bytes = 0;
> > >                 }
> > >                 h->reloc_reserved = reloc_reserved;
> > > @@ -1067,7 +1068,7 @@ static void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans)
> > >         trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
> > >                                       trans->transid,
> > >                                       trans->delayed_refs_bytes_reserved, 0);
> > > -       btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
> > > +       btrfs_block_rsv_release(fs_info, trans->delayed_rsv,
> > >                                 trans->delayed_refs_bytes_reserved, NULL);
> > >         trans->delayed_refs_bytes_reserved = 0;
> > >  }
> > > diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> > > index 7d70fe486758..268a415c4f32 100644
> > > --- a/fs/btrfs/transaction.h
> > > +++ b/fs/btrfs/transaction.h
> > > @@ -162,7 +162,8 @@ struct btrfs_trans_handle {
> > >         bool in_fsync;
> > >         struct btrfs_fs_info *fs_info;
> > >         struct list_head new_bgs;
> > > -       struct btrfs_block_rsv delayed_rsv;
> > > +       struct btrfs_block_rsv *delayed_rsv;
> > > +       struct btrfs_block_rsv _local_delayed_rsv;
> > >         /* Extent buffers with writeback inhibited by this handle. */
> > >         struct xarray writeback_inhibited_ebs;
> > >  };
> > > --
> > > 2.53.0
> > >
> > >

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/5] btrfs: reserve space for delayed_refs in delalloc
  2026-03-25 18:55       ` Filipe Manana
@ 2026-03-25 22:24         ` Boris Burkov
  0 siblings, 0 replies; 10+ messages in thread
From: Boris Burkov @ 2026-03-25 22:24 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, kernel-team

On Wed, Mar 25, 2026 at 06:55:49PM +0000, Filipe Manana wrote:
> On Wed, Mar 25, 2026 at 6:39 PM Boris Burkov <boris@bur.io> wrote:
> >
> > On Wed, Mar 25, 2026 at 03:36:21PM +0000, Filipe Manana wrote:
> > > On Wed, Mar 25, 2026 at 12:45 AM Boris Burkov <boris@bur.io> wrote:
> > > >
> > > > delalloc uses a per-inode block_rsv to perform metadata reservations for
> > > > the cow operations it anticipates based on the number of outstanding
> > > > extents. This calculation is done based on inode->outstanding_extents in
> > > > btrfs_calculate_inode_block_rsv_size(). The reservation is *not*
> > > > meticulously tracked as each ordered_extent is actually created in
> > > > writeback, but rather delalloc attempts to over-estimate and the
> > > > writeback and ordered_extent finish portions are responsible for
> > > > releasing all of the reservation.
> > > >
> > > > However, there is a significant gap in this reservation: it reserves
> > > > no space for the resulting delayed_refs. Compared to how
> > > > btrfs_start_transaction() reservations work, this is a notable
> > > > difference.
> > > >
> > > > As writeback actually occurs, and we trigger btrfs_finish_one_ordered(),
> > > > that function will start generating delayed refs, which will draw from
> > > > the trans_handle's delayed_refs_rsv via btrfs_update_delayed_refs_rsv():
> > > >
> > > > btrfs_finish_one_ordered()
> > > >   insert_ordered_extent_file_extent()
> > > >     insert_reserved_file_extent()
> > > >       btrfs_alloc_reserved_file_extent()
> > > >         btrfs_add_delayed_data_ref()
> > > >           add_delayed_ref()
> > > >             btrfs_update_delayed_refs_rsv();
> > > >
> > > > This trans_handle was created in finish_one_ordered() with
> > > > btrfs_join_transaction(), which calls start_transaction() with
> > > > num_items=0 and BTRFS_RESERVE_NO_FLUSH. As a result, this trans_handle
> > > > has nothing reserved in h->delayed_rsv, as neither the num_items
> > > > reservation nor the btrfs_delayed_refs_rsv_refill() reservation is run.
> > > >
> > > > Thus, when btrfs_update_delayed_refs_rsv() runs, reserved_bytes is 0 and
> > > > fs_info->delayed_rsv->size grows but not fs_info->delayed_rsv->reserved.
> > > >
> > > > If a large amount of writeback happens all at once (perhaps due to
> > > > dirty_ratio being tuned too high), this results in, among other
> > > > things, erroneous assessments of the amount of delayed_refs reserved
> > > > in the metadata space reclaim logic, like need_preemptive_reclaim(),
> > > > which relies on fs_info->delayed_rsv->reserved, and, even worse, poor
> > > > decision making in btrfs_preempt_reclaim_metadata_space(), which
> > > > counts delalloc_bytes like so:
> > > >
> > > >   block_rsv_size = global_rsv_size +
> > > >           btrfs_block_rsv_reserved(delayed_block_rsv) +
> > > >           btrfs_block_rsv_reserved(delayed_refs_rsv) +
> > > >           btrfs_block_rsv_reserved(trans_rsv);
> > > >   delalloc_size = bytes_may_use - block_rsv_size;
> > > >
> > > > So all that lost delayed refs usage gets accounted as delalloc_size and
> > > > leads to preemptive reclaim continuously choosing FLUSH_DELALLOC, which
> > > > further exacerbates the problem.
> > > >
> > > > With enough writeback around, we can run enough delalloc that we get
> > > > into async reclaim which starts blocking start_transaction() and
> > > > eventually hits FLUSH_DELALLOC_WAIT/FLUSH_DELALLOC_FULL at which point
> > > > the filesystem gets heavily blocked on metadata space in reserve_space(),
> > > > blocking all new transaction work until all the ordered_extents finish.
> > > >
> > > > If we had an accurate view of the reservation for delayed refs, then we
> > > > could mostly break this feedback loop in preemptive reclaim, and
> > > > generally would be able to make more accurate decisions with regards to
> > > > metadata space reclamation.
> > > >
> > > > This patch introduces the mechanism of a per-inode delayed_refs rsv
> > > > which is modeled closely after the same in trans_handle. The delalloc
> > > > reservation also reserves delayed refs and then finish_one_ordered
> > > > transfers the inode delayed_refs rsv into the trans_handle one, just
> > > > like inode->block_rsv.
> > > >
> > > > This is not a perfect fix for the most pathological cases, but is the
> > > > infrastructure needed to keep working on the problem.
> > > >
> > > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > > ---
> > > >  fs/btrfs/btrfs_inode.h    |  3 +++
> > > >  fs/btrfs/delalloc-space.c | 34 ++++++++++++++++++++++++++++++----
> > > >  fs/btrfs/delayed-ref.c    |  2 +-
> > > >  fs/btrfs/inode.c          |  9 ++++++++-
> > > >  fs/btrfs/transaction.c    |  7 ++++---
> > > >  fs/btrfs/transaction.h    |  3 ++-
> > > >  6 files changed, 48 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> > > > index 55c272fe5d92..dca4f6df7e95 100644
> > > > --- a/fs/btrfs/btrfs_inode.h
> > > > +++ b/fs/btrfs/btrfs_inode.h
> > > > @@ -328,6 +328,9 @@ struct btrfs_inode {
> > > >
> > > >         struct btrfs_block_rsv block_rsv;
> > > >
> > > > +       /* Reserve for delayed refs generated by ordered extent completion. */
> > > > +       struct btrfs_block_rsv delayed_rsv;
> > >
> > > Not that long ago we had an effort to decrease the btrfs_inode
> > > structure size down to less than 1024 bytes, so that we could have 4
> > > inodes per 4K page instead of 3, and this change now makes the
> > > structure larger than 1024 bytes again.
> >
> > Good catch, thanks. And sorry for missing it.
> >
> > >
> > > Instead of adding another block reserve to the inode we could:
> > >
> > > 1) Add the reservations for delayed refs in the existing block reserve
> > > (inode->block_rsv);
> > >
> > > 2) When finishing the ordered extent, after joining the transaction
> > > and setting trans->block_rsv to inode->block_rsv, we could migrate the
> > > space reserved for delayed refs from the inode->block_rsv into
> > > trans->delayed_rsv.
> >
> > At first blush, without trying very hard yet, I don't love this because I
> > think it means tracking or re-computing the delayed_refs portion of
> > inode->block_rsv for the migration.
> 
> Re-computing (calling btrfs_calc_delayed_ref_bytes()) is not
> expensive, and it is done in a workqueue for finishing ordered extents.
> I wouldn't worry about that. But tracking it somewhere, implies
> increasing the structure size - if it stays under or at 1024 bytes,
> it's fine.
> 

I'm more concerned about the code being unclear and carrying latent
state in an overloaded block_rsv. Metadata reservations are complicated
enough as it is, so I was really hoping to increase clarity with an
explicit block_rsv mirroring the design of trans_handle.

> > However, I'm sure I can make it work
> > and it's certainly worth doing to not regress the size of struct
> > btrfs_inode.
> >
> > What do you think of just allocating inode->delayed_rsv indirectly with
> > btrfs_alloc_block_rsv? We can save another ~100 bytes on struct
> > btrfs_inode.
> 
> Adding an allocation adds potential for -ENOMEM failures.
> And for 4K pages we will still be able to allocate 4 inodes per page,
> so no gains here.
> For a 64K page we can get a few more inodes (7 or 8).
> 
> >  And we could do it for inode->block_rsv to save yet more
> > space. Do you know if we have experience that it's really important for
> > inode->block_rsv to be embedded in struct btrfs_inode?
> 
> The reason we have it embedded is that it saves a memory
> allocation, and it may also be faster and have better cache line
> spatial locality.

I understand the general tradeoffs.

I was asking if we know that it's specifically important or if it's a
general rule. There are other structs in btrfs_inode that are allocated
and other allocations that occur while populating and using btrfs_inode.
(delayed_node, extent_io_tree, path/search_slot, extent_changeset in
delalloc, etc.) So we could allocate the block_rsvs in a similar way
(when loading the inode, when doing delalloc for the first time, etc.)

But I understand that you don't want to risk regressions, so I can go
ahead and do it your way if you feel it's the best way to go.

> In the end we won't get any benefit unless the page size is > 4K so
> that we can get more inodes per page.
> 

Well, there is a benefit: more of a buffer for future additions to the
struct.

Thanks,
Boris

> >
> > If that's a no-go, I will work on using just inode->block_rsv and doing
> > the migration.
> >
> > >
> > > This would not require increasing the btrfs_inode structure and
> > > neither add a _local_delayed_rsv field to the transaction handle (and
> > > btw, we don't use the _ prefix for any structure fields anywhere in
> > > btrfs).
> >
> > Noted, thanks.
> >
> > >
> > > Thanks.
> > >
> > >
> > > > +
> > > >         struct btrfs_delayed_node *delayed_node;
> > > >
> > > >         /* File creation time. */
> > > > diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> > > > index 0970799d0aa4..e2944ff4fe47 100644
> > > > --- a/fs/btrfs/delalloc-space.c
> > > > +++ b/fs/btrfs/delalloc-space.c
> > > > @@ -3,6 +3,7 @@
> > > >  #include "messages.h"
> > > >  #include "ctree.h"
> > > >  #include "delalloc-space.h"
> > > > +#include "delayed-ref.h"
> > > >  #include "block-rsv.h"
> > > >  #include "btrfs_inode.h"
> > > >  #include "space-info.h"
> > > > @@ -240,6 +241,13 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
> > > >         if (released > 0)
> > > >                 trace_btrfs_space_reservation(fs_info, "delalloc",
> > > >                                               btrfs_ino(inode), released, 0);
> > > > +
> > > > +       released = btrfs_block_rsv_release(fs_info, &inode->delayed_rsv,
> > > > +                                          0, NULL);
> > > > +       if (released > 0)
> > > > +               trace_btrfs_space_reservation(fs_info, "delalloc_delayed_refs",
> > > > +                                             btrfs_ino(inode), released, 0);
> > > > +
> > > >         if (qgroup_free)
> > > >                 btrfs_qgroup_free_meta_prealloc(inode->root, qgroup_to_release);
> > > >         else
> > > > @@ -251,7 +259,9 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> > > >                                                  struct btrfs_inode *inode)
> > > >  {
> > > >         struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> > > > +       struct btrfs_block_rsv *delayed_rsv = &inode->delayed_rsv;
> > > >         u64 reserve_size = 0;
> > > > +       u64 delayed_refs_size = 0;
> > > >         u64 qgroup_rsv_size = 0;
> > > >         unsigned outstanding_extents;
> > > >
> > > > @@ -266,6 +276,8 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> > > >                 reserve_size = btrfs_calc_insert_metadata_size(fs_info,
> > > >                                                 outstanding_extents);
> > > >                 reserve_size += btrfs_calc_metadata_size(fs_info, 1);
> > > > +               delayed_refs_size += btrfs_calc_delayed_ref_bytes(fs_info,
> > > > +                                               outstanding_extents);
> > > >         }
> > > >         if (!(inode->flags & BTRFS_INODE_NODATASUM)) {
> > > >                 u64 csum_leaves;
> > > > @@ -285,11 +297,17 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> > > >         block_rsv->size = reserve_size;
> > > >         block_rsv->qgroup_rsv_size = qgroup_rsv_size;
> > > >         spin_unlock(&block_rsv->lock);
> > > > +
> > > > +       spin_lock(&delayed_rsv->lock);
> > > > +       delayed_rsv->size = delayed_refs_size;
> > > > +       spin_unlock(&delayed_rsv->lock);
> > > >  }
> > > >
> > > >  static void calc_inode_reservations(struct btrfs_inode *inode,
> > > >                                     u64 num_bytes, u64 disk_num_bytes,
> > > > -                                   u64 *meta_reserve, u64 *qgroup_reserve)
> > > > +                                   u64 *meta_reserve,
> > > > +                                   u64 *delayed_refs_reserve,
> > > > +                                   u64 *qgroup_reserve)
> > > >  {
> > > >         struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > > >         u64 nr_extents = count_max_extents(fs_info, num_bytes);
> > > > @@ -309,6 +327,10 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
> > > >          * for an inode update.
> > > >          */
> > > >         *meta_reserve += inode_update;
> > > > +
> > > > +       *delayed_refs_reserve = btrfs_calc_delayed_ref_bytes(fs_info,
> > > > +                                                            nr_extents);
> > > > +
> > > >         *qgroup_reserve = nr_extents * fs_info->nodesize;
> > > >  }
> > > >
> > > > @@ -318,7 +340,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> > > >         struct btrfs_root *root = inode->root;
> > > >         struct btrfs_fs_info *fs_info = root->fs_info;
> > > >         struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> > > > -       u64 meta_reserve, qgroup_reserve;
> > > > +       u64 meta_reserve, delayed_refs_reserve, qgroup_reserve;
> > > >         unsigned nr_extents;
> > > >         enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
> > > >         int ret = 0;
> > > > @@ -353,12 +375,14 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> > > >          * over-reserve slightly, and clean up the mess when we are done.
> > > >          */
> > > >         calc_inode_reservations(inode, num_bytes, disk_num_bytes,
> > > > -                               &meta_reserve, &qgroup_reserve);
> > > > +                               &meta_reserve, &delayed_refs_reserve,
> > > > +                               &qgroup_reserve);
> > > >         ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_reserve, true,
> > > >                                                  noflush);
> > > >         if (ret)
> > > >                 return ret;
> > > > -       ret = btrfs_reserve_metadata_bytes(block_rsv->space_info, meta_reserve,
> > > > +       ret = btrfs_reserve_metadata_bytes(block_rsv->space_info,
> > > > +                                          meta_reserve + delayed_refs_reserve,
> > > >                                            flush);
> > > >         if (ret) {
> > > >                 btrfs_qgroup_free_meta_prealloc(root, qgroup_reserve);
> > > > @@ -383,6 +407,8 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> > > >         btrfs_block_rsv_add_bytes(block_rsv, meta_reserve, false);
> > > >         trace_btrfs_space_reservation(root->fs_info, "delalloc",
> > > >                                       btrfs_ino(inode), meta_reserve, 1);
> > > > +       btrfs_block_rsv_add_bytes(&inode->delayed_rsv, delayed_refs_reserve,
> > > > +                                 false);
> > > >
> > > >         spin_lock(&block_rsv->lock);
> > > >         block_rsv->qgroup_rsv_reserved += qgroup_reserve;
> > > > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> > > > index 605858c2d9a9..9fe9cec1bef3 100644
> > > > --- a/fs/btrfs/delayed-ref.c
> > > > +++ b/fs/btrfs/delayed-ref.c
> > > > @@ -89,7 +89,7 @@ void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans)
> > > >  {
> > > >         struct btrfs_fs_info *fs_info = trans->fs_info;
> > > >         struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_refs_rsv;
> > > > -       struct btrfs_block_rsv *local_rsv = &trans->delayed_rsv;
> > > > +       struct btrfs_block_rsv *local_rsv = trans->delayed_rsv;
> > > >         u64 num_bytes;
> > > >         u64 reserved_bytes;
> > > >
> > > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > > index 1a4e6a9239ae..1f0f3282e4b8 100644
> > > > --- a/fs/btrfs/inode.c
> > > > +++ b/fs/btrfs/inode.c
> > > > @@ -653,6 +653,7 @@ static noinline int __cow_file_range_inline(struct btrfs_inode *inode,
> > > >                 goto out;
> > > >         }
> > > >         trans->block_rsv = &inode->block_rsv;
> > > > +       trans->delayed_rsv = &inode->delayed_rsv;
> > > >
> > > >         drop_args.path = path;
> > > >         drop_args.start = 0;
> > > > @@ -3256,6 +3257,7 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
> > > >         }
> > > >
> > > >         trans->block_rsv = &inode->block_rsv;
> > > > +       trans->delayed_rsv = &inode->delayed_rsv;
> > > >
> > > >         ret = btrfs_insert_raid_extent(trans, ordered_extent);
> > > >         if (unlikely(ret)) {
> > > > @@ -8074,9 +8076,12 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
> > > >
> > > >         spin_lock_init(&ei->lock);
> > > >         ei->outstanding_extents = 0;
> > > > -       if (sb->s_magic != BTRFS_TEST_MAGIC)
> > > > +       if (sb->s_magic != BTRFS_TEST_MAGIC) {
> > > >                 btrfs_init_metadata_block_rsv(fs_info, &ei->block_rsv,
> > > >                                               BTRFS_BLOCK_RSV_DELALLOC);
> > > > +               btrfs_init_metadata_block_rsv(fs_info, &ei->delayed_rsv,
> > > > +                                             BTRFS_BLOCK_RSV_DELREFS);
> > > > +       }
> > > >         ei->runtime_flags = 0;
> > > >         ei->prop_compress = BTRFS_COMPRESS_NONE;
> > > >         ei->defrag_compress = BTRFS_COMPRESS_NONE;
> > > > @@ -8132,6 +8137,8 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
> > > >         WARN_ON(vfs_inode->i_data.nrpages);
> > > >         WARN_ON(inode->block_rsv.reserved);
> > > >         WARN_ON(inode->block_rsv.size);
> > > > +       WARN_ON(inode->delayed_rsv.reserved);
> > > > +       WARN_ON(inode->delayed_rsv.size);
> > > >         WARN_ON(inode->outstanding_extents);
> > > >         if (!S_ISDIR(vfs_inode->i_mode)) {
> > > >                 WARN_ON(inode->delalloc_bytes);
> > > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > > > index 4358f4b63057..a55f8996cd59 100644
> > > > --- a/fs/btrfs/transaction.c
> > > > +++ b/fs/btrfs/transaction.c
> > > > @@ -737,7 +737,8 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
> > > >
> > > >         h->type = type;
> > > >         INIT_LIST_HEAD(&h->new_bgs);
> > > > -       btrfs_init_metadata_block_rsv(fs_info, &h->delayed_rsv, BTRFS_BLOCK_RSV_DELREFS);
> > > > +       h->delayed_rsv = &h->_local_delayed_rsv;
> > > > +       btrfs_init_metadata_block_rsv(fs_info, h->delayed_rsv, BTRFS_BLOCK_RSV_DELREFS);
> > > >
> > > >         smp_mb();
> > > >         if (cur_trans->state >= TRANS_STATE_COMMIT_START &&
> > > > @@ -758,7 +759,7 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
> > > >                                                       h->transid,
> > > >                                                       delayed_refs_bytes, 1);
> > > >                         h->delayed_refs_bytes_reserved = delayed_refs_bytes;
> > > > -                       btrfs_block_rsv_add_bytes(&h->delayed_rsv, delayed_refs_bytes, true);
> > > > +                       btrfs_block_rsv_add_bytes(h->delayed_rsv, delayed_refs_bytes, true);
> > > >                         delayed_refs_bytes = 0;
> > > >                 }
> > > >                 h->reloc_reserved = reloc_reserved;
> > > > @@ -1067,7 +1068,7 @@ static void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans)
> > > >         trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
> > > >                                       trans->transid,
> > > >                                       trans->delayed_refs_bytes_reserved, 0);
> > > > -       btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
> > > > +       btrfs_block_rsv_release(fs_info, trans->delayed_rsv,
> > > >                                 trans->delayed_refs_bytes_reserved, NULL);
> > > >         trans->delayed_refs_bytes_reserved = 0;
> > > >  }
> > > > diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> > > > index 7d70fe486758..268a415c4f32 100644
> > > > --- a/fs/btrfs/transaction.h
> > > > +++ b/fs/btrfs/transaction.h
> > > > @@ -162,7 +162,8 @@ struct btrfs_trans_handle {
> > > >         bool in_fsync;
> > > >         struct btrfs_fs_info *fs_info;
> > > >         struct list_head new_bgs;
> > > > -       struct btrfs_block_rsv delayed_rsv;
> > > > +       struct btrfs_block_rsv *delayed_rsv;
> > > > +       struct btrfs_block_rsv _local_delayed_rsv;
> > > >         /* Extent buffers with writeback inhibited by this handle. */
> > > >         struct xarray writeback_inhibited_ebs;
> > > >  };
> > > > --
> > > > 2.53.0
> > > >
> > > >


end of thread, other threads:[~2026-03-25 22:24 UTC | newest]

Thread overview: 10+ messages
2026-03-25  0:41 [PATCH 0/5] btrfs: improve stalls under sudden writeback Boris Burkov
2026-03-25  0:41 ` [PATCH 1/5] btrfs: reserve space for delayed_refs in delalloc Boris Burkov
2026-03-25 15:36   ` Filipe Manana
2026-03-25 18:39     ` Boris Burkov
2026-03-25 18:55       ` Filipe Manana
2026-03-25 22:24         ` Boris Burkov
2026-03-25  0:41 ` [PATCH 2/5] btrfs: account for csum " Boris Burkov
2026-03-25  0:41 ` [PATCH 3/5] btrfs: account for compression in delalloc extent reservation Boris Burkov
2026-03-25  0:41 ` [PATCH 4/5] btrfs: make inode->outstanding_extents a u64 Boris Burkov
2026-03-25  0:41 ` [PATCH 5/5] btrfs: cap shrink_delalloc iterations to 128M Boris Burkov
