From: Boris Burkov <boris@bur.io>
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 1/5] btrfs: reserve space for delayed_refs in delalloc
Date: Tue, 24 Mar 2026 17:41:49 -0700
Message-ID: <4828515e28d350985e7b7b9d3a58a5990b74362d.1774398665.git.boris@bur.io>
In-Reply-To: <cover.1774398665.git.boris@bur.io>

delalloc uses a per-inode block_rsv to perform metadata reservations for
the cow operations it anticipates based on the number of outstanding
extents. This calculation is done based on inode->outstanding_extents in
btrfs_calculate_inode_block_rsv_size(). The reservation is *not*
meticulously tracked as each ordered_extent is actually created in
writeback; rather, delalloc attempts to over-estimate, and the
writeback and ordered_extent finish paths are responsible for releasing
the entire reservation.

However, there is a notable gap in this reservation: it reserves no
space for the resulting delayed_refs. This is a notable difference from
how btrfs_start_transaction() reservations work.

As writeback actually occurs, and we trigger btrfs_finish_one_ordered(),
that function will start generating delayed refs, which will draw from
the trans_handle's delayed_refs_rsv via btrfs_update_delayed_refs_rsv():

btrfs_finish_one_ordered()
  insert_ordered_extent_file_extent()
    insert_reserved_file_extent()
      btrfs_alloc_reserved_file_extent()
        btrfs_add_delayed_data_ref()
          add_delayed_ref()
            btrfs_update_delayed_refs_rsv();

This trans_handle was created in btrfs_finish_one_ordered() with
btrfs_join_transaction(), which calls start_transaction() with
num_items=0 and BTRFS_RESERVE_NO_FLUSH. As a result, this trans_handle
has nothing reserved in h->delayed_rsv, as neither the num_items
reservation nor the btrfs_delayed_refs_rsv_refill() reservation is run.

Thus, when btrfs_update_delayed_refs_rsv() runs, reserved_bytes is 0, so
fs_info->delayed_refs_rsv.size grows but
fs_info->delayed_refs_rsv.reserved does not.
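
As a rough illustration of the accounting described above, here is a
simplified userspace model, not the kernel implementation. The names
rsv_model and model_update_delayed_refs_rsv() are hypothetical, and
locking and the "full" flag are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct btrfs_block_rsv. */
struct rsv_model {
	uint64_t size;		/* bytes the reserve is expected to hold */
	uint64_t reserved;	/* bytes actually backed by a reservation */
};

/*
 * Model of one update: the global reserve's size grows by the full
 * need, but its reserved bytes grow only by what the handle's local
 * reserve can hand over. Returns the unbacked shortfall.
 */
static uint64_t model_update_delayed_refs_rsv(struct rsv_model *global,
					      struct rsv_model *local,
					      uint64_t num_bytes)
{
	uint64_t moved;

	assert(global != NULL && local != NULL);
	moved = local->reserved < num_bytes ? local->reserved : num_bytes;

	local->reserved -= moved;
	global->size += num_bytes;
	global->reserved += moved;
	return num_bytes - moved;
}
```

With an empty local reserve, as in the joined transaction above, the
whole need comes back as shortfall: size grows and reserved stays put.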

If a large amount of writeback happens all at once (perhaps due to
dirty_ratio being tuned too high), this results in, among other things,
erroneous assessments of the amount of delayed_refs reserved in the
metadata space reclaim logic, such as need_preemptive_reclaim(), which
relies on fs_info->delayed_refs_rsv.reserved. Even worse, it leads to
poor decision making in btrfs_preempt_reclaim_metadata_space(), which
computes delalloc_size like so:

  block_rsv_size = global_rsv_size +
          btrfs_block_rsv_reserved(delayed_block_rsv) +
          btrfs_block_rsv_reserved(delayed_refs_rsv) +
          btrfs_block_rsv_reserved(trans_rsv);
  delalloc_size = bytes_may_use - block_rsv_size;

So all that lost delayed refs usage gets accounted as delalloc_size and
leads to preemptive reclaim continuously choosing FLUSH_DELALLOC, which
further exacerbates the problem.
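
To make the misattribution concrete, here is a small sketch of that
arithmetic in userspace C. The struct and function names are
hypothetical, and the premise that the unbacked delayed ref bytes sit in
bytes_may_use is taken from the description above rather than verified
here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of the counters used by the calculation. */
struct space_sketch {
	uint64_t bytes_may_use;
	uint64_t global_rsv_size;
	uint64_t delayed_block_reserved;
	uint64_t delayed_refs_reserved;
	uint64_t trans_reserved;
};

/* Whatever is not claimed by a known reserve is treated as delalloc. */
static uint64_t sketch_delalloc_size(const struct space_sketch *s)
{
	uint64_t block_rsv_size = s->global_rsv_size +
				  s->delayed_block_reserved +
				  s->delayed_refs_reserved +
				  s->trans_reserved;

	assert(s->bytes_may_use >= block_rsv_size);
	return s->bytes_may_use - block_rsv_size;
}
```

For the same bytes_may_use, every byte missing from
delayed_refs_reserved shows up as one more byte of delalloc_size.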

With enough writeback around, we can run enough delalloc that we get
into async reclaim, which starts blocking start_transaction() and
eventually hits FLUSH_DELALLOC_WAIT/FLUSH_DELALLOC_FULL, at which point
the filesystem gets heavily blocked on metadata space in reserve_space(),
blocking all new transaction work until all the ordered_extents finish.

If we had an accurate view of the reservation for delayed refs, then we
could mostly break this feedback loop in preemptive reclaim, and
generally would be able to make more accurate decisions with regard to
metadata space reclamation.

This patch introduces a per-inode delayed_refs rsv, modeled closely on
the one in trans_handle. The delalloc reservation now also reserves
space for delayed refs, and btrfs_finish_one_ordered() then points the
trans_handle at the inode delayed_refs rsv, just as it does with
inode->block_rsv.
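
The indirection can be sketched in isolation as follows. This is a
userspace model with hypothetical names (rsv_sketch, inode_sketch,
handle_sketch and the sketch_* helpers), meant only to show the pattern
of a handle that defaults to its own embedded reserve and can be
redirected to an inode's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rsv_sketch {
	uint64_t size;
	uint64_t reserved;
};

struct inode_sketch {
	struct rsv_sketch block_rsv;
	struct rsv_sketch delayed_rsv;
};

struct handle_sketch {
	struct rsv_sketch *delayed_rsv;		/* reserve currently in use */
	struct rsv_sketch local_delayed_rsv;	/* default backing storage */
};

/* A fresh handle draws from its own, initially empty, reserve. */
static void sketch_handle_init(struct handle_sketch *h)
{
	h->local_delayed_rsv.size = 0;
	h->local_delayed_rsv.reserved = 0;
	h->delayed_rsv = &h->local_delayed_rsv;
}

/* Redirect the handle to the reserve that delalloc already funded. */
static void sketch_handle_use_inode(struct handle_sketch *h,
				    struct inode_sketch *inode)
{
	h->delayed_rsv = &inode->delayed_rsv;
}

/* Take up to num_bytes from the active reserve; returns bytes taken. */
static uint64_t sketch_handle_take(struct handle_sketch *h, uint64_t num_bytes)
{
	struct rsv_sketch *rsv = h->delayed_rsv;
	uint64_t taken;

	assert(rsv != NULL);
	taken = rsv->reserved < num_bytes ? rsv->reserved : num_bytes;
	rsv->reserved -= taken;
	return taken;
}
```

A handle left on its local reserve gets nothing back from
sketch_handle_take(), while one redirected to a funded inode reserve
draws the bytes from there.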

This is not a perfect fix for the most pathological cases, but it is
the infrastructure needed to keep working on the problem.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/btrfs_inode.h    |  3 +++
 fs/btrfs/delalloc-space.c | 34 ++++++++++++++++++++++++++++++----
 fs/btrfs/delayed-ref.c    |  2 +-
 fs/btrfs/inode.c          |  9 ++++++++-
 fs/btrfs/transaction.c    |  7 ++++---
 fs/btrfs/transaction.h    |  3 ++-
 6 files changed, 48 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 55c272fe5d92..dca4f6df7e95 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -328,6 +328,9 @@ struct btrfs_inode {
 
 	struct btrfs_block_rsv block_rsv;
 
+	/* Reserve for delayed refs generated by ordered extent completion. */
+	struct btrfs_block_rsv delayed_rsv;
+
 	struct btrfs_delayed_node *delayed_node;
 
 	/* File creation time. */
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index 0970799d0aa4..e2944ff4fe47 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -3,6 +3,7 @@
 #include "messages.h"
 #include "ctree.h"
 #include "delalloc-space.h"
+#include "delayed-ref.h"
 #include "block-rsv.h"
 #include "btrfs_inode.h"
 #include "space-info.h"
@@ -240,6 +241,13 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
 	if (released > 0)
 		trace_btrfs_space_reservation(fs_info, "delalloc",
 					      btrfs_ino(inode), released, 0);
+
+	released = btrfs_block_rsv_release(fs_info, &inode->delayed_rsv,
+					   0, NULL);
+	if (released > 0)
+		trace_btrfs_space_reservation(fs_info, "delalloc_delayed_refs",
+					      btrfs_ino(inode), released, 0);
+
 	if (qgroup_free)
 		btrfs_qgroup_free_meta_prealloc(inode->root, qgroup_to_release);
 	else
@@ -251,7 +259,9 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 						 struct btrfs_inode *inode)
 {
 	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
+	struct btrfs_block_rsv *delayed_rsv = &inode->delayed_rsv;
 	u64 reserve_size = 0;
+	u64 delayed_refs_size = 0;
 	u64 qgroup_rsv_size = 0;
 	unsigned outstanding_extents;
 
@@ -266,6 +276,8 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 		reserve_size = btrfs_calc_insert_metadata_size(fs_info,
 						outstanding_extents);
 		reserve_size += btrfs_calc_metadata_size(fs_info, 1);
+		delayed_refs_size += btrfs_calc_delayed_ref_bytes(fs_info,
+						outstanding_extents);
 	}
 	if (!(inode->flags & BTRFS_INODE_NODATASUM)) {
 		u64 csum_leaves;
@@ -285,11 +297,17 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 	block_rsv->size = reserve_size;
 	block_rsv->qgroup_rsv_size = qgroup_rsv_size;
 	spin_unlock(&block_rsv->lock);
+
+	spin_lock(&delayed_rsv->lock);
+	delayed_rsv->size = delayed_refs_size;
+	spin_unlock(&delayed_rsv->lock);
 }
 
 static void calc_inode_reservations(struct btrfs_inode *inode,
 				    u64 num_bytes, u64 disk_num_bytes,
-				    u64 *meta_reserve, u64 *qgroup_reserve)
+				    u64 *meta_reserve,
+				    u64 *delayed_refs_reserve,
+				    u64 *qgroup_reserve)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	u64 nr_extents = count_max_extents(fs_info, num_bytes);
@@ -309,6 +327,10 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
 	 * for an inode update.
 	 */
 	*meta_reserve += inode_update;
+
+	*delayed_refs_reserve = btrfs_calc_delayed_ref_bytes(fs_info,
+							     nr_extents);
+
 	*qgroup_reserve = nr_extents * fs_info->nodesize;
 }
 
@@ -318,7 +340,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
-	u64 meta_reserve, qgroup_reserve;
+	u64 meta_reserve, delayed_refs_reserve, qgroup_reserve;
 	unsigned nr_extents;
 	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
 	int ret = 0;
@@ -353,12 +375,14 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 	 * over-reserve slightly, and clean up the mess when we are done.
 	 */
 	calc_inode_reservations(inode, num_bytes, disk_num_bytes,
-				&meta_reserve, &qgroup_reserve);
+				&meta_reserve, &delayed_refs_reserve,
+				&qgroup_reserve);
 	ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_reserve, true,
 						 noflush);
 	if (ret)
 		return ret;
-	ret = btrfs_reserve_metadata_bytes(block_rsv->space_info, meta_reserve,
+	ret = btrfs_reserve_metadata_bytes(block_rsv->space_info,
+					   meta_reserve + delayed_refs_reserve,
 					   flush);
 	if (ret) {
 		btrfs_qgroup_free_meta_prealloc(root, qgroup_reserve);
@@ -383,6 +407,8 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 	btrfs_block_rsv_add_bytes(block_rsv, meta_reserve, false);
 	trace_btrfs_space_reservation(root->fs_info, "delalloc",
 				      btrfs_ino(inode), meta_reserve, 1);
+	btrfs_block_rsv_add_bytes(&inode->delayed_rsv, delayed_refs_reserve,
+				  false);
 
 	spin_lock(&block_rsv->lock);
 	block_rsv->qgroup_rsv_reserved += qgroup_reserve;
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 605858c2d9a9..9fe9cec1bef3 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -89,7 +89,7 @@ void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_refs_rsv;
-	struct btrfs_block_rsv *local_rsv = &trans->delayed_rsv;
+	struct btrfs_block_rsv *local_rsv = trans->delayed_rsv;
 	u64 num_bytes;
 	u64 reserved_bytes;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1a4e6a9239ae..1f0f3282e4b8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -653,6 +653,7 @@ static noinline int __cow_file_range_inline(struct btrfs_inode *inode,
 		goto out;
 	}
 	trans->block_rsv = &inode->block_rsv;
+	trans->delayed_rsv = &inode->delayed_rsv;
 
 	drop_args.path = path;
 	drop_args.start = 0;
@@ -3256,6 +3257,7 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
 	}
 
 	trans->block_rsv = &inode->block_rsv;
+	trans->delayed_rsv = &inode->delayed_rsv;
 
 	ret = btrfs_insert_raid_extent(trans, ordered_extent);
 	if (unlikely(ret)) {
@@ -8074,9 +8076,12 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 
 	spin_lock_init(&ei->lock);
 	ei->outstanding_extents = 0;
-	if (sb->s_magic != BTRFS_TEST_MAGIC)
+	if (sb->s_magic != BTRFS_TEST_MAGIC) {
 		btrfs_init_metadata_block_rsv(fs_info, &ei->block_rsv,
 					      BTRFS_BLOCK_RSV_DELALLOC);
+		btrfs_init_metadata_block_rsv(fs_info, &ei->delayed_rsv,
+					      BTRFS_BLOCK_RSV_DELREFS);
+	}
 	ei->runtime_flags = 0;
 	ei->prop_compress = BTRFS_COMPRESS_NONE;
 	ei->defrag_compress = BTRFS_COMPRESS_NONE;
@@ -8132,6 +8137,8 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
 	WARN_ON(vfs_inode->i_data.nrpages);
 	WARN_ON(inode->block_rsv.reserved);
 	WARN_ON(inode->block_rsv.size);
+	WARN_ON(inode->delayed_rsv.reserved);
+	WARN_ON(inode->delayed_rsv.size);
 	WARN_ON(inode->outstanding_extents);
 	if (!S_ISDIR(vfs_inode->i_mode)) {
 		WARN_ON(inode->delalloc_bytes);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 4358f4b63057..a55f8996cd59 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -737,7 +737,8 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 
 	h->type = type;
 	INIT_LIST_HEAD(&h->new_bgs);
-	btrfs_init_metadata_block_rsv(fs_info, &h->delayed_rsv, BTRFS_BLOCK_RSV_DELREFS);
+	h->delayed_rsv = &h->_local_delayed_rsv;
+	btrfs_init_metadata_block_rsv(fs_info, h->delayed_rsv, BTRFS_BLOCK_RSV_DELREFS);
 
 	smp_mb();
 	if (cur_trans->state >= TRANS_STATE_COMMIT_START &&
@@ -758,7 +759,7 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 						      h->transid,
 						      delayed_refs_bytes, 1);
 			h->delayed_refs_bytes_reserved = delayed_refs_bytes;
-			btrfs_block_rsv_add_bytes(&h->delayed_rsv, delayed_refs_bytes, true);
+			btrfs_block_rsv_add_bytes(h->delayed_rsv, delayed_refs_bytes, true);
 			delayed_refs_bytes = 0;
 		}
 		h->reloc_reserved = reloc_reserved;
@@ -1067,7 +1068,7 @@ static void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans)
 	trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
 				      trans->transid,
 				      trans->delayed_refs_bytes_reserved, 0);
-	btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
+	btrfs_block_rsv_release(fs_info, trans->delayed_rsv,
 				trans->delayed_refs_bytes_reserved, NULL);
 	trans->delayed_refs_bytes_reserved = 0;
 }
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 7d70fe486758..268a415c4f32 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -162,7 +162,8 @@ struct btrfs_trans_handle {
 	bool in_fsync;
 	struct btrfs_fs_info *fs_info;
 	struct list_head new_bgs;
-	struct btrfs_block_rsv delayed_rsv;
+	struct btrfs_block_rsv *delayed_rsv;
+	struct btrfs_block_rsv _local_delayed_rsv;
 	/* Extent buffers with writeback inhibited by this handle. */
 	struct xarray writeback_inhibited_ebs;
 };
-- 
2.53.0


Thread overview: 10+ messages
2026-03-25  0:41 [PATCH 0/5] btrfs: improve stalls under sudden writeback Boris Burkov
2026-03-25  0:41 ` Boris Burkov [this message]
2026-03-25 15:36   ` [PATCH 1/5] btrfs: reserve space for delayed_refs in delalloc Filipe Manana
2026-03-25 18:39     ` Boris Burkov
2026-03-25 18:55       ` Filipe Manana
2026-03-25 22:24         ` Boris Burkov
2026-03-25  0:41 ` [PATCH 2/5] btrfs: account for csum " Boris Burkov
2026-03-25  0:41 ` [PATCH 3/5] btrfs: account for compression in delalloc extent reservation Boris Burkov
2026-03-25  0:41 ` [PATCH 4/5] btrfs: make inode->outstanding_extents a u64 Boris Burkov
2026-03-25  0:41 ` [PATCH 5/5] btrfs: cap shrink_delalloc iterations to 128M Boris Burkov
