public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/5] btrfs: improve stalls under sudden writeback
@ 2026-04-03 22:27 Boris Burkov
  2026-04-03 22:27 ` [PATCH v2 1/5] btrfs: reserve space for delayed_refs in delalloc Boris Burkov
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Boris Burkov @ 2026-04-03 22:27 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

If you have a system with very large memory (TiBs) and a normal
percentage based dirty_ratio/dirty_background_ratio like the defaults of
20%/10%, then we can theoretically rack up 100s of GiB of dirty pages
before doing any writeback. This is further exacerbated if we also see a
sudden drop in the free memory due to a large allocation. If we
(relatively likely for a large ram system) also have a large disk, we are
unlikely to do trigger much preemptive metadata reclaim either.

Once we do start doing writeback with such a large supply, the results
are somewhat ugly. The delalloc work generates a huge amount of delayed
refs without proper reservations which sends the metadata space system
into a tailspin trying to run yet more delalloc to free space.
Ultimately, the system stalls waiting for huge amounts of ordered
extents and delayed refs blocking all users in start_transaction() on
tickets in reserve_space().

This patch series aims to address these issues in a relatively targeted
way by improving our reservations for delalloc delayed refs and by doing
some very basic smoothing of the work in flush_space(). Further work
could be done to improve flush_space() heuristics and latency but this
is already a big help on my observed workloads.

I was able to reproduce stalls on a more "modest" system with 264GiB of
ram by using a somewhat silly 80% dirty_ratio.

I was unfortunately unable to reproduce any stalls on a yet smaller
system with only 32GiB of ram.

The first 3 patches do the delayed_ref rsv accounting on btrfs_inode,
mirroring inode->block_rsv.
The 4th patch is a cleanup to the types counting max extents
The 5th patch reduces the size of the unit of work in shrink_delalloc()
to further reduce stalls.
---
Changelog:
v2:
- patch 1 no longer embeds a new block_rsv on btrfs_inode for the
  delayed reservation. Instead it does the reservation on
  inode->block_rsv and migrates it to trans->delayed_rsv at the moment
  of truth.

Boris Burkov (5):
  btrfs: reserve space for delayed_refs in delalloc
  btrfs: account for csum delayed_refs in delalloc
  btrfs: account for compression in delalloc extent reservation
  btrfs: make inode->outstanding_extents a u64
  btrfs: cap shrink_delalloc iterations to 128M

 fs/btrfs/btrfs_inode.h       | 17 +++++--
 fs/btrfs/delalloc-space.c    | 78 +++++++++++++++++++++++-------
 fs/btrfs/delalloc-space.h    |  3 ++
 fs/btrfs/fs.h                | 13 -----
 fs/btrfs/inode.c             | 93 ++++++++++++++++++++++++++++--------
 fs/btrfs/ordered-data.c      |  4 +-
 fs/btrfs/space-info.c        | 31 ++++++++----
 fs/btrfs/tests/inode-tests.c | 18 +++----
 fs/btrfs/transaction.c       | 36 ++++++--------
 include/trace/events/btrfs.h |  8 ++--
 10 files changed, 201 insertions(+), 100 deletions(-)

-- 
2.53.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/5] btrfs: reserve space for delayed_refs in delalloc
  2026-04-03 22:27 [PATCH v2 0/5] btrfs: improve stalls under sudden writeback Boris Burkov
@ 2026-04-03 22:27 ` Boris Burkov
  2026-04-06 17:20   ` Filipe Manana
  2026-04-03 22:27 ` [PATCH v2 2/5] btrfs: account for csum " Boris Burkov
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Boris Burkov @ 2026-04-03 22:27 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

delalloc uses a per-inode block_rsv to perform metadata reservations for
the cow operations it anticipates based on the number of outstanding
extents. This calculation is done based on inode->outstanding_extents in
btrfs_calculate_inode_block_rsv_size(). The reservation is *not*
meticulously tracked as each ordered_extent is actually created in
writeback, but rather delalloc attempts to over-estimate and the
writeback and ordered_extent finish portions are responsible to release
all the reservation.

However, there is a notable gap in this reservation, it reserves no
space for the resulting delayed_refs. If you compare to how
btrfs_start_transaction() reservations work, this is a noteable
difference.

As writeback actually occurs, and we trigger btrfs_finish_one_ordered(),
that function will start generating delayed refs, which will draw from
the trans_handle's delayed_refs_rsv via btrfs_update_delayed_refs_rsv():

btrfs_finish_one_ordered()
  insert_ordered_extent_file_extent()
    insert_reserved_file_extent()
      btrfs_alloc_reserved_file_extent()
        btrfs_add_delayed_data_ref()
          add_delayed_ref()
            btrfs_update_delayed_refs_rsv();

This trans_handle was created in finish_one_ordered() with
btrfs_join_transaction() which calls start_transaction with
num_items=0 and BTRFS_RESERVE_NO_FLUSH. As a result, this trans_handle
has no reserved in h->delayed_rsv, as neither the num_items reservation
nor the btrfs_delayed_refs_rsv_refill() reservation is run.

Thus, when btrfs_update_delayed_refs_rsv() runs, reserved_bytes is 0 and
fs_info->delayed_rsv->size grows but not fs_info->delayed_rsv->reserved.

If a large amount of writeback happens all at once (perhaps due to
dirty_ratio being tuned too high), this results in, among other things,
erroneous assessments of the amount of delayed_refs reserved in the
metadata space reclaim logic, like need_preemptive_reclaim() which
relies on fs_info->delayed_rsv->reserved and even worse, poor decision
making in btrfs_preempt_reclaim_metadata_space() which counts
delalloc_bytes like so:

  block_rsv_size = global_rsv_size +
          btrfs_block_rsv_reserved(delayed_block_rsv) +
          btrfs_block_rsv_reserved(delayed_refs_rsv) +
          btrfs_block_rsv_reserved(trans_rsv);
  delalloc_size = bytes_may_use - block_rsv_size;

So all that lost delayed refs usage gets accounted as delalloc_size and
leads to preemptive reclaim continuously choosing FLUSH_DELALLOC, which
further exacerbates the problem.

With enough writeback around, we can run enough delalloc that we get
into async reclaim which starts blocking start_transaction() and
eventually hits FLUSH_DELALLOC_WAIT/FLUSH_DELALLOC_FULL at which point
the filesystem gets heavily blocked on metadata space in reserve_space(),
blocking all new transaction work until all the ordered_extents finish.

If we had an accurate view of the reservation for delayed refs, then we
could mostly break this feedback loop in preemptive reclaim, and
generally would be able to make more accurate decisions with regards to
metadata space reclamation.

This patch adds extra metadata reservation to the inode's block_rsv to
account for the delayed refs. When the ordered_extent finishes and we
are about to do work in the transaction that uses delayed refs, we
migrate enough for 1 extent. Since this is not necessarily perfect, we
have to be careful and do a "soft" migrate which succeeds even if there
is not enough reservation. This is strictly better than what we have and
also matches how the delayed ref rsv gets used in the transaction at
btrfs_update_delayed_refs_rsv()

This is not a perfect fix for the most pathological cases, but is the
infrastructure needed to keep working on the problem.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/delalloc-space.c | 43 ++++++++++++++++++++++++++++++++++++---
 fs/btrfs/delalloc-space.h |  3 +++
 fs/btrfs/inode.c          |  5 ++++-
 fs/btrfs/transaction.c    | 36 ++++++++++++++------------------
 4 files changed, 62 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index 0970799d0aa4..a087be55409f 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -3,11 +3,13 @@
 #include "messages.h"
 #include "ctree.h"
 #include "delalloc-space.h"
+#include "delayed-ref.h"
 #include "block-rsv.h"
 #include "btrfs_inode.h"
 #include "space-info.h"
 #include "qgroup.h"
 #include "fs.h"
+#include "transaction.h"
 
 /*
  * HOW DOES THIS WORK
@@ -240,6 +242,7 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
 	if (released > 0)
 		trace_btrfs_space_reservation(fs_info, "delalloc",
 					      btrfs_ino(inode), released, 0);
+
 	if (qgroup_free)
 		btrfs_qgroup_free_meta_prealloc(inode->root, qgroup_to_release);
 	else
@@ -247,6 +250,16 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
 						   qgroup_to_release);
 }
 
+/*
+ * Each delalloc extent could become an ordered_extent and end up inserting a
+ * new extent into the extent and free space trees. So we must reserve
+ * delayed ref space up front for that.
+ */
+static u64 delalloc_calc_delayed_refs_rsv(const struct btrfs_fs_info *fs_info, u64 nr_extents)
+{
+	return btrfs_calc_delayed_ref_bytes(fs_info, nr_extents);
+}
+
 static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 						 struct btrfs_inode *inode)
 {
@@ -266,6 +279,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 		reserve_size = btrfs_calc_insert_metadata_size(fs_info,
 						outstanding_extents);
 		reserve_size += btrfs_calc_metadata_size(fs_info, 1);
+		reserve_size += delalloc_calc_delayed_refs_rsv(fs_info, outstanding_extents);
 	}
 	if (!(inode->flags & BTRFS_INODE_NODATASUM)) {
 		u64 csum_leaves;
@@ -289,7 +303,8 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 
 static void calc_inode_reservations(struct btrfs_inode *inode,
 				    u64 num_bytes, u64 disk_num_bytes,
-				    u64 *meta_reserve, u64 *qgroup_reserve)
+				    u64 *meta_reserve,
+				    u64 *qgroup_reserve)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	u64 nr_extents = count_max_extents(fs_info, num_bytes);
@@ -309,9 +324,31 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
 	 * for an inode update.
 	 */
 	*meta_reserve += inode_update;
+
+	*meta_reserve += delalloc_calc_delayed_refs_rsv(fs_info, nr_extents);
+
 	*qgroup_reserve = nr_extents * fs_info->nodesize;
 }
 
+u64 btrfs_delalloc_migrate_delayed_refs_rsv(struct btrfs_trans_handle *trans,
+                                           struct btrfs_inode *inode)
+{
+       struct btrfs_block_rsv *inode_rsv = &inode->block_rsv;
+       struct btrfs_block_rsv *trans_rsv = &trans->delayed_rsv;
+       u64 num_bytes = delalloc_calc_delayed_refs_rsv(trans->fs_info, 1);
+
+       spin_lock(&inode_rsv->lock);
+       num_bytes = min(num_bytes, inode_rsv->reserved);
+       inode_rsv->reserved -= num_bytes;
+       inode_rsv->full = (inode_rsv->reserved >= inode_rsv->size);
+       spin_unlock(&inode_rsv->lock);
+
+       btrfs_block_rsv_add_bytes(trans_rsv, num_bytes, true);
+       trans->delayed_refs_bytes_reserved += num_bytes;
+
+       return num_bytes;
+}
+
 int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 				    u64 disk_num_bytes, bool noflush)
 {
@@ -358,8 +395,8 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 						 noflush);
 	if (ret)
 		return ret;
-	ret = btrfs_reserve_metadata_bytes(block_rsv->space_info, meta_reserve,
-					   flush);
+	ret = btrfs_reserve_metadata_bytes(block_rsv->space_info,
+					   meta_reserve, flush);
 	if (ret) {
 		btrfs_qgroup_free_meta_prealloc(root, qgroup_reserve);
 		return ret;
diff --git a/fs/btrfs/delalloc-space.h b/fs/btrfs/delalloc-space.h
index 6119c0d3f883..b058e884879b 100644
--- a/fs/btrfs/delalloc-space.h
+++ b/fs/btrfs/delalloc-space.h
@@ -8,6 +8,7 @@
 struct extent_changeset;
 struct btrfs_inode;
 struct btrfs_fs_info;
+struct btrfs_trans_handle;
 
 int btrfs_alloc_data_chunk_ondemand(const struct btrfs_inode *inode, u64 bytes);
 int btrfs_check_data_free_space(struct btrfs_inode *inode,
@@ -27,5 +28,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 				    u64 disk_num_bytes, bool noflush);
 void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
 void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len);
+u64 btrfs_delalloc_migrate_delayed_refs_rsv(struct btrfs_trans_handle *trans,
+                                           struct btrfs_inode *inode);
 
 #endif /* BTRFS_DELALLOC_SPACE_H */
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 808e52aa6ef2..c57473030814 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -653,6 +653,7 @@ static noinline int __cow_file_range_inline(struct btrfs_inode *inode,
 		goto out;
 	}
 	trans->block_rsv = &inode->block_rsv;
+	btrfs_delalloc_migrate_delayed_refs_rsv(trans, inode);
 
 	drop_args.path = path;
 	drop_args.start = 0;
@@ -3297,6 +3298,7 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
 		}
 	} else {
 		BUG_ON(root == fs_info->tree_root);
+		btrfs_delalloc_migrate_delayed_refs_rsv(trans, inode);
 		ret = insert_ordered_extent_file_extent(trans, ordered_extent);
 		if (unlikely(ret < 0)) {
 			btrfs_abort_transaction(trans, ret);
@@ -8074,9 +8076,10 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 
 	spin_lock_init(&ei->lock);
 	ei->outstanding_extents = 0;
-	if (sb->s_magic != BTRFS_TEST_MAGIC)
+	if (sb->s_magic != BTRFS_TEST_MAGIC) {
 		btrfs_init_metadata_block_rsv(fs_info, &ei->block_rsv,
 					      BTRFS_BLOCK_RSV_DELALLOC);
+	}
 	ei->runtime_flags = 0;
 	ei->prop_compress = BTRFS_COMPRESS_NONE;
 	ei->defrag_compress = BTRFS_COMPRESS_NONE;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 248adb785051..55791bb100a2 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1047,29 +1047,23 @@ static void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans)
 		return;
 	}
 
-	if (!trans->bytes_reserved) {
-		ASSERT(trans->delayed_refs_bytes_reserved == 0,
-		       "trans->delayed_refs_bytes_reserved=%llu",
-		       trans->delayed_refs_bytes_reserved);
-		return;
+	if (trans->bytes_reserved) {
+		ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
+		trace_btrfs_space_reservation(fs_info, "transaction",
+					trans->transid, trans->bytes_reserved, 0);
+		btrfs_block_rsv_release(fs_info, trans->block_rsv,
+					trans->bytes_reserved, NULL);
+		trans->bytes_reserved = 0;
 	}
 
-	ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
-	trace_btrfs_space_reservation(fs_info, "transaction",
-				      trans->transid, trans->bytes_reserved, 0);
-	btrfs_block_rsv_release(fs_info, trans->block_rsv,
-				trans->bytes_reserved, NULL);
-	trans->bytes_reserved = 0;
-
-	if (!trans->delayed_refs_bytes_reserved)
-		return;
-
-	trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
-				      trans->transid,
-				      trans->delayed_refs_bytes_reserved, 0);
-	btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
-				trans->delayed_refs_bytes_reserved, NULL);
-	trans->delayed_refs_bytes_reserved = 0;
+	if (trans->delayed_refs_bytes_reserved) {
+		trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
+					trans->transid,
+					trans->delayed_refs_bytes_reserved, 0);
+		btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
+					trans->delayed_refs_bytes_reserved, NULL);
+		trans->delayed_refs_bytes_reserved = 0;
+	}
 }
 
 static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 2/5] btrfs: account for csum delayed_refs in delalloc
  2026-04-03 22:27 [PATCH v2 0/5] btrfs: improve stalls under sudden writeback Boris Burkov
  2026-04-03 22:27 ` [PATCH v2 1/5] btrfs: reserve space for delayed_refs in delalloc Boris Burkov
@ 2026-04-03 22:27 ` Boris Burkov
  2026-04-06 17:33   ` Filipe Manana
  2026-04-03 22:27 ` [PATCH v2 3/5] btrfs: account for compression in delalloc extent reservation Boris Burkov
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Boris Burkov @ 2026-04-03 22:27 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

As ordered_extents complete, they not only produce direct delayed refs,
they also add csums via add_pending_csums. This produces some number of
metadata delayed_refs as the csum tree is cow-ed. These refs are counted
against the trans_handle delayed_rsv, thanks to trans->adding_csums.

As a result, just like we account for the extent tree and free space
tree when reserving delayed_refs for delalloc, we must also reserve for
the csum tree. This is mirrored by the non-delayed-ref metadata
reservation already accounting for csums.

This serves to ensure we have a proper worst case estimate for
delayed_rsv from delalloc.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/delalloc-space.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index a087be55409f..c2e5c7758ed5 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -252,12 +252,19 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
 
 /*
  * Each delalloc extent could become an ordered_extent and end up inserting a
- * new extent into the extent and free space trees. So we must reserve
- * delayed ref space up front for that.
+ * new extent into the extent and free space trees. It also may end up writing
+ * csums, so we must reserve delayed ref space up front for all of that.
  */
 static u64 delalloc_calc_delayed_refs_rsv(const struct btrfs_fs_info *fs_info, u64 nr_extents)
 {
-	return btrfs_calc_delayed_ref_bytes(fs_info, nr_extents);
+	/*
+	 * btrfs_calc_delayed_ref_bytes() does not assume csums as it is
+	 * also called from metadata writing contexts. So the extra
+	 * metadata insertion reservation represents the possibility of
+	 * delayed refs during csum tree writes.
+	 */
+	return btrfs_calc_delayed_ref_bytes(fs_info, nr_extents) +
+		btrfs_calc_insert_metadata_size(fs_info, nr_extents);
 }
 
 static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 3/5] btrfs: account for compression in delalloc extent reservation
  2026-04-03 22:27 [PATCH v2 0/5] btrfs: improve stalls under sudden writeback Boris Burkov
  2026-04-03 22:27 ` [PATCH v2 1/5] btrfs: reserve space for delayed_refs in delalloc Boris Burkov
  2026-04-03 22:27 ` [PATCH v2 2/5] btrfs: account for csum " Boris Burkov
@ 2026-04-03 22:27 ` Boris Burkov
  2026-04-06 17:44   ` Filipe Manana
  2026-04-03 22:27 ` [PATCH v2 4/5] btrfs: make inode->outstanding_extents a u64 Boris Burkov
  2026-04-03 22:27 ` [PATCH v2 5/5] btrfs: cap shrink_delalloc iterations to 128M Boris Burkov
  4 siblings, 1 reply; 12+ messages in thread
From: Boris Burkov @ 2026-04-03 22:27 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

The btrfs maximum uncompressed extent size is 128MiB. The maximum
compressed extent size in file extent space is 128KiB. Therefore, the
estimate for outstanding_extents is off by 3 orders of magnitude when
COMPRESS_FORCE is set or the inode is set to always compress.

Because we use re-calculation when necessary, rather than super detailed
extent tracking, we don't grow this reservation as the true number of
extents is revealed. We don't want to be too clever with it, however, as
we don't want the calculation to change for a given inode between
reservation and release, so we only rely on the forcing type flags.

With this change, we no longer under-reserve delayed refs reservations
for delalloc writes, even with compress-force.

Because this would turn count_max_extents() into a named shim for
div_u64(size + max_extent_size - 1, max_extent_size);
we can just get rid of it.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/btrfs_inode.h    |  3 ++
 fs/btrfs/delalloc-space.c | 13 ++++---
 fs/btrfs/fs.h             | 13 -------
 fs/btrfs/inode.c          | 78 ++++++++++++++++++++++++++++++++-------
 4 files changed, 74 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 55c272fe5d92..fad61b4d94a7 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -510,6 +510,9 @@ static inline bool btrfs_inode_can_compress(const struct btrfs_inode *inode)
 	return true;
 }
 
+u64 btrfs_inode_max_extent_size(const struct btrfs_inode *inode);
+u64 btrfs_inode_max_extents(const struct btrfs_inode *inode, u64 size);
+
 static inline void btrfs_assert_inode_locked(struct btrfs_inode *inode)
 {
 	/* Immediately trigger a crash if the inode is not locked. */
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index c2e5c7758ed5..2c99e332e25b 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -6,6 +6,7 @@
 #include "delayed-ref.h"
 #include "block-rsv.h"
 #include "btrfs_inode.h"
+#include "compression.h"
 #include "space-info.h"
 #include "qgroup.h"
 #include "fs.h"
@@ -65,7 +66,7 @@
  *     This is the number of file extent items we'll need to handle all of the
  *     outstanding DELALLOC space we have in this inode.  We limit the maximum
  *     size of an extent, so a large contiguous dirty area may require more than
- *     one outstanding_extent, which is why count_max_extents() is used to
+ *     one outstanding_extent, which is why we use the max extent size to
  *     determine how many outstanding_extents get added.
  *
  *   ->csum_bytes
@@ -314,7 +315,7 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
 				    u64 *qgroup_reserve)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	u64 nr_extents = count_max_extents(fs_info, num_bytes);
+	u64 nr_extents = btrfs_inode_max_extents(inode, num_bytes);
 	u64 csum_leaves;
 	u64 inode_update = btrfs_calc_metadata_size(fs_info, 1);
 
@@ -415,7 +416,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 	 * racing with an ordered completion or some such that would think it
 	 * needs to free the reservation we just made.
 	 */
-	nr_extents = count_max_extents(fs_info, num_bytes);
+	nr_extents = btrfs_inode_max_extents(inode, num_bytes);
 	spin_lock(&inode->lock);
 	btrfs_mod_outstanding_extents(inode, nr_extents);
 	if (!(inode->flags & BTRFS_INODE_NODATASUM))
@@ -482,7 +483,7 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
 	unsigned num_extents;
 
 	spin_lock(&inode->lock);
-	num_extents = count_max_extents(fs_info, num_bytes);
+	num_extents = btrfs_inode_max_extents(inode, num_bytes);
 	btrfs_mod_outstanding_extents(inode, -num_extents);
 	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
 	spin_unlock(&inode->lock);
@@ -497,8 +498,8 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
 void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	const u32 reserved_num_extents = count_max_extents(fs_info, reserved_len);
-	const u32 new_num_extents = count_max_extents(fs_info, new_len);
+	const u32 reserved_num_extents = btrfs_inode_max_extents(inode, reserved_len);
+	const u32 new_num_extents = btrfs_inode_max_extents(inode, new_len);
 	const int diff_num_extents = new_num_extents - reserved_num_extents;
 
 	ASSERT(new_len <= reserved_len);
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index a4758d94b32e..2c1626155645 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -1051,19 +1051,6 @@ static inline bool btrfs_is_zoned(const struct btrfs_fs_info *fs_info)
 	return IS_ENABLED(CONFIG_BLK_DEV_ZONED) && fs_info->zone_size > 0;
 }
 
-/*
- * Count how many fs_info->max_extent_size cover the @size
- */
-static inline u32 count_max_extents(const struct btrfs_fs_info *fs_info, u64 size)
-{
-#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
-	if (!fs_info)
-		return div_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, BTRFS_MAX_EXTENT_SIZE);
-#endif
-
-	return div_u64(size + fs_info->max_extent_size - 1, fs_info->max_extent_size);
-}
-
 static inline unsigned int btrfs_blocks_per_folio(const struct btrfs_fs_info *fs_info,
 						  const struct folio *folio)
 {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c57473030814..1e53ff65d1a7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -747,6 +747,56 @@ static int add_async_extent(struct async_chunk *cow, u64 start, u64 ram_size,
 	return 0;
 }
 
+/*
+ * Check if compression will definitely be attempted for this inode based on
+ * mount options and inode properties.  Unlike inode_need_compress(), this does
+ * NOT run the compression heuristic or check range-specific conditions, so it
+ * is safe to call under locks (e.g. io_tree lock) and for reservation sizing.
+ *
+ * Only returns true for cases where BTRFS_INODE_NOCOMPRESS cannot be set at
+ * runtime (FORCE_COMPRESS and prop_compress), ensuring that the effective max
+ * extent size is stable across paired set/clear delalloc operations.
+ */
+static inline bool inode_may_compress(const struct btrfs_inode *inode)
+{
+	if (!btrfs_inode_can_compress(inode))
+		return false;
+
+	/* force compress always attempts compression */
+	if (btrfs_test_opt(inode->root->fs_info, FORCE_COMPRESS))
+		return true;
+
+	/* per-inode property: NOCOMPRESS cannot override this */
+	if (inode->prop_compress)
+		return true;
+
+	return false;
+}
+
+/*
+ * Return the effective maximum extent size for reservation accounting.
+ *
+ * When compression is guaranteed to be attempted (FORCE_COMPRESS or
+ * prop_compress), the compression path splits ranges into
+ * BTRFS_MAX_UNCOMPRESSED chunks, each producing an independent ordered
+ * extent.  Use that as the divisor instead of fs_info->max_extent_size
+ * to avoid severely undercounting outstanding extents.
+ */
+u64 btrfs_inode_max_extent_size(const struct btrfs_inode *inode)
+{
+	if (inode_may_compress(inode))
+		return BTRFS_MAX_UNCOMPRESSED;
+
+	return inode->root->fs_info->max_extent_size;
+}
+
+u64 btrfs_inode_max_extents(const struct btrfs_inode *inode, u64 size)
+{
+	u64 max_extent_size = btrfs_inode_max_extent_size(inode);
+
+	return div_u64(size + max_extent_size - 1, max_extent_size);
+}
+
 /*
  * Check if the inode needs to be submitted to compression, based on mount
  * options, defragmentation, properties or heuristics.
@@ -2459,8 +2509,8 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct folio *locked_fol
 void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
 				 struct extent_state *orig, u64 split)
 {
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	u64 size;
+	u64 max_extent_size = btrfs_inode_max_extent_size(inode);
 
 	lockdep_assert_held(&inode->io_tree.lock);
 
@@ -2469,8 +2519,8 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
 		return;
 
 	size = orig->end - orig->start + 1;
-	if (size > fs_info->max_extent_size) {
-		u32 num_extents;
+	if (size > max_extent_size) {
+		u64 num_extents;
 		u64 new_size;
 
 		/*
@@ -2478,10 +2528,10 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
 		 * applies here, just in reverse.
 		 */
 		new_size = orig->end - split + 1;
-		num_extents = count_max_extents(fs_info, new_size);
+		num_extents = btrfs_inode_max_extents(inode, new_size);
 		new_size = split - orig->start;
-		num_extents += count_max_extents(fs_info, new_size);
-		if (count_max_extents(fs_info, size) >= num_extents)
+		num_extents += btrfs_inode_max_extents(inode, new_size);
+		if (btrfs_inode_max_extents(inode, size) >= num_extents)
 			return;
 	}
 
@@ -2498,9 +2548,9 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
 void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state *new,
 				 struct extent_state *other)
 {
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	u64 new_size, old_size;
-	u32 num_extents;
+	u64 max_extent_size = btrfs_inode_max_extent_size(inode);
+	u64 num_extents;
 
 	lockdep_assert_held(&inode->io_tree.lock);
 
@@ -2514,7 +2564,7 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
 		new_size = other->end - new->start + 1;
 
 	/* we're not bigger than the max, unreserve the space and go */
-	if (new_size <= fs_info->max_extent_size) {
+	if (new_size <= max_extent_size) {
 		spin_lock(&inode->lock);
 		btrfs_mod_outstanding_extents(inode, -1);
 		spin_unlock(&inode->lock);
@@ -2540,10 +2590,10 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
 	 * this case.
 	 */
 	old_size = other->end - other->start + 1;
-	num_extents = count_max_extents(fs_info, old_size);
+	num_extents = btrfs_inode_max_extents(inode, old_size);
 	old_size = new->end - new->start + 1;
-	num_extents += count_max_extents(fs_info, old_size);
-	if (count_max_extents(fs_info, new_size) >= num_extents)
+	num_extents += btrfs_inode_max_extents(inode, old_size);
+	if (btrfs_inode_max_extents(inode, new_size) >= num_extents)
 		return;
 
 	spin_lock(&inode->lock);
@@ -2616,7 +2666,7 @@ void btrfs_set_delalloc_extent(struct btrfs_inode *inode, struct extent_state *s
 	if (!(state->state & EXTENT_DELALLOC) && (bits & EXTENT_DELALLOC)) {
 		u64 len = state->end + 1 - state->start;
 		u64 prev_delalloc_bytes;
-		u32 num_extents = count_max_extents(fs_info, len);
+		u32 num_extents = btrfs_inode_max_extents(inode, len);
 
 		spin_lock(&inode->lock);
 		btrfs_mod_outstanding_extents(inode, num_extents);
@@ -2662,7 +2712,7 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	u64 len = state->end + 1 - state->start;
-	u32 num_extents = count_max_extents(fs_info, len);
+	u32 num_extents = btrfs_inode_max_extents(inode, len);
 
 	lockdep_assert_held(&inode->io_tree.lock);
 
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 4/5] btrfs: make inode->outstanding_extents a u64
  2026-04-03 22:27 [PATCH v2 0/5] btrfs: improve stalls under sudden writeback Boris Burkov
                   ` (2 preceding siblings ...)
  2026-04-03 22:27 ` [PATCH v2 3/5] btrfs: account for compression in delalloc extent reservation Boris Burkov
@ 2026-04-03 22:27 ` Boris Burkov
  2026-04-06 17:55   ` Filipe Manana
  2026-04-03 22:27 ` [PATCH v2 5/5] btrfs: cap shrink_delalloc iterations to 128M Boris Burkov
  4 siblings, 1 reply; 12+ messages in thread
From: Boris Burkov @ 2026-04-03 22:27 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

The maximum file size is MAX_LFS_FILESIZE = (loff_t)LLONG_MAX

As a result, the max extent size computation in btrfs has always been
bounded above by LLONG_MAX / 128MiB, which is ~ 2^63 / 2^27. This has
never fit in a u32. With the recent changes to also divide by 128KiB in
compressed cases, that bound is even higher. Whether or not it is likely
to happen, I think it is nice to try to capture the intent in the types,
so change outstanding_extents to u64, and make mod_outstanding_extents
try to capture some expectations around the size of its inputs.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/btrfs_inode.h       | 14 ++++++++++----
 fs/btrfs/delalloc-space.c    | 19 +++++++++----------
 fs/btrfs/inode.c             | 14 +++++++-------
 fs/btrfs/ordered-data.c      |  4 ++--
 fs/btrfs/tests/inode-tests.c | 18 +++++++++---------
 include/trace/events/btrfs.h |  8 ++++----
 6 files changed, 41 insertions(+), 36 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index fad61b4d94a7..259af3f6ebb5 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -180,7 +180,7 @@ struct btrfs_inode {
 	 * items we think we'll end up using, and reserved_extents is the number
 	 * of extent items we've reserved metadata for. Protected by 'lock'.
 	 */
-	unsigned outstanding_extents;
+	u64 outstanding_extents;
 
 	/* used to order data wrt metadata */
 	spinlock_t ordered_tree_lock;
@@ -429,14 +429,20 @@ static inline bool is_data_inode(const struct btrfs_inode *inode)
 }
 
 static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode,
-						 int mod)
+						 int mod, u64 nr_extents)
 {
+	s64 delta = mod * (s64)nr_extents;
+
 	lockdep_assert_held(&inode->lock);
-	inode->outstanding_extents += mod;
+	ASSERT(mod == 1 || mod == -1);
+	ASSERT(nr_extents <= S64_MAX);
+	ASSERT(mod == -1 || inode->outstanding_extents <= U64_MAX - nr_extents);
+	ASSERT(mod == 1 || inode->outstanding_extents >= nr_extents);
+	inode->outstanding_extents += delta;
 	if (btrfs_is_free_space_inode(inode))
 		return;
 	trace_btrfs_inode_mod_outstanding_extents(inode->root, btrfs_ino(inode),
-						  mod, inode->outstanding_extents);
+						  delta, inode->outstanding_extents);
 }
 
 /*
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index 2c99e332e25b..f4851cc7e2a8 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -274,7 +274,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
 	u64 reserve_size = 0;
 	u64 qgroup_rsv_size = 0;
-	unsigned outstanding_extents;
+	u64 outstanding_extents;
 
 	lockdep_assert_held(&inode->lock);
 	outstanding_extents = inode->outstanding_extents;
@@ -301,7 +301,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 	 *
 	 * This is overestimating in most cases.
 	 */
-	qgroup_rsv_size = (u64)outstanding_extents * fs_info->nodesize;
+	qgroup_rsv_size = outstanding_extents * fs_info->nodesize;
 
 	spin_lock(&block_rsv->lock);
 	block_rsv->size = reserve_size;
@@ -364,7 +364,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
 	u64 meta_reserve, qgroup_reserve;
-	unsigned nr_extents;
+	u64 nr_extents;
 	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
 	int ret = 0;
 
@@ -418,7 +418,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 	 */
 	nr_extents = btrfs_inode_max_extents(inode, num_bytes);
 	spin_lock(&inode->lock);
-	btrfs_mod_outstanding_extents(inode, nr_extents);
+	btrfs_mod_outstanding_extents(inode, 1, nr_extents);
 	if (!(inode->flags & BTRFS_INODE_NODATASUM))
 		inode->csum_bytes += disk_num_bytes;
 	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
@@ -480,11 +480,11 @@ void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes,
 void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	unsigned num_extents;
+	u64 num_extents;
 
 	spin_lock(&inode->lock);
 	num_extents = btrfs_inode_max_extents(inode, num_bytes);
-	btrfs_mod_outstanding_extents(inode, -num_extents);
+	btrfs_mod_outstanding_extents(inode, -1, num_extents);
 	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
 	spin_unlock(&inode->lock);
 
@@ -498,16 +498,15 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
 void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	const u32 reserved_num_extents = btrfs_inode_max_extents(inode, reserved_len);
-	const u32 new_num_extents = btrfs_inode_max_extents(inode, new_len);
-	const int diff_num_extents = new_num_extents - reserved_num_extents;
+	const u64 reserved_num_extents = btrfs_inode_max_extents(inode, reserved_len);
+	const u64 new_num_extents = btrfs_inode_max_extents(inode, new_len);
 
 	ASSERT(new_len <= reserved_len);
 	if (new_num_extents == reserved_num_extents)
 		return;
 
 	spin_lock(&inode->lock);
-	btrfs_mod_outstanding_extents(inode, diff_num_extents);
+	btrfs_mod_outstanding_extents(inode, -1, reserved_num_extents - new_num_extents);
 	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
 	spin_unlock(&inode->lock);
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1e53ff65d1a7..965c2739d33b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2536,7 +2536,7 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
 	}
 
 	spin_lock(&inode->lock);
-	btrfs_mod_outstanding_extents(inode, 1);
+	btrfs_mod_outstanding_extents(inode, 1, 1);
 	spin_unlock(&inode->lock);
 }
 
@@ -2566,7 +2566,7 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
 	/* we're not bigger than the max, unreserve the space and go */
 	if (new_size <= max_extent_size) {
 		spin_lock(&inode->lock);
-		btrfs_mod_outstanding_extents(inode, -1);
+		btrfs_mod_outstanding_extents(inode, -1, 1);
 		spin_unlock(&inode->lock);
 		return;
 	}
@@ -2597,7 +2597,7 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
 		return;
 
 	spin_lock(&inode->lock);
-	btrfs_mod_outstanding_extents(inode, -1);
+	btrfs_mod_outstanding_extents(inode, -1, 1);
 	spin_unlock(&inode->lock);
 }
 
@@ -2666,10 +2666,10 @@ void btrfs_set_delalloc_extent(struct btrfs_inode *inode, struct extent_state *s
 	if (!(state->state & EXTENT_DELALLOC) && (bits & EXTENT_DELALLOC)) {
 		u64 len = state->end + 1 - state->start;
 		u64 prev_delalloc_bytes;
-		u32 num_extents = btrfs_inode_max_extents(inode, len);
+		u64 num_extents = btrfs_inode_max_extents(inode, len);
 
 		spin_lock(&inode->lock);
-		btrfs_mod_outstanding_extents(inode, num_extents);
+		btrfs_mod_outstanding_extents(inode, 1, num_extents);
 		spin_unlock(&inode->lock);
 
 		/* For sanity tests */
@@ -2712,7 +2712,7 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	u64 len = state->end + 1 - state->start;
-	u32 num_extents = btrfs_inode_max_extents(inode, len);
+	u64 num_extents = btrfs_inode_max_extents(inode, len);
 
 	lockdep_assert_held(&inode->io_tree.lock);
 
@@ -2732,7 +2732,7 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
 		u64 new_delalloc_bytes;
 
 		spin_lock(&inode->lock);
-		btrfs_mod_outstanding_extents(inode, -num_extents);
+		btrfs_mod_outstanding_extents(inode, -1, num_extents);
 		spin_unlock(&inode->lock);
 
 		/*
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index d39f1c49d1cf..14b49cb33bb0 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -223,7 +223,7 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
 	 * smallest the extent is going to get.
 	 */
 	spin_lock(&inode->lock);
-	btrfs_mod_outstanding_extents(inode, 1);
+	btrfs_mod_outstanding_extents(inode, 1, 1);
 	spin_unlock(&inode->lock);
 
 out:
@@ -655,7 +655,7 @@ void btrfs_remove_ordered_extent(struct btrfs_ordered_extent *entry)
 	btrfs_lockdep_acquire(fs_info, btrfs_trans_pending_ordered);
 	/* This is paired with alloc_ordered_extent(). */
 	spin_lock(&btrfs_inode->lock);
-	btrfs_mod_outstanding_extents(btrfs_inode, -1);
+	btrfs_mod_outstanding_extents(btrfs_inode, -1, 1);
 	spin_unlock(&btrfs_inode->lock);
 	if (root != fs_info->tree_root) {
 		u64 release;
diff --git a/fs/btrfs/tests/inode-tests.c b/fs/btrfs/tests/inode-tests.c
index b04fbcaf0a1d..e63afbb9be2b 100644
--- a/fs/btrfs/tests/inode-tests.c
+++ b/fs/btrfs/tests/inode-tests.c
@@ -931,7 +931,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 	if (BTRFS_I(inode)->outstanding_extents != 1) {
 		ret = -EINVAL;
-		test_err("miscount, wanted 1, got %u",
+		test_err("miscount, wanted 1, got %llu",
 			 BTRFS_I(inode)->outstanding_extents);
 		goto out;
 	}
@@ -946,7 +946,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 	if (BTRFS_I(inode)->outstanding_extents != 2) {
 		ret = -EINVAL;
-		test_err("miscount, wanted 2, got %u",
+		test_err("miscount, wanted 2, got %llu",
 			 BTRFS_I(inode)->outstanding_extents);
 		goto out;
 	}
@@ -962,7 +962,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 	if (BTRFS_I(inode)->outstanding_extents != 2) {
 		ret = -EINVAL;
-		test_err("miscount, wanted 2, got %u",
+		test_err("miscount, wanted 2, got %llu",
 			 BTRFS_I(inode)->outstanding_extents);
 		goto out;
 	}
@@ -978,7 +978,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 	if (BTRFS_I(inode)->outstanding_extents != 2) {
 		ret = -EINVAL;
-		test_err("miscount, wanted 2, got %u",
+		test_err("miscount, wanted 2, got %llu",
 			 BTRFS_I(inode)->outstanding_extents);
 		goto out;
 	}
@@ -996,7 +996,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 	if (BTRFS_I(inode)->outstanding_extents != 4) {
 		ret = -EINVAL;
-		test_err("miscount, wanted 4, got %u",
+		test_err("miscount, wanted 4, got %llu",
 			 BTRFS_I(inode)->outstanding_extents);
 		goto out;
 	}
@@ -1013,7 +1013,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 	if (BTRFS_I(inode)->outstanding_extents != 3) {
 		ret = -EINVAL;
-		test_err("miscount, wanted 3, got %u",
+		test_err("miscount, wanted 3, got %llu",
 			 BTRFS_I(inode)->outstanding_extents);
 		goto out;
 	}
@@ -1029,7 +1029,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 	if (BTRFS_I(inode)->outstanding_extents != 4) {
 		ret = -EINVAL;
-		test_err("miscount, wanted 4, got %u",
+		test_err("miscount, wanted 4, got %llu",
 			 BTRFS_I(inode)->outstanding_extents);
 		goto out;
 	}
@@ -1047,7 +1047,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 	if (BTRFS_I(inode)->outstanding_extents != 3) {
 		ret = -EINVAL;
-		test_err("miscount, wanted 3, got %u",
+		test_err("miscount, wanted 3, got %llu",
 			 BTRFS_I(inode)->outstanding_extents);
 		goto out;
 	}
@@ -1061,7 +1061,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 	if (BTRFS_I(inode)->outstanding_extents) {
 		ret = -EINVAL;
-		test_err("miscount, wanted 0, got %u",
+		test_err("miscount, wanted 0, got %llu",
 			 BTRFS_I(inode)->outstanding_extents);
 		goto out;
 	}
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 8ad7a2d76c1d..caabdc8d9eed 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2003,15 +2003,15 @@ DEFINE_EVENT(btrfs__prelim_ref, btrfs_prelim_ref_insert,
 );
 
 TRACE_EVENT(btrfs_inode_mod_outstanding_extents,
-	TP_PROTO(const struct btrfs_root *root, u64 ino, int mod, unsigned outstanding),
+	TP_PROTO(const struct btrfs_root *root, u64 ino, s64 mod, u64 outstanding),
 
 	TP_ARGS(root, ino, mod, outstanding),
 
 	TP_STRUCT__entry_btrfs(
 		__field(	u64, root_objectid	)
 		__field(	u64, ino		)
-		__field(	int, mod		)
-		__field(	unsigned, outstanding	)
+		__field(	s64, mod		)
+		__field(	u64, outstanding	)
 	),
 
 	TP_fast_assign_btrfs(root->fs_info,
@@ -2021,7 +2021,7 @@ TRACE_EVENT(btrfs_inode_mod_outstanding_extents,
 		__entry->outstanding    = outstanding;
 	),
 
-	TP_printk_btrfs("root=%llu(%s) ino=%llu mod=%d outstanding=%u",
+	TP_printk_btrfs("root=%llu(%s) ino=%llu mod=%lld outstanding=%llu",
 			show_root_type(__entry->root_objectid),
 			__entry->ino, __entry->mod, __entry->outstanding)
 );
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 5/5] btrfs: cap shrink_delalloc iterations to 128M
  2026-04-03 22:27 [PATCH v2 0/5] btrfs: improve stalls under sudden writeback Boris Burkov
                   ` (3 preceding siblings ...)
  2026-04-03 22:27 ` [PATCH v2 4/5] btrfs: make inode->outstanding_extents a u64 Boris Burkov
@ 2026-04-03 22:27 ` Boris Burkov
  2026-04-06 18:08   ` Filipe Manana
  4 siblings, 1 reply; 12+ messages in thread
From: Boris Burkov @ 2026-04-03 22:27 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Even with more accurate delayed_refs reservations, preemptive reclaim is
not perfect and we might generate tickets, especially in cases with a
very large flood of writeback outstanding.

Ultimately, if we do get into a situation with tickets pending and async
reclaim blocking the system, we want to try to make as much progress as
quickly as possible to unblock tasks. We want space reclaim to be
effective, and to have a good chance at making progress, but not to
block arbitrarily as this leads to untenable syscall latencies, long
commits, and even hung task warnings.

I traced such cases of heavy writeback async reclaim hung tasks and
observed that we were blocking for long periods of time in
shrink_delalloc(). This was particularly bad when doing writeback of
incompressible data with the compress-force mount option.

e.g.
dd if=/dev/urandom of=urandom.seed bs=1G count=1
dd if=urandom.seed of=urandom.big bs=1G count=300

shrink_delalloc() computes to_reclaim as delalloc_bytes >> 3. With
hundreds of gigs of delalloc (again imagine a large dirty_ratio and lots
of ram), this is still 10-20+ GiB. Particularly in the wait phases, this
can be quite slow, and generates even more delayed-refs as mentioned in
the previous patch, so it doesn't even help that much with the immediate
space shortfall.

We do satisfy some tickets, but we are ultimately keep the system in
essentially the same state, and with long stalling reclaim calls into
shrink_delalloc().

It would be much better to start some good chunk of I/O and also to work
through the new delayed_refs and keep things moving through the system
while releasing the conservative over-estimated metadata reservations.

To acheive this, tighten up the delalloc work to be in units of the
maximum extent size. If we issue 128MiB of delalloc, we don't leave too
much (any?) extent merging on the table, but don't ever block on
pathological 10GiB+ chunks of delalloc. If we do detect that we
satisfied a ticket, break out of shrink_delalloc() and run some of the
new delayed_refs as well before going again. This way we strike a nice
balance of making delalloc progress, but not at the cost of every other
sort of reservation, as they all feed into each other.

This means iterating over to_reclaim by 128MiB at a time until it is
drained or we satisfy a ticket, rather than trying 3 times to do the
whole thing.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/space-info.c | 31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index f0436eea1544..15c5be97a123 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -727,7 +727,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
 	u64 ordered_bytes;
 	u64 items;
 	long time_left;
-	int loops;
+	u64 orig_tickets_id;
 
 	delalloc_bytes = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
 	ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);
@@ -735,9 +735,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
 		return;
 
 	/* Calc the number of the pages we need flush for space reservation */
-	if (to_reclaim == U64_MAX) {
-		items = U64_MAX;
-	} else {
+	if (to_reclaim != U64_MAX) {
 		/*
 		 * to_reclaim is set to however much metadata we need to
 		 * reclaim, but reclaiming that much data doesn't really track
@@ -751,7 +749,6 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
 		 * aggressive.
 		 */
 		to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
-		items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
 	}
 
 	trans = current->journal_info;
@@ -764,12 +761,17 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
 	if (ordered_bytes > delalloc_bytes && !for_preempt)
 		wait_ordered = true;
 
-	loops = 0;
-	while ((delalloc_bytes || ordered_bytes) && loops < 3) {
-		u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
-		long nr_pages = min_t(u64, temp, LONG_MAX);
+	spin_lock(&space_info->lock);
+	orig_tickets_id = space_info->tickets_id;
+	spin_unlock(&space_info->lock);
+
+	while ((delalloc_bytes || ordered_bytes) && to_reclaim) {
+		u64 iter_reclaim = min_t(u64, to_reclaim, SZ_128M);
+		long nr_pages = min_t(u64, delalloc_bytes, iter_reclaim) >> PAGE_SHIFT;
 		int async_pages;
 
+		items = calc_reclaim_items_nr(fs_info, iter_reclaim) * 2;
+
 		btrfs_start_delalloc_roots(fs_info, nr_pages, true);
 
 		/*
@@ -811,7 +813,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
 			   atomic_read(&fs_info->async_delalloc_pages) <=
 			   async_pages);
 skip_async:
-		loops++;
+		to_reclaim -= iter_reclaim;
 		if (wait_ordered && !trans) {
 			btrfs_wait_ordered_roots(fs_info, items, NULL);
 		} else {
@@ -834,6 +836,15 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
 			spin_unlock(&space_info->lock);
 			break;
 		}
+		/*
+		 * If a ticket was satisfied since we started, break out
+		 * so the async reclaim state machine can process delayed
+		 * refs before we flush more delalloc.
+		 */
+		if (space_info->tickets_id != orig_tickets_id) {
+			spin_unlock(&space_info->lock);
+			break;
+		}
 		spin_unlock(&space_info->lock);
 
 		delalloc_bytes = percpu_counter_sum_positive(
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/5] btrfs: reserve space for delayed_refs in delalloc
  2026-04-03 22:27 ` [PATCH v2 1/5] btrfs: reserve space for delayed_refs in delalloc Boris Burkov
@ 2026-04-06 17:20   ` Filipe Manana
  2026-04-06 17:40     ` Boris Burkov
  0 siblings, 1 reply; 12+ messages in thread
From: Filipe Manana @ 2026-04-06 17:20 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Fri, Apr 3, 2026 at 11:29 PM Boris Burkov <boris@bur.io> wrote:
>
> delalloc uses a per-inode block_rsv to perform metadata reservations for
> the cow operations it anticipates based on the number of outstanding
> extents. This calculation is done based on inode->outstanding_extents in
> btrfs_calculate_inode_block_rsv_size(). The reservation is *not*
> meticulously tracked as each ordered_extent is actually created in
> writeback, but rather delalloc attempts to over-estimate and the
> writeback and ordered_extent finish portions are responsible to release
> all the reservation.
>
> However, there is a notable gap in this reservation, it reserves no
> space for the resulting delayed_refs. If you compare to how
> btrfs_start_transaction() reservations work, this is a noteable
> difference.
>
> As writeback actually occurs, and we trigger btrfs_finish_one_ordered(),
> that function will start generating delayed refs, which will draw from
> the trans_handle's delayed_refs_rsv via btrfs_update_delayed_refs_rsv():
>
> btrfs_finish_one_ordered()
>   insert_ordered_extent_file_extent()
>     insert_reserved_file_extent()
>       btrfs_alloc_reserved_file_extent()
>         btrfs_add_delayed_data_ref()
>           add_delayed_ref()
>             btrfs_update_delayed_refs_rsv();
>
> This trans_handle was created in finish_one_ordered() with
> btrfs_join_transaction() which calls start_transaction with
> num_items=0 and BTRFS_RESERVE_NO_FLUSH. As a result, this trans_handle
> has no reserved in h->delayed_rsv, as neither the num_items reservation
> nor the btrfs_delayed_refs_rsv_refill() reservation is run.
>
> Thus, when btrfs_update_delayed_refs_rsv() runs, reserved_bytes is 0 and
> fs_info->delayed_rsv->size grows but not fs_info->delayed_rsv->reserved.
>
> If a large amount of writeback happens all at once (perhaps due to
> dirty_ratio being tuned too high), this results in, among other things,
> erroneous assessments of the amount of delayed_refs reserved in the
> metadata space reclaim logic, like need_preemptive_reclaim() which
> relies on fs_info->delayed_rsv->reserved and even worse, poor decision
> making in btrfs_preempt_reclaim_metadata_space() which counts
> delalloc_bytes like so:
>
>   block_rsv_size = global_rsv_size +
>           btrfs_block_rsv_reserved(delayed_block_rsv) +
>           btrfs_block_rsv_reserved(delayed_refs_rsv) +
>           btrfs_block_rsv_reserved(trans_rsv);
>   delalloc_size = bytes_may_use - block_rsv_size;
>
> So all that lost delayed refs usage gets accounted as delalloc_size and
> leads to preemptive reclaim continuously choosing FLUSH_DELALLOC, which
> further exacerbates the problem.
>
> With enough writeback around, we can run enough delalloc that we get
> into async reclaim which starts blocking start_transaction() and
> eventually hits FLUSH_DELALLOC_WAIT/FLUSH_DELALLOC_FULL at which point
> the filesystem gets heavily blocked on metadata space in reserve_space(),
> blocking all new transaction work until all the ordered_extents finish.
>
> If we had an accurate view of the reservation for delayed refs, then we
> could mostly break this feedback loop in preemptive reclaim, and
> generally would be able to make more accurate decisions with regards to
> metadata space reclamation.
>
> This patch adds extra metadata reservation to the inode's block_rsv to
> account for the delayed refs. When the ordered_extent finishes and we
> are about to do work in the transaction that uses delayed refs, we
> migrate enough for 1 extent. Since this is not necessarily perfect, we
> have to be careful and do a "soft" migrate which succeeds even if there
> is not enough reservation. This is strictly better than what we have and
> also matches how the delayed ref rsv gets used in the transaction at
> btrfs_update_delayed_refs_rsv()
>
> This is not a perfect fix for the most pathological cases, but is the
> infrastructure needed to keep working on the problem.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/delalloc-space.c | 43 ++++++++++++++++++++++++++++++++++++---
>  fs/btrfs/delalloc-space.h |  3 +++
>  fs/btrfs/inode.c          |  5 ++++-
>  fs/btrfs/transaction.c    | 36 ++++++++++++++------------------
>  4 files changed, 62 insertions(+), 25 deletions(-)
>
> diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> index 0970799d0aa4..a087be55409f 100644
> --- a/fs/btrfs/delalloc-space.c
> +++ b/fs/btrfs/delalloc-space.c
> @@ -3,11 +3,13 @@
>  #include "messages.h"
>  #include "ctree.h"
>  #include "delalloc-space.h"
> +#include "delayed-ref.h"
>  #include "block-rsv.h"
>  #include "btrfs_inode.h"
>  #include "space-info.h"
>  #include "qgroup.h"
>  #include "fs.h"
> +#include "transaction.h"
>
>  /*
>   * HOW DOES THIS WORK
> @@ -240,6 +242,7 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
>         if (released > 0)
>                 trace_btrfs_space_reservation(fs_info, "delalloc",
>                                               btrfs_ino(inode), released, 0);
> +

Stray new line change. No other surrounding code changed.

>         if (qgroup_free)
>                 btrfs_qgroup_free_meta_prealloc(inode->root, qgroup_to_release);
>         else
> @@ -247,6 +250,16 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
>                                                    qgroup_to_release);
>  }
>
> +/*
> + * Each delalloc extent could become an ordered_extent and end up inserting a
> + * new extent into the extent and free space trees. So we must reserve
> + * delayed ref space up front for that.

Ok, so that's 2 delayed refs, one to modify the extent tree and
another to modify the free space tree.
So below, techically we should account for nr_extents * 2.

And we actually need to modify more trees: the csum tree if we have
csums to insert and the new raid stripe tree.

> + */
> +static u64 delalloc_calc_delayed_refs_rsv(const struct btrfs_fs_info *fs_info, u64 nr_extents)
> +{
> +       return btrfs_calc_delayed_ref_bytes(fs_info, nr_extents);
> +}
> +
>  static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
>                                                  struct btrfs_inode *inode)
>  {
> @@ -266,6 +279,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
>                 reserve_size = btrfs_calc_insert_metadata_size(fs_info,
>                                                 outstanding_extents);
>                 reserve_size += btrfs_calc_metadata_size(fs_info, 1);
> +               reserve_size += delalloc_calc_delayed_refs_rsv(fs_info, outstanding_extents);
>         }
>         if (!(inode->flags & BTRFS_INODE_NODATASUM)) {
>                 u64 csum_leaves;
> @@ -289,7 +303,8 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
>
>  static void calc_inode_reservations(struct btrfs_inode *inode,
>                                     u64 num_bytes, u64 disk_num_bytes,
> -                                   u64 *meta_reserve, u64 *qgroup_reserve)
> +                                   u64 *meta_reserve,
> +                                   u64 *qgroup_reserve)

Another style change only, without any functional changes.

>  {
>         struct btrfs_fs_info *fs_info = inode->root->fs_info;
>         u64 nr_extents = count_max_extents(fs_info, num_bytes);
> @@ -309,9 +324,31 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
>          * for an inode update.
>          */
>         *meta_reserve += inode_update;
> +
> +       *meta_reserve += delalloc_calc_delayed_refs_rsv(fs_info, nr_extents);
> +
>         *qgroup_reserve = nr_extents * fs_info->nodesize;
>  }
>
> +u64 btrfs_delalloc_migrate_delayed_refs_rsv(struct btrfs_trans_handle *trans,
> +                                           struct btrfs_inode *inode)
> +{
> +       struct btrfs_block_rsv *inode_rsv = &inode->block_rsv;
> +       struct btrfs_block_rsv *trans_rsv = &trans->delayed_rsv;
> +       u64 num_bytes = delalloc_calc_delayed_refs_rsv(trans->fs_info, 1);
> +
> +       spin_lock(&inode_rsv->lock);
> +       num_bytes = min(num_bytes, inode_rsv->reserved);
> +       inode_rsv->reserved -= num_bytes;
> +       inode_rsv->full = (inode_rsv->reserved >= inode_rsv->size);
> +       spin_unlock(&inode_rsv->lock);
> +
> +       btrfs_block_rsv_add_bytes(trans_rsv, num_bytes, true);
> +       trans->delayed_refs_bytes_reserved += num_bytes;
> +
> +       return num_bytes;

Why return the number of bytes?
None of the callers cares about the return value, it can be just void.

> +}
> +
>  int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
>                                     u64 disk_num_bytes, bool noflush)
>  {
> @@ -358,8 +395,8 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
>                                                  noflush);
>         if (ret)
>                 return ret;
> -       ret = btrfs_reserve_metadata_bytes(block_rsv->space_info, meta_reserve,
> -                                          flush);
> +       ret = btrfs_reserve_metadata_bytes(block_rsv->space_info,
> +                                          meta_reserve, flush);

Yet another style change, without any surrounding functional changes.

>         if (ret) {
>                 btrfs_qgroup_free_meta_prealloc(root, qgroup_reserve);
>                 return ret;
> diff --git a/fs/btrfs/delalloc-space.h b/fs/btrfs/delalloc-space.h
> index 6119c0d3f883..b058e884879b 100644
> --- a/fs/btrfs/delalloc-space.h
> +++ b/fs/btrfs/delalloc-space.h
> @@ -8,6 +8,7 @@
>  struct extent_changeset;
>  struct btrfs_inode;
>  struct btrfs_fs_info;
> +struct btrfs_trans_handle;
>
>  int btrfs_alloc_data_chunk_ondemand(const struct btrfs_inode *inode, u64 bytes);
>  int btrfs_check_data_free_space(struct btrfs_inode *inode,
> @@ -27,5 +28,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
>                                     u64 disk_num_bytes, bool noflush);
>  void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
>  void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len);
> +u64 btrfs_delalloc_migrate_delayed_refs_rsv(struct btrfs_trans_handle *trans,
> +                                           struct btrfs_inode *inode);
>
>  #endif /* BTRFS_DELALLOC_SPACE_H */
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 808e52aa6ef2..c57473030814 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -653,6 +653,7 @@ static noinline int __cow_file_range_inline(struct btrfs_inode *inode,
>                 goto out;
>         }
>         trans->block_rsv = &inode->block_rsv;
> +       btrfs_delalloc_migrate_delayed_refs_rsv(trans, inode);
>
>         drop_args.path = path;
>         drop_args.start = 0;
> @@ -3297,6 +3298,7 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
>                 }
>         } else {
>                 BUG_ON(root == fs_info->tree_root);
> +               btrfs_delalloc_migrate_delayed_refs_rsv(trans, inode);

So for the other case, writing into a prealloc extent, we should also
migrate space space because we will update the fs tree to modify a
file extent item (the call to btrfs_mark_extent_written()) and that
can generate delayed refs if we need to COW blocks from the fs tree
for that.

>                 ret = insert_ordered_extent_file_extent(trans, ordered_extent);
>                 if (unlikely(ret < 0)) {
>                         btrfs_abort_transaction(trans, ret);
> @@ -8074,9 +8076,10 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
>
>         spin_lock_init(&ei->lock);
>         ei->outstanding_extents = 0;
> -       if (sb->s_magic != BTRFS_TEST_MAGIC)
> +       if (sb->s_magic != BTRFS_TEST_MAGIC) {
>                 btrfs_init_metadata_block_rsv(fs_info, &ei->block_rsv,
>                                               BTRFS_BLOCK_RSV_DELALLOC);
> +       }

Another style change and this one goes against our style preference -
no { } for single statement if.

This is just a quick pass over the change, I'll look into the
following patches soon.

>         ei->runtime_flags = 0;
>         ei->prop_compress = BTRFS_COMPRESS_NONE;
>         ei->defrag_compress = BTRFS_COMPRESS_NONE;
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 248adb785051..55791bb100a2 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1047,29 +1047,23 @@ static void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans)
>                 return;
>         }
>
> -       if (!trans->bytes_reserved) {
> -               ASSERT(trans->delayed_refs_bytes_reserved == 0,
> -                      "trans->delayed_refs_bytes_reserved=%llu",
> -                      trans->delayed_refs_bytes_reserved);
> -               return;
> +       if (trans->bytes_reserved) {
> +               ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
> +               trace_btrfs_space_reservation(fs_info, "transaction",
> +                                       trans->transid, trans->bytes_reserved, 0);
> +               btrfs_block_rsv_release(fs_info, trans->block_rsv,
> +                                       trans->bytes_reserved, NULL);
> +               trans->bytes_reserved = 0;
>         }
>
> -       ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
> -       trace_btrfs_space_reservation(fs_info, "transaction",
> -                                     trans->transid, trans->bytes_reserved, 0);
> -       btrfs_block_rsv_release(fs_info, trans->block_rsv,
> -                               trans->bytes_reserved, NULL);
> -       trans->bytes_reserved = 0;
> -
> -       if (!trans->delayed_refs_bytes_reserved)
> -               return;
> -
> -       trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
> -                                     trans->transid,
> -                                     trans->delayed_refs_bytes_reserved, 0);
> -       btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
> -                               trans->delayed_refs_bytes_reserved, NULL);
> -       trans->delayed_refs_bytes_reserved = 0;
> +       if (trans->delayed_refs_bytes_reserved) {
> +               trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
> +                                       trans->transid,
> +                                       trans->delayed_refs_bytes_reserved, 0);
> +               btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
> +                                       trans->delayed_refs_bytes_reserved, NULL);
> +               trans->delayed_refs_bytes_reserved = 0;
> +       }
>  }
>
>  static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
> --
> 2.53.0
>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 2/5] btrfs: account for csum delayed_refs in delalloc
  2026-04-03 22:27 ` [PATCH v2 2/5] btrfs: account for csum " Boris Burkov
@ 2026-04-06 17:33   ` Filipe Manana
  0 siblings, 0 replies; 12+ messages in thread
From: Filipe Manana @ 2026-04-06 17:33 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Fri, Apr 3, 2026 at 11:29 PM Boris Burkov <boris@bur.io> wrote:
>
> As ordered_extents complete, they not only produce direct delayed refs,
> they also add csums via add_pending_csums. This produces some number of
> metadata delayed_refs as the csum tree is cow-ed. These refs are counted
> against the trans_handle delayed_rsv, thanks to trans->adding_csums.
>
> As a result, just like we account for the extent tree and free space
> tree when reserving delayed_refs for delalloc, we must also reserve for
> the csum tree. This is mirrored by the non-delayed-ref metadata
> reservation already accounting for csums.
>
> This serves to ensure we have a proper worst case estimate for
> delayed_rsv from delalloc.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/delalloc-space.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> index a087be55409f..c2e5c7758ed5 100644
> --- a/fs/btrfs/delalloc-space.c
> +++ b/fs/btrfs/delalloc-space.c
> @@ -252,12 +252,19 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
>
>  /*
>   * Each delalloc extent could become an ordered_extent and end up inserting a
> - * new extent into the extent and free space trees. So we must reserve
> - * delayed ref space up front for that.
> + * new extent into the extent and free space trees. It also may end up writing
> + * csums, so we must reserve delayed ref space up front for all of that.

Ok, this patch addresses my comment in the previous patch regarding
the missing accounting for csum tree updates.

But it lacks accounting for new delayed refs when COWing blocks from:

1) A subvolume tree (insert a new file extent item, or modify an
existing one in case of prealloc writes).

2) Raid stripe tree.


>   */
>  static u64 delalloc_calc_delayed_refs_rsv(const struct btrfs_fs_info *fs_info, u64 nr_extents)
>  {
> -       return btrfs_calc_delayed_ref_bytes(fs_info, nr_extents);
> +       /*
> +        * btrfs_calc_delayed_ref_bytes() does not assume csums as it is
> +        * also called from metadata writing contexts. So the extra
> +        * metadata insertion reservation represents the possibility of
> +        * delayed refs during csum tree writes.
> +        */
> +       return btrfs_calc_delayed_ref_bytes(fs_info, nr_extents) +
> +               btrfs_calc_insert_metadata_size(fs_info, nr_extents);

We can also skip the reservation for the csum tree if we are writing
into a nocow/nosum inode.
Check the rest of delalloc-space.c: there's an optimization I did
years ago to reduce space reservations in case the inode has the
BTRFS_INODE_NODATASUM flag set.

Thanks.

>  }
>
>  static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> --
> 2.53.0
>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/5] btrfs: reserve space for delayed_refs in delalloc
  2026-04-06 17:20   ` Filipe Manana
@ 2026-04-06 17:40     ` Boris Burkov
  0 siblings, 0 replies; 12+ messages in thread
From: Boris Burkov @ 2026-04-06 17:40 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, kernel-team

On Mon, Apr 06, 2026 at 06:20:39PM +0100, Filipe Manana wrote:
> On Fri, Apr 3, 2026 at 11:29 PM Boris Burkov <boris@bur.io> wrote:
> >
> > delalloc uses a per-inode block_rsv to perform metadata reservations for
> > the cow operations it anticipates based on the number of outstanding
> > extents. This calculation is done based on inode->outstanding_extents in
> > btrfs_calculate_inode_block_rsv_size(). The reservation is *not*
> > meticulously tracked as each ordered_extent is actually created in
> > writeback, but rather delalloc attempts to over-estimate and the
> > writeback and ordered_extent finish portions are responsible to release
> > all the reservation.
> >
> > However, there is a notable gap in this reservation, it reserves no
> > space for the resulting delayed_refs. If you compare to how
> > btrfs_start_transaction() reservations work, this is a noteable
> > difference.
> >
> > As writeback actually occurs, and we trigger btrfs_finish_one_ordered(),
> > that function will start generating delayed refs, which will draw from
> > the trans_handle's delayed_refs_rsv via btrfs_update_delayed_refs_rsv():
> >
> > btrfs_finish_one_ordered()
> >   insert_ordered_extent_file_extent()
> >     insert_reserved_file_extent()
> >       btrfs_alloc_reserved_file_extent()
> >         btrfs_add_delayed_data_ref()
> >           add_delayed_ref()
> >             btrfs_update_delayed_refs_rsv();
> >
> > This trans_handle was created in finish_one_ordered() with
> > btrfs_join_transaction() which calls start_transaction with
> > num_items=0 and BTRFS_RESERVE_NO_FLUSH. As a result, this trans_handle
> > has no reserved in h->delayed_rsv, as neither the num_items reservation
> > nor the btrfs_delayed_refs_rsv_refill() reservation is run.
> >
> > Thus, when btrfs_update_delayed_refs_rsv() runs, reserved_bytes is 0 and
> > fs_info->delayed_rsv->size grows but not fs_info->delayed_rsv->reserved.
> >
> > If a large amount of writeback happens all at once (perhaps due to
> > dirty_ratio being tuned too high), this results in, among other things,
> > erroneous assessments of the amount of delayed_refs reserved in the
> > metadata space reclaim logic, like need_preemptive_reclaim() which
> > relies on fs_info->delayed_rsv->reserved and even worse, poor decision
> > making in btrfs_preempt_reclaim_metadata_space() which counts
> > delalloc_bytes like so:
> >
> >   block_rsv_size = global_rsv_size +
> >           btrfs_block_rsv_reserved(delayed_block_rsv) +
> >           btrfs_block_rsv_reserved(delayed_refs_rsv) +
> >           btrfs_block_rsv_reserved(trans_rsv);
> >   delalloc_size = bytes_may_use - block_rsv_size;
> >
> > So all that lost delayed refs usage gets accounted as delalloc_size and
> > leads to preemptive reclaim continuously choosing FLUSH_DELALLOC, which
> > further exacerbates the problem.
> >
> > With enough writeback around, we can run enough delalloc that we get
> > into async reclaim which starts blocking start_transaction() and
> > eventually hits FLUSH_DELALLOC_WAIT/FLUSH_DELALLOC_FULL at which point
> > the filesystem gets heavily blocked on metadata space in reserve_space(),
> > blocking all new transaction work until all the ordered_extents finish.
> >
> > If we had an accurate view of the reservation for delayed refs, then we
> > could mostly break this feedback loop in preemptive reclaim, and
> > generally would be able to make more accurate decisions with regards to
> > metadata space reclamation.
> >
> > This patch adds extra metadata reservation to the inode's block_rsv to
> > account for the delayed refs. When the ordered_extent finishes and we
> > are about to do work in the transaction that uses delayed refs, we
> > migrate enough for 1 extent. Since this is not necessarily perfect, we
> > have to be careful and do a "soft" migrate which succeeds even if there
> > is not enough reservation. This is strictly better than what we have and
> > also matches how the delayed ref rsv gets used in the transaction at
> > btrfs_update_delayed_refs_rsv()
> >
> > This is not a perfect fix for the most pathological cases, but is the
> > infrastructure needed to keep working on the problem.
> >
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/delalloc-space.c | 43 ++++++++++++++++++++++++++++++++++++---
> >  fs/btrfs/delalloc-space.h |  3 +++
> >  fs/btrfs/inode.c          |  5 ++++-
> >  fs/btrfs/transaction.c    | 36 ++++++++++++++------------------
> >  4 files changed, 62 insertions(+), 25 deletions(-)
> >
> > diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> > index 0970799d0aa4..a087be55409f 100644
> > --- a/fs/btrfs/delalloc-space.c
> > +++ b/fs/btrfs/delalloc-space.c
> > @@ -3,11 +3,13 @@
> >  #include "messages.h"
> >  #include "ctree.h"
> >  #include "delalloc-space.h"
> > +#include "delayed-ref.h"
> >  #include "block-rsv.h"
> >  #include "btrfs_inode.h"
> >  #include "space-info.h"
> >  #include "qgroup.h"
> >  #include "fs.h"
> > +#include "transaction.h"
> >
> >  /*
> >   * HOW DOES THIS WORK
> > @@ -240,6 +242,7 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
> >         if (released > 0)
> >                 trace_btrfs_space_reservation(fs_info, "delalloc",
> >                                               btrfs_ino(inode), released, 0);
> > +
> 
> Stray new line change. No other surrounding code changed.
> 
> >         if (qgroup_free)
> >                 btrfs_qgroup_free_meta_prealloc(inode->root, qgroup_to_release);
> >         else
> > @@ -247,6 +250,16 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
> >                                                    qgroup_to_release);
> >  }
> >
> > +/*
> > + * Each delalloc extent could become an ordered_extent and end up inserting a
> > + * new extent into the extent and free space trees. So we must reserve
> > + * delayed ref space up front for that.
> 
> Ok, so that's 2 delayed refs, one to modify the extent tree and
> another to modify the free space tree.
> So below, techically we should account for nr_extents * 2.
> 
> And we actually need to modify more trees: the csum tree if we have
> csums to insert and the new raid stripe tree.
> 
> > + */
> > +static u64 delalloc_calc_delayed_refs_rsv(const struct btrfs_fs_info *fs_info, u64 nr_extents)
> > +{
> > +       return btrfs_calc_delayed_ref_bytes(fs_info, nr_extents);
> > +}
> > +
> >  static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> >                                                  struct btrfs_inode *inode)
> >  {
> > @@ -266,6 +279,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> >                 reserve_size = btrfs_calc_insert_metadata_size(fs_info,
> >                                                 outstanding_extents);
> >                 reserve_size += btrfs_calc_metadata_size(fs_info, 1);
> > +               reserve_size += delalloc_calc_delayed_refs_rsv(fs_info, outstanding_extents);
> >         }
> >         if (!(inode->flags & BTRFS_INODE_NODATASUM)) {
> >                 u64 csum_leaves;
> > @@ -289,7 +303,8 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> >
> >  static void calc_inode_reservations(struct btrfs_inode *inode,
> >                                     u64 num_bytes, u64 disk_num_bytes,
> > -                                   u64 *meta_reserve, u64 *qgroup_reserve)
> > +                                   u64 *meta_reserve,
> > +                                   u64 *qgroup_reserve)
> 
> Another style change only, without any functional changes.
> 
> >  {
> >         struct btrfs_fs_info *fs_info = inode->root->fs_info;
> >         u64 nr_extents = count_max_extents(fs_info, num_bytes);
> > @@ -309,9 +324,31 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
> >          * for an inode update.
> >          */
> >         *meta_reserve += inode_update;
> > +
> > +       *meta_reserve += delalloc_calc_delayed_refs_rsv(fs_info, nr_extents);
> > +
> >         *qgroup_reserve = nr_extents * fs_info->nodesize;
> >  }
> >
> > +u64 btrfs_delalloc_migrate_delayed_refs_rsv(struct btrfs_trans_handle *trans,
> > +                                           struct btrfs_inode *inode)
> > +{
> > +       struct btrfs_block_rsv *inode_rsv = &inode->block_rsv;
> > +       struct btrfs_block_rsv *trans_rsv = &trans->delayed_rsv;
> > +       u64 num_bytes = delalloc_calc_delayed_refs_rsv(trans->fs_info, 1);
> > +
> > +       spin_lock(&inode_rsv->lock);
> > +       num_bytes = min(num_bytes, inode_rsv->reserved);
> > +       inode_rsv->reserved -= num_bytes;
> > +       inode_rsv->full = (inode_rsv->reserved >= inode_rsv->size);
> > +       spin_unlock(&inode_rsv->lock);
> > +
> > +       btrfs_block_rsv_add_bytes(trans_rsv, num_bytes, true);
> > +       trans->delayed_refs_bytes_reserved += num_bytes;
> > +
> > +       return num_bytes;
> 
> Why return the number of bytes?
> None of the callers cares about the return value, it can be just void.
> 
> > +}
> > +
> >  int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> >                                     u64 disk_num_bytes, bool noflush)
> >  {
> > @@ -358,8 +395,8 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> >                                                  noflush);
> >         if (ret)
> >                 return ret;
> > -       ret = btrfs_reserve_metadata_bytes(block_rsv->space_info, meta_reserve,
> > -                                          flush);
> > +       ret = btrfs_reserve_metadata_bytes(block_rsv->space_info,
> > +                                          meta_reserve, flush);
> 
> Yet another style change, without any surrounding functional changes.
> 
> >         if (ret) {
> >                 btrfs_qgroup_free_meta_prealloc(root, qgroup_reserve);
> >                 return ret;
> > diff --git a/fs/btrfs/delalloc-space.h b/fs/btrfs/delalloc-space.h
> > index 6119c0d3f883..b058e884879b 100644
> > --- a/fs/btrfs/delalloc-space.h
> > +++ b/fs/btrfs/delalloc-space.h
> > @@ -8,6 +8,7 @@
> >  struct extent_changeset;
> >  struct btrfs_inode;
> >  struct btrfs_fs_info;
> > +struct btrfs_trans_handle;
> >
> >  int btrfs_alloc_data_chunk_ondemand(const struct btrfs_inode *inode, u64 bytes);
> >  int btrfs_check_data_free_space(struct btrfs_inode *inode,
> > @@ -27,5 +28,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
> >                                     u64 disk_num_bytes, bool noflush);
> >  void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
> >  void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len);
> > +u64 btrfs_delalloc_migrate_delayed_refs_rsv(struct btrfs_trans_handle *trans,
> > +                                           struct btrfs_inode *inode);
> >
> >  #endif /* BTRFS_DELALLOC_SPACE_H */
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 808e52aa6ef2..c57473030814 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -653,6 +653,7 @@ static noinline int __cow_file_range_inline(struct btrfs_inode *inode,
> >                 goto out;
> >         }
> >         trans->block_rsv = &inode->block_rsv;
> > +       btrfs_delalloc_migrate_delayed_refs_rsv(trans, inode);
> >
> >         drop_args.path = path;
> >         drop_args.start = 0;
> > @@ -3297,6 +3298,7 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
> >                 }
> >         } else {
> >                 BUG_ON(root == fs_info->tree_root);
> > +               btrfs_delalloc_migrate_delayed_refs_rsv(trans, inode);
> 
> So for the other case, writing into a prealloc extent, we should also
> migrate space space because we will update the fs tree to modify a
> file extent item (the call to btrfs_mark_extent_written()) and that
> can generate delayed refs if we need to COW blocks from the fs tree
> for that.
> 
> >                 ret = insert_ordered_extent_file_extent(trans, ordered_extent);
> >                 if (unlikely(ret < 0)) {
> >                         btrfs_abort_transaction(trans, ret);
> > @@ -8074,9 +8076,10 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
> >
> >         spin_lock_init(&ei->lock);
> >         ei->outstanding_extents = 0;
> > -       if (sb->s_magic != BTRFS_TEST_MAGIC)
> > +       if (sb->s_magic != BTRFS_TEST_MAGIC) {
> >                 btrfs_init_metadata_block_rsv(fs_info, &ei->block_rsv,
> >                                               BTRFS_BLOCK_RSV_DELALLOC);
> > +       }
> 
> Another style change and this one goes against our style preference -
> no { } for single statement if.
> 
> This is just a quick pass over the change, I'll look into the
> following patches soon.
> 

Sorry for the leaked style changes. Those were places I modified in the
first try using a new block_rsv and I didn't properly undo them while
changing it to the new migration design. I should have double checked
the full patch output for silly stuff like that.

Thanks for the review,
Boris

> >         ei->runtime_flags = 0;
> >         ei->prop_compress = BTRFS_COMPRESS_NONE;
> >         ei->defrag_compress = BTRFS_COMPRESS_NONE;
> > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > index 248adb785051..55791bb100a2 100644
> > --- a/fs/btrfs/transaction.c
> > +++ b/fs/btrfs/transaction.c
> > @@ -1047,29 +1047,23 @@ static void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans)
> >                 return;
> >         }
> >
> > -       if (!trans->bytes_reserved) {
> > -               ASSERT(trans->delayed_refs_bytes_reserved == 0,
> > -                      "trans->delayed_refs_bytes_reserved=%llu",
> > -                      trans->delayed_refs_bytes_reserved);
> > -               return;
> > +       if (trans->bytes_reserved) {
> > +               ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
> > +               trace_btrfs_space_reservation(fs_info, "transaction",
> > +                                       trans->transid, trans->bytes_reserved, 0);
> > +               btrfs_block_rsv_release(fs_info, trans->block_rsv,
> > +                                       trans->bytes_reserved, NULL);
> > +               trans->bytes_reserved = 0;
> >         }
> >
> > -       ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
> > -       trace_btrfs_space_reservation(fs_info, "transaction",
> > -                                     trans->transid, trans->bytes_reserved, 0);
> > -       btrfs_block_rsv_release(fs_info, trans->block_rsv,
> > -                               trans->bytes_reserved, NULL);
> > -       trans->bytes_reserved = 0;
> > -
> > -       if (!trans->delayed_refs_bytes_reserved)
> > -               return;
> > -
> > -       trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
> > -                                     trans->transid,
> > -                                     trans->delayed_refs_bytes_reserved, 0);
> > -       btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
> > -                               trans->delayed_refs_bytes_reserved, NULL);
> > -       trans->delayed_refs_bytes_reserved = 0;
> > +       if (trans->delayed_refs_bytes_reserved) {
> > +               trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
> > +                                       trans->transid,
> > +                                       trans->delayed_refs_bytes_reserved, 0);
> > +               btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
> > +                                       trans->delayed_refs_bytes_reserved, NULL);
> > +               trans->delayed_refs_bytes_reserved = 0;
> > +       }
> >  }
> >
> >  static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
> > --
> > 2.53.0
> >
> >

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 3/5] btrfs: account for compression in delalloc extent reservation
  2026-04-03 22:27 ` [PATCH v2 3/5] btrfs: account for compression in delalloc extent reservation Boris Burkov
@ 2026-04-06 17:44   ` Filipe Manana
  0 siblings, 0 replies; 12+ messages in thread
From: Filipe Manana @ 2026-04-06 17:44 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Fri, Apr 3, 2026 at 11:30 PM Boris Burkov <boris@bur.io> wrote:
>
> The btrfs maximum uncompressed extent size is 128MiB. The maximum
> compressed extent size in file extent space is 128KiB. Therefore, the
> estimate for outstanding_extents is off by 3 orders of magnitude when
> COMPRESS_FORCE is set or the inode is set to always compress.
>
> Because we use re-calculation when necessary, rather than super detailed
> extent tracking, we don't grow this reservation as the true number of
> extents is revealed. We don't want to be too clever with it, however, as
> we don't want the calculation to change for a given inode between
> reservation and release, so we only rely on the forcing type flags.
>
> With this change, we no longer under-reserve delayed refs reservations
> for delalloc writes, even with compress-force.
>
> Because this would turn count_max_extents() into a named shim for
> div_u64(size + max_extent_size - 1, max_extent_size);
> we can just get rid of it.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/btrfs_inode.h    |  3 ++
>  fs/btrfs/delalloc-space.c | 13 ++++---
>  fs/btrfs/fs.h             | 13 -------
>  fs/btrfs/inode.c          | 78 ++++++++++++++++++++++++++++++++-------
>  4 files changed, 74 insertions(+), 33 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 55c272fe5d92..fad61b4d94a7 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -510,6 +510,9 @@ static inline bool btrfs_inode_can_compress(const struct btrfs_inode *inode)
>         return true;
>  }
>
> +u64 btrfs_inode_max_extent_size(const struct btrfs_inode *inode);
> +u64 btrfs_inode_max_extents(const struct btrfs_inode *inode, u64 size);
> +
>  static inline void btrfs_assert_inode_locked(struct btrfs_inode *inode)
>  {
>         /* Immediately trigger a crash if the inode is not locked. */
> diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> index c2e5c7758ed5..2c99e332e25b 100644
> --- a/fs/btrfs/delalloc-space.c
> +++ b/fs/btrfs/delalloc-space.c
> @@ -6,6 +6,7 @@
>  #include "delayed-ref.h"
>  #include "block-rsv.h"
>  #include "btrfs_inode.h"
> +#include "compression.h"
>  #include "space-info.h"
>  #include "qgroup.h"
>  #include "fs.h"
> @@ -65,7 +66,7 @@
>   *     This is the number of file extent items we'll need to handle all of the
>   *     outstanding DELALLOC space we have in this inode.  We limit the maximum
>   *     size of an extent, so a large contiguous dirty area may require more than
> - *     one outstanding_extent, which is why count_max_extents() is used to
> + *     one outstanding_extent, which is why we use the max extent size to
>   *     determine how many outstanding_extents get added.
>   *
>   *   ->csum_bytes
> @@ -314,7 +315,7 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
>                                     u64 *qgroup_reserve)
>  {
>         struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -       u64 nr_extents = count_max_extents(fs_info, num_bytes);
> +       u64 nr_extents = btrfs_inode_max_extents(inode, num_bytes);
>         u64 csum_leaves;
>         u64 inode_update = btrfs_calc_metadata_size(fs_info, 1);
>
> @@ -415,7 +416,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
>          * racing with an ordered completion or some such that would think it
>          * needs to free the reservation we just made.
>          */
> -       nr_extents = count_max_extents(fs_info, num_bytes);
> +       nr_extents = btrfs_inode_max_extents(inode, num_bytes);
>         spin_lock(&inode->lock);
>         btrfs_mod_outstanding_extents(inode, nr_extents);
>         if (!(inode->flags & BTRFS_INODE_NODATASUM))
> @@ -482,7 +483,7 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
>         unsigned num_extents;
>
>         spin_lock(&inode->lock);
> -       num_extents = count_max_extents(fs_info, num_bytes);
> +       num_extents = btrfs_inode_max_extents(inode, num_bytes);
>         btrfs_mod_outstanding_extents(inode, -num_extents);
>         btrfs_calculate_inode_block_rsv_size(fs_info, inode);
>         spin_unlock(&inode->lock);
> @@ -497,8 +498,8 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
>  void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len)
>  {
>         struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -       const u32 reserved_num_extents = count_max_extents(fs_info, reserved_len);
> -       const u32 new_num_extents = count_max_extents(fs_info, new_len);
> +       const u32 reserved_num_extents = btrfs_inode_max_extents(inode, reserved_len);
> +       const u32 new_num_extents = btrfs_inode_max_extents(inode, new_len);
>         const int diff_num_extents = new_num_extents - reserved_num_extents;
>
>         ASSERT(new_len <= reserved_len);
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index a4758d94b32e..2c1626155645 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -1051,19 +1051,6 @@ static inline bool btrfs_is_zoned(const struct btrfs_fs_info *fs_info)
>         return IS_ENABLED(CONFIG_BLK_DEV_ZONED) && fs_info->zone_size > 0;
>  }
>
> -/*
> - * Count how many fs_info->max_extent_size cover the @size
> - */
> -static inline u32 count_max_extents(const struct btrfs_fs_info *fs_info, u64 size)
> -{
> -#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> -       if (!fs_info)
> -               return div_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, BTRFS_MAX_EXTENT_SIZE);
> -#endif
> -
> -       return div_u64(size + fs_info->max_extent_size - 1, fs_info->max_extent_size);
> -}
> -
>  static inline unsigned int btrfs_blocks_per_folio(const struct btrfs_fs_info *fs_info,
>                                                   const struct folio *folio)
>  {
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index c57473030814..1e53ff65d1a7 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -747,6 +747,56 @@ static int add_async_extent(struct async_chunk *cow, u64 start, u64 ram_size,
>         return 0;
>  }
>
> +/*
> + * Check if compression will definitely be attempted for this inode based on
> + * mount options and inode properties.  Unlike inode_need_compress(), this does
> + * NOT run the compression heuristic or check range-specific conditions, so it
> + * is safe to call under locks (e.g. io_tree lock) and for reservation sizing.
> + *
> + * Only returns true for cases where BTRFS_INODE_NOCOMPRESS cannot be set at
> + * runtime (FORCE_COMPRESS and prop_compress), ensuring that the effective max
> + * extent size is stable across paired set/clear delalloc operations.
> + */
> +static inline bool inode_may_compress(const struct btrfs_inode *inode)
> +{
> +       if (!btrfs_inode_can_compress(inode))
> +               return false;
> +
> +       /* force compress always attempts compression */

Please follow our coding style: the first word is capitalized and
sentence should finish with punctuation.

> +       if (btrfs_test_opt(inode->root->fs_info, FORCE_COMPRESS))
> +               return true;
> +
> +       /* per-inode property: NOCOMPRESS cannot override this */

Same here.

> +       if (inode->prop_compress)
> +               return true;
> +
> +       return false;
> +}
> +
> +/*
> + * Return the effective maximum extent size for reservation accounting.
> + *
> + * When compression is guaranteed to be attempted (FORCE_COMPRESS or
> + * prop_compress), the compression path splits ranges into
> + * BTRFS_MAX_UNCOMPRESSED chunks, each producing an independent ordered
> + * extent.  Use that as the divisor instead of fs_info->max_extent_size
> + * to avoid severely undercounting outstanding extents.
> + */
> +u64 btrfs_inode_max_extent_size(const struct btrfs_inode *inode)

There's no need to export this, make it static since it's not used
outside inode.c.

Thanks.

> +{
> +       if (inode_may_compress(inode))
> +               return BTRFS_MAX_UNCOMPRESSED;
> +
> +       return inode->root->fs_info->max_extent_size;
> +}
> +
> +u64 btrfs_inode_max_extents(const struct btrfs_inode *inode, u64 size)
> +{
> +       u64 max_extent_size = btrfs_inode_max_extent_size(inode);
> +
> +       return div_u64(size + max_extent_size - 1, max_extent_size);
> +}
> +
>  /*
>   * Check if the inode needs to be submitted to compression, based on mount
>   * options, defragmentation, properties or heuristics.
> @@ -2459,8 +2509,8 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct folio *locked_fol
>  void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
>                                  struct extent_state *orig, u64 split)
>  {
> -       struct btrfs_fs_info *fs_info = inode->root->fs_info;
>         u64 size;
> +       u64 max_extent_size = btrfs_inode_max_extent_size(inode);
>
>         lockdep_assert_held(&inode->io_tree.lock);
>
> @@ -2469,8 +2519,8 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
>                 return;
>
>         size = orig->end - orig->start + 1;
> -       if (size > fs_info->max_extent_size) {
> -               u32 num_extents;
> +       if (size > max_extent_size) {
> +               u64 num_extents;
>                 u64 new_size;
>
>                 /*
> @@ -2478,10 +2528,10 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
>                  * applies here, just in reverse.
>                  */
>                 new_size = orig->end - split + 1;
> -               num_extents = count_max_extents(fs_info, new_size);
> +               num_extents = btrfs_inode_max_extents(inode, new_size);
>                 new_size = split - orig->start;
> -               num_extents += count_max_extents(fs_info, new_size);
> -               if (count_max_extents(fs_info, size) >= num_extents)
> +               num_extents += btrfs_inode_max_extents(inode, new_size);
> +               if (btrfs_inode_max_extents(inode, size) >= num_extents)
>                         return;
>         }
>
> @@ -2498,9 +2548,9 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
>  void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state *new,
>                                  struct extent_state *other)
>  {
> -       struct btrfs_fs_info *fs_info = inode->root->fs_info;
>         u64 new_size, old_size;
> -       u32 num_extents;
> +       u64 max_extent_size = btrfs_inode_max_extent_size(inode);
> +       u64 num_extents;
>
>         lockdep_assert_held(&inode->io_tree.lock);
>
> @@ -2514,7 +2564,7 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
>                 new_size = other->end - new->start + 1;
>
>         /* we're not bigger than the max, unreserve the space and go */
> -       if (new_size <= fs_info->max_extent_size) {
> +       if (new_size <= max_extent_size) {
>                 spin_lock(&inode->lock);
>                 btrfs_mod_outstanding_extents(inode, -1);
>                 spin_unlock(&inode->lock);
> @@ -2540,10 +2590,10 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
>          * this case.
>          */
>         old_size = other->end - other->start + 1;
> -       num_extents = count_max_extents(fs_info, old_size);
> +       num_extents = btrfs_inode_max_extents(inode, old_size);
>         old_size = new->end - new->start + 1;
> -       num_extents += count_max_extents(fs_info, old_size);
> -       if (count_max_extents(fs_info, new_size) >= num_extents)
> +       num_extents += btrfs_inode_max_extents(inode, old_size);
> +       if (btrfs_inode_max_extents(inode, new_size) >= num_extents)
>                 return;
>
>         spin_lock(&inode->lock);
> @@ -2616,7 +2666,7 @@ void btrfs_set_delalloc_extent(struct btrfs_inode *inode, struct extent_state *s
>         if (!(state->state & EXTENT_DELALLOC) && (bits & EXTENT_DELALLOC)) {
>                 u64 len = state->end + 1 - state->start;
>                 u64 prev_delalloc_bytes;
> -               u32 num_extents = count_max_extents(fs_info, len);
> +               u32 num_extents = btrfs_inode_max_extents(inode, len);
>
>                 spin_lock(&inode->lock);
>                 btrfs_mod_outstanding_extents(inode, num_extents);
> @@ -2662,7 +2712,7 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
>  {
>         struct btrfs_fs_info *fs_info = inode->root->fs_info;
>         u64 len = state->end + 1 - state->start;
> -       u32 num_extents = count_max_extents(fs_info, len);
> +       u32 num_extents = btrfs_inode_max_extents(inode, len);
>
>         lockdep_assert_held(&inode->io_tree.lock);
>
> --
> 2.53.0
>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 4/5] btrfs: make inode->outstanding_extents a u64
  2026-04-03 22:27 ` [PATCH v2 4/5] btrfs: make inode->outstanding_extents a u64 Boris Burkov
@ 2026-04-06 17:55   ` Filipe Manana
  0 siblings, 0 replies; 12+ messages in thread
From: Filipe Manana @ 2026-04-06 17:55 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Fri, Apr 3, 2026 at 11:30 PM Boris Burkov <boris@bur.io> wrote:
>
> The maximum file size is MAX_LFS_FILESIZE = (loff_t)LLONG_MAX
>
> As a result, the max extent size computation in btrfs has always been
> bounded above by LLONG_MAX / 128MiB, which is ~ 2^63 / 2^27. This has
> never fit in a u32. With the recent changes to also divide by 128KiB in
> compressed cases, that bound is even higher. Whether or not it is likely
> to happen, I think it is nice to try to capture the intent in the types,
> so change outstanding_extents to u64, and make mod_outstanding_extents
> try to capture some expectations around the size of its inputs.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/btrfs_inode.h       | 14 ++++++++++----
>  fs/btrfs/delalloc-space.c    | 19 +++++++++----------
>  fs/btrfs/inode.c             | 14 +++++++-------
>  fs/btrfs/ordered-data.c      |  4 ++--
>  fs/btrfs/tests/inode-tests.c | 18 +++++++++---------
>  include/trace/events/btrfs.h |  8 ++++----
>  6 files changed, 41 insertions(+), 36 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index fad61b4d94a7..259af3f6ebb5 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -180,7 +180,7 @@ struct btrfs_inode {
>          * items we think we'll end up using, and reserved_extents is the number
>          * of extent items we've reserved metadata for. Protected by 'lock'.
>          */
> -       unsigned outstanding_extents;
> +       u64 outstanding_extents;
>
>         /* used to order data wrt metadata */
>         spinlock_t ordered_tree_lock;
> @@ -429,14 +429,20 @@ static inline bool is_data_inode(const struct btrfs_inode *inode)
>  }
>
>  static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode,
> -                                                int mod)
> +                                                int mod, u64 nr_extents)
>  {
> +       s64 delta = mod * (s64)nr_extents;
> +
>         lockdep_assert_held(&inode->lock);
> -       inode->outstanding_extents += mod;
> +       ASSERT(mod == 1 || mod == -1);
> +       ASSERT(nr_extents <= S64_MAX);
> +       ASSERT(mod == -1 || inode->outstanding_extents <= U64_MAX - nr_extents);
> +       ASSERT(mod == 1 || inode->outstanding_extents >= nr_extents);

Please use extended assertions that print the checked values, so that
we can have useful output to debug if they fail.

Thanks.

> +       inode->outstanding_extents += delta;
>         if (btrfs_is_free_space_inode(inode))
>                 return;
>         trace_btrfs_inode_mod_outstanding_extents(inode->root, btrfs_ino(inode),
> -                                                 mod, inode->outstanding_extents);
> +                                                 delta, inode->outstanding_extents);
>  }
>
>  /*
> diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> index 2c99e332e25b..f4851cc7e2a8 100644
> --- a/fs/btrfs/delalloc-space.c
> +++ b/fs/btrfs/delalloc-space.c
> @@ -274,7 +274,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
>         struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
>         u64 reserve_size = 0;
>         u64 qgroup_rsv_size = 0;
> -       unsigned outstanding_extents;
> +       u64 outstanding_extents;
>
>         lockdep_assert_held(&inode->lock);
>         outstanding_extents = inode->outstanding_extents;
> @@ -301,7 +301,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
>          *
>          * This is overestimating in most cases.
>          */
> -       qgroup_rsv_size = (u64)outstanding_extents * fs_info->nodesize;
> +       qgroup_rsv_size = outstanding_extents * fs_info->nodesize;
>
>         spin_lock(&block_rsv->lock);
>         block_rsv->size = reserve_size;
> @@ -364,7 +364,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
>         struct btrfs_fs_info *fs_info = root->fs_info;
>         struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
>         u64 meta_reserve, qgroup_reserve;
> -       unsigned nr_extents;
> +       u64 nr_extents;
>         enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
>         int ret = 0;
>
> @@ -418,7 +418,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
>          */
>         nr_extents = btrfs_inode_max_extents(inode, num_bytes);
>         spin_lock(&inode->lock);
> -       btrfs_mod_outstanding_extents(inode, nr_extents);
> +       btrfs_mod_outstanding_extents(inode, 1, nr_extents);
>         if (!(inode->flags & BTRFS_INODE_NODATASUM))
>                 inode->csum_bytes += disk_num_bytes;
>         btrfs_calculate_inode_block_rsv_size(fs_info, inode);
> @@ -480,11 +480,11 @@ void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes,
>  void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
>  {
>         struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -       unsigned num_extents;
> +       u64 num_extents;
>
>         spin_lock(&inode->lock);
>         num_extents = btrfs_inode_max_extents(inode, num_bytes);
> -       btrfs_mod_outstanding_extents(inode, -num_extents);
> +       btrfs_mod_outstanding_extents(inode, -1, num_extents);
>         btrfs_calculate_inode_block_rsv_size(fs_info, inode);
>         spin_unlock(&inode->lock);
>
> @@ -498,16 +498,15 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
>  void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len)
>  {
>         struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -       const u32 reserved_num_extents = btrfs_inode_max_extents(inode, reserved_len);
> -       const u32 new_num_extents = btrfs_inode_max_extents(inode, new_len);
> -       const int diff_num_extents = new_num_extents - reserved_num_extents;
> +       const u64 reserved_num_extents = btrfs_inode_max_extents(inode, reserved_len);
> +       const u64 new_num_extents = btrfs_inode_max_extents(inode, new_len);
>
>         ASSERT(new_len <= reserved_len);
>         if (new_num_extents == reserved_num_extents)
>                 return;
>
>         spin_lock(&inode->lock);
> -       btrfs_mod_outstanding_extents(inode, diff_num_extents);
> +       btrfs_mod_outstanding_extents(inode, -1, reserved_num_extents - new_num_extents);
>         btrfs_calculate_inode_block_rsv_size(fs_info, inode);
>         spin_unlock(&inode->lock);
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 1e53ff65d1a7..965c2739d33b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2536,7 +2536,7 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
>         }
>
>         spin_lock(&inode->lock);
> -       btrfs_mod_outstanding_extents(inode, 1);
> +       btrfs_mod_outstanding_extents(inode, 1, 1);
>         spin_unlock(&inode->lock);
>  }
>
> @@ -2566,7 +2566,7 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
>         /* we're not bigger than the max, unreserve the space and go */
>         if (new_size <= max_extent_size) {
>                 spin_lock(&inode->lock);
> -               btrfs_mod_outstanding_extents(inode, -1);
> +               btrfs_mod_outstanding_extents(inode, -1, 1);
>                 spin_unlock(&inode->lock);
>                 return;
>         }
> @@ -2597,7 +2597,7 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
>                 return;
>
>         spin_lock(&inode->lock);
> -       btrfs_mod_outstanding_extents(inode, -1);
> +       btrfs_mod_outstanding_extents(inode, -1, 1);
>         spin_unlock(&inode->lock);
>  }
>
> @@ -2666,10 +2666,10 @@ void btrfs_set_delalloc_extent(struct btrfs_inode *inode, struct extent_state *s
>         if (!(state->state & EXTENT_DELALLOC) && (bits & EXTENT_DELALLOC)) {
>                 u64 len = state->end + 1 - state->start;
>                 u64 prev_delalloc_bytes;
> -               u32 num_extents = btrfs_inode_max_extents(inode, len);
> +               u64 num_extents = btrfs_inode_max_extents(inode, len);
>
>                 spin_lock(&inode->lock);
> -               btrfs_mod_outstanding_extents(inode, num_extents);
> +               btrfs_mod_outstanding_extents(inode, 1, num_extents);
>                 spin_unlock(&inode->lock);
>
>                 /* For sanity tests */
> @@ -2712,7 +2712,7 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
>  {
>         struct btrfs_fs_info *fs_info = inode->root->fs_info;
>         u64 len = state->end + 1 - state->start;
> -       u32 num_extents = btrfs_inode_max_extents(inode, len);
> +       u64 num_extents = btrfs_inode_max_extents(inode, len);
>
>         lockdep_assert_held(&inode->io_tree.lock);
>
> @@ -2732,7 +2732,7 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
>                 u64 new_delalloc_bytes;
>
>                 spin_lock(&inode->lock);
> -               btrfs_mod_outstanding_extents(inode, -num_extents);
> +               btrfs_mod_outstanding_extents(inode, -1, num_extents);
>                 spin_unlock(&inode->lock);
>
>                 /*
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index d39f1c49d1cf..14b49cb33bb0 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -223,7 +223,7 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
>          * smallest the extent is going to get.
>          */
>         spin_lock(&inode->lock);
> -       btrfs_mod_outstanding_extents(inode, 1);
> +       btrfs_mod_outstanding_extents(inode, 1, 1);
>         spin_unlock(&inode->lock);
>
>  out:
> @@ -655,7 +655,7 @@ void btrfs_remove_ordered_extent(struct btrfs_ordered_extent *entry)
>         btrfs_lockdep_acquire(fs_info, btrfs_trans_pending_ordered);
>         /* This is paired with alloc_ordered_extent(). */
>         spin_lock(&btrfs_inode->lock);
> -       btrfs_mod_outstanding_extents(btrfs_inode, -1);
> +       btrfs_mod_outstanding_extents(btrfs_inode, -1, 1);
>         spin_unlock(&btrfs_inode->lock);
>         if (root != fs_info->tree_root) {
>                 u64 release;
> diff --git a/fs/btrfs/tests/inode-tests.c b/fs/btrfs/tests/inode-tests.c
> index b04fbcaf0a1d..e63afbb9be2b 100644
> --- a/fs/btrfs/tests/inode-tests.c
> +++ b/fs/btrfs/tests/inode-tests.c
> @@ -931,7 +931,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
>         }
>         if (BTRFS_I(inode)->outstanding_extents != 1) {
>                 ret = -EINVAL;
> -               test_err("miscount, wanted 1, got %u",
> +               test_err("miscount, wanted 1, got %llu",
>                          BTRFS_I(inode)->outstanding_extents);
>                 goto out;
>         }
> @@ -946,7 +946,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
>         }
>         if (BTRFS_I(inode)->outstanding_extents != 2) {
>                 ret = -EINVAL;
> -               test_err("miscount, wanted 2, got %u",
> +               test_err("miscount, wanted 2, got %llu",
>                          BTRFS_I(inode)->outstanding_extents);
>                 goto out;
>         }
> @@ -962,7 +962,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
>         }
>         if (BTRFS_I(inode)->outstanding_extents != 2) {
>                 ret = -EINVAL;
> -               test_err("miscount, wanted 2, got %u",
> +               test_err("miscount, wanted 2, got %llu",
>                          BTRFS_I(inode)->outstanding_extents);
>                 goto out;
>         }
> @@ -978,7 +978,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
>         }
>         if (BTRFS_I(inode)->outstanding_extents != 2) {
>                 ret = -EINVAL;
> -               test_err("miscount, wanted 2, got %u",
> +               test_err("miscount, wanted 2, got %llu",
>                          BTRFS_I(inode)->outstanding_extents);
>                 goto out;
>         }
> @@ -996,7 +996,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
>         }
>         if (BTRFS_I(inode)->outstanding_extents != 4) {
>                 ret = -EINVAL;
> -               test_err("miscount, wanted 4, got %u",
> +               test_err("miscount, wanted 4, got %llu",
>                          BTRFS_I(inode)->outstanding_extents);
>                 goto out;
>         }
> @@ -1013,7 +1013,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
>         }
>         if (BTRFS_I(inode)->outstanding_extents != 3) {
>                 ret = -EINVAL;
> -               test_err("miscount, wanted 3, got %u",
> +               test_err("miscount, wanted 3, got %llu",
>                          BTRFS_I(inode)->outstanding_extents);
>                 goto out;
>         }
> @@ -1029,7 +1029,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
>         }
>         if (BTRFS_I(inode)->outstanding_extents != 4) {
>                 ret = -EINVAL;
> -               test_err("miscount, wanted 4, got %u",
> +               test_err("miscount, wanted 4, got %llu",
>                          BTRFS_I(inode)->outstanding_extents);
>                 goto out;
>         }
> @@ -1047,7 +1047,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
>         }
>         if (BTRFS_I(inode)->outstanding_extents != 3) {
>                 ret = -EINVAL;
> -               test_err("miscount, wanted 3, got %u",
> +               test_err("miscount, wanted 3, got %llu",
>                          BTRFS_I(inode)->outstanding_extents);
>                 goto out;
>         }
> @@ -1061,7 +1061,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
>         }
>         if (BTRFS_I(inode)->outstanding_extents) {
>                 ret = -EINVAL;
> -               test_err("miscount, wanted 0, got %u",
> +               test_err("miscount, wanted 0, got %llu",
>                          BTRFS_I(inode)->outstanding_extents);
>                 goto out;
>         }
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index 8ad7a2d76c1d..caabdc8d9eed 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -2003,15 +2003,15 @@ DEFINE_EVENT(btrfs__prelim_ref, btrfs_prelim_ref_insert,
>  );
>
>  TRACE_EVENT(btrfs_inode_mod_outstanding_extents,
> -       TP_PROTO(const struct btrfs_root *root, u64 ino, int mod, unsigned outstanding),
> +       TP_PROTO(const struct btrfs_root *root, u64 ino, s64 mod, u64 outstanding),
>
>         TP_ARGS(root, ino, mod, outstanding),
>
>         TP_STRUCT__entry_btrfs(
>                 __field(        u64, root_objectid      )
>                 __field(        u64, ino                )
> -               __field(        int, mod                )
> -               __field(        unsigned, outstanding   )
> +               __field(        s64, mod                )
> +               __field(        u64, outstanding        )
>         ),
>
>         TP_fast_assign_btrfs(root->fs_info,
> @@ -2021,7 +2021,7 @@ TRACE_EVENT(btrfs_inode_mod_outstanding_extents,
>                 __entry->outstanding    = outstanding;
>         ),
>
> -       TP_printk_btrfs("root=%llu(%s) ino=%llu mod=%d outstanding=%u",
> +       TP_printk_btrfs("root=%llu(%s) ino=%llu mod=%lld outstanding=%llu",
>                         show_root_type(__entry->root_objectid),
>                         __entry->ino, __entry->mod, __entry->outstanding)
>  );
> --
> 2.53.0
>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 5/5] btrfs: cap shrink_delalloc iterations to 128M
  2026-04-03 22:27 ` [PATCH v2 5/5] btrfs: cap shrink_delalloc iterations to 128M Boris Burkov
@ 2026-04-06 18:08   ` Filipe Manana
  0 siblings, 0 replies; 12+ messages in thread
From: Filipe Manana @ 2026-04-06 18:08 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Fri, Apr 3, 2026 at 11:30 PM Boris Burkov <boris@bur.io> wrote:
>
> Even with more accurate delayed_refs reservations, preemptive reclaim is
> not perfect and we might generate tickets, especially in cases with a
> very large flood of writeback outstanding.
>
> Ultimately, if we do get into a situation with tickets pending and async
> reclaim blocking the system, we want to try to make as much progress as
> quickly as possible to unblock tasks. We want space reclaim to be
> effective, and to have a good chance at making progress, but not to
> block arbitrarily as this leads to untenable syscall latencies, long
> commits, and even hung task warnings.
>
> I traced such cases of heavy writeback async reclaim hung tasks and
> observed that we were blocking for long periods of time in
> shrink_delalloc(). This was particularly bad when doing writeback of
> incompressible data with the compress-force mount option.
>
> e.g.
> dd if=/dev/urandom of=urandom.seed bs=1G count=1
> dd if=urandom.seed of=urandom.big bs=1G count=300
>
> shrink_delalloc() computes to_reclaim as delalloc_bytes >> 3. With
> hundreds of gigs of delalloc (again imagine a large dirty_ratio and lots
> of ram), this is still 10-20+ GiB. Particularly in the wait phases, this
> can be quite slow, and generates even more delayed-refs as mentioned in
> the previous patch, so it doesn't even help that much with the immediate
> space shortfall.
>
> We do satisfy some tickets, but we are ultimately keep the system in
> essentially the same state, and with long stalling reclaim calls into
> shrink_delalloc().
>
> It would be much better to start some good chunk of I/O and also to work
> through the new delayed_refs and keep things moving through the system
> while releasing the conservative over-estimated metadata reservations.
>
> To acheive this, tighten up the delalloc work to be in units of the
> maximum extent size. If we issue 128MiB of delalloc, we don't leave too
> much (any?) extent merging on the table, but don't ever block on
> pathological 10GiB+ chunks of delalloc. If we do detect that we
> satisfied a ticket, break out of shrink_delalloc() and run some of the
> new delayed_refs as well before going again. This way we strike a nice
> balance of making delalloc progress, but not at the cost of every other
> sort of reservation, as they all feed into each other.
>
> This means iterating over to_reclaim by 128MiB at a time until it is
> drained or we satisfy a ticket, rather than trying 3 times to do the
> whole thing.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/space-info.c | 31 +++++++++++++++++++++----------
>  1 file changed, 21 insertions(+), 10 deletions(-)
>
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index f0436eea1544..15c5be97a123 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -727,7 +727,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
>         u64 ordered_bytes;
>         u64 items;
>         long time_left;
> -       int loops;
> +       u64 orig_tickets_id;
>
>         delalloc_bytes = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
>         ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);
> @@ -735,9 +735,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
>                 return;
>
>         /* Calc the number of the pages we need flush for space reservation */
> -       if (to_reclaim == U64_MAX) {
> -               items = U64_MAX;
> -       } else {
> +       if (to_reclaim != U64_MAX) {
>                 /*
>                  * to_reclaim is set to however much metadata we need to
>                  * reclaim, but reclaiming that much data doesn't really track
> @@ -751,7 +749,6 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
>                  * aggressive.
>                  */
>                 to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
> -               items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
>         }
>
>         trans = current->journal_info;
> @@ -764,12 +761,17 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
>         if (ordered_bytes > delalloc_bytes && !for_preempt)
>                 wait_ordered = true;
>
> -       loops = 0;
> -       while ((delalloc_bytes || ordered_bytes) && loops < 3) {
> -               u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
> -               long nr_pages = min_t(u64, temp, LONG_MAX);
> +       spin_lock(&space_info->lock);
> +       orig_tickets_id = space_info->tickets_id;
> +       spin_unlock(&space_info->lock);
> +
> +       while ((delalloc_bytes || ordered_bytes) && to_reclaim) {
> +               u64 iter_reclaim = min_t(u64, to_reclaim, SZ_128M);
> +               long nr_pages = min_t(u64, delalloc_bytes, iter_reclaim) >> PAGE_SHIFT;
>                 int async_pages;
>
> +               items = calc_reclaim_items_nr(fs_info, iter_reclaim) * 2;

We can also move the 'items' declaration inside this loop instead of top level.

Otherwise:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

> +
>                 btrfs_start_delalloc_roots(fs_info, nr_pages, true);
>
>                 /*
> @@ -811,7 +813,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
>                            atomic_read(&fs_info->async_delalloc_pages) <=
>                            async_pages);
>  skip_async:
> -               loops++;
> +               to_reclaim -= iter_reclaim;
>                 if (wait_ordered && !trans) {
>                         btrfs_wait_ordered_roots(fs_info, items, NULL);
>                 } else {
> @@ -834,6 +836,15 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
>                         spin_unlock(&space_info->lock);
>                         break;
>                 }
> +               /*
> +                * If a ticket was satisfied since we started, break out
> +                * so the async reclaim state machine can process delayed
> +                * refs before we flush more delalloc.
> +                */
> +               if (space_info->tickets_id != orig_tickets_id) {
> +                       spin_unlock(&space_info->lock);
> +                       break;
> +               }
>                 spin_unlock(&space_info->lock);
>
>                 delalloc_bytes = percpu_counter_sum_positive(
> --
> 2.53.0
>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2026-04-06 18:08 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-03 22:27 [PATCH v2 0/5] btrfs: improve stalls under sudden writeback Boris Burkov
2026-04-03 22:27 ` [PATCH v2 1/5] btrfs: reserve space for delayed_refs in delalloc Boris Burkov
2026-04-06 17:20   ` Filipe Manana
2026-04-06 17:40     ` Boris Burkov
2026-04-03 22:27 ` [PATCH v2 2/5] btrfs: account for csum " Boris Burkov
2026-04-06 17:33   ` Filipe Manana
2026-04-03 22:27 ` [PATCH v2 3/5] btrfs: account for compression in delalloc extent reservation Boris Burkov
2026-04-06 17:44   ` Filipe Manana
2026-04-03 22:27 ` [PATCH v2 4/5] btrfs: make inode->outstanding_extents a u64 Boris Burkov
2026-04-06 17:55   ` Filipe Manana
2026-04-03 22:27 ` [PATCH v2 5/5] btrfs: cap shrink_delalloc iterations to 128M Boris Burkov
2026-04-06 18:08   ` Filipe Manana

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox