* [PATCH v4 1/4] btrfs: reserve space for delayed_refs in delalloc
2026-04-09 17:48 [PATCH v4 0/4] btrfs: improve stalls under sudden writeback Boris Burkov
@ 2026-04-09 17:48 ` Boris Burkov
2026-04-10 16:07 ` Filipe Manana
2026-04-09 17:48 ` [PATCH v4 2/4] btrfs: account for compression in delalloc extent reservation Boris Burkov
` (3 subsequent siblings)
4 siblings, 1 reply; 18+ messages in thread
From: Boris Burkov @ 2026-04-09 17:48 UTC (permalink / raw)
To: linux-btrfs, kernel-team
delalloc uses a per-inode block_rsv to perform metadata reservations for
the cow operations it anticipates based on the number of outstanding
extents. This calculation is done based on inode->outstanding_extents in
btrfs_calculate_inode_block_rsv_size(). The reservation is *not*
meticulously tracked as each ordered_extent is actually created in
writeback; rather, delalloc attempts to over-estimate, and the
writeback and ordered_extent finishing paths are responsible for
releasing all of the reservation.
However, there is a notable gap in this reservation: it reserves no
space for the resulting delayed_refs. Compared to how
btrfs_start_transaction() reservations work, this is a notable
difference.
As writeback actually occurs, and we trigger btrfs_finish_one_ordered(),
that function will start generating delayed refs, which will draw from
the trans_handle's delayed_refs_rsv via btrfs_update_delayed_refs_rsv():
For example, we can trace the primary data delayed ref:
btrfs_finish_one_ordered()
insert_ordered_extent_file_extent()
insert_reserved_file_extent()
btrfs_alloc_reserved_file_extent()
btrfs_add_delayed_data_ref()
add_delayed_ref()
btrfs_update_delayed_refs_rsv();
This trans_handle was created in btrfs_finish_one_ordered() with
btrfs_join_transaction(), which calls start_transaction() with
num_items=0 and BTRFS_RESERVE_NO_FLUSH. As a result, this trans_handle
has nothing reserved in h->delayed_rsv, as neither the num_items
reservation nor the btrfs_delayed_refs_rsv_refill() reservation is run.
Thus, when btrfs_update_delayed_refs_rsv() runs, reserved_bytes is 0, so
fs_info->delayed_refs_rsv->size grows but fs_info->delayed_refs_rsv->reserved
does not.
If a large amount of writeback happens all at once (perhaps due to
dirty_ratio being tuned too high), this results in, among other things,
erroneous assessments of how much is reserved for delayed_refs in the
metadata space reclaim logic. need_preemptive_reclaim() relies on
fs_info->delayed_refs_rsv->reserved, and even worse,
btrfs_preempt_reclaim_metadata_space() makes poor decisions because it
computes delalloc_size like so:
block_rsv_size = global_rsv_size +
btrfs_block_rsv_reserved(delayed_block_rsv) +
btrfs_block_rsv_reserved(delayed_refs_rsv) +
btrfs_block_rsv_reserved(trans_rsv);
delalloc_size = bytes_may_use - block_rsv_size;
So all that lost delayed refs usage gets accounted as delalloc_size and
leads to preemptive reclaim continuously choosing FLUSH_DELALLOC, which
further exacerbates the problem.
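As a rough illustration (a toy user-space model with invented names and
made-up numbers, not the actual kernel code), the sketch below shows how
delayed ref usage that only grows ->size ends up being misattributed to
delalloc by the arithmetic above:
#include <stdint.h>
#include <stdio.h>
/* Toy model: only the two block_rsv fields that matter here. */
struct toy_rsv {
        uint64_t size;          /* how much we think we will need */
        uint64_t reserved;      /* how much we actually set aside */
};
int main(void)
{
        /* Illustrative numbers only. */
        uint64_t bytes_may_use = 8ULL << 30;    /* total metadata reservations */
        uint64_t global_rsv_size = 512ULL << 20;
        uint64_t delayed_block_rsv_reserved = 0;
        uint64_t trans_rsv_reserved = 0;
        struct toy_rsv delayed_refs_rsv = { .size = 0, .reserved = 0 };
        /*
         * Ordered extents finishing under btrfs_join_transaction(): each new
         * delayed ref grows ->size, but with nothing to migrate from the
         * joining handle, ->reserved never moves.
         */
        delayed_refs_rsv.size += 6ULL << 30;
        /* The preemptive reclaim arithmetic quoted above. */
        uint64_t block_rsv_size = global_rsv_size +
                                  delayed_block_rsv_reserved +
                                  delayed_refs_rsv.reserved +   /* still 0 */
                                  trans_rsv_reserved;
        uint64_t delalloc_size = bytes_may_use - block_rsv_size;
        /* Prints 7680 MiB even if the true delalloc is far smaller. */
        printf("apparent delalloc_size = %llu MiB\n",
               (unsigned long long)(delalloc_size >> 20));
        return 0;
}
With a view like that, preemptive reclaim keeps picking FLUSH_DELALLOC even
though most of the pressure is really pending delayed ref updates.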
With enough writeback in flight, we can accumulate enough delalloc to
get into async reclaim, which starts blocking start_transaction() and
eventually hits FLUSH_DELALLOC_WAIT/FLUSH_DELALLOC_FULL. At that point
the filesystem is heavily blocked on metadata space in reserve_space(),
stalling all new transaction work until all the ordered_extents finish.
If we had an accurate view of the reservation for delayed refs, then we
could mostly break this feedback loop in preemptive reclaim, and
generally would be able to make more accurate decisions with regards to
metadata space reclamation.
This patch adds extra metadata reservation to the inode's block_rsv to
account for the delayed refs. When the ordered_extent finishes and we
are about to do work in the transaction that uses delayed refs, we
migrate enough for 1 extent. Since this is not necessarily perfect, we
have to be careful and do a "soft" migrate which succeeds even if there
is not enough reservation. This is strictly better than what we have and
also matches how the delayed ref rsv gets used in the transaction at
btrfs_update_delayed_refs_rsv().
Aside from this data delayed_ref, there are also some metadata
delayed_refs to consider. These are:
- subvolume tree for the file extent item
- csum tree for data csums
- raid stripe tree if enabled
- free space tree if enabled
So account for those delayed_refs in the reservation as well. This
greatly increases the size of the reservation as each metadata cow
results in two delayed refs: one add for the new block in
btrfs_alloc_tree_block() and one drop for the old in
btrfs_free_tree_block(). As a result, to be completely conservative,
we need to reserve 2 delayed refs worth of space for each cow.
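For a concrete feel for the size (illustrative numbers only, assuming a
16 KiB nodesize where btrfs_calc_insert_metadata_size(fs_info, 1) comes
out to roughly 256 KiB): with data csums and the free space tree enabled
but no raid stripe tree, the factor is 4 + 2 + 2 = 8, so each
outstanding extent carries roughly 8 * 256 KiB = 2 MiB of extra delayed
ref reservation on top of the existing insert and csum reservations.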
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/delalloc-space.c | 52 +++++++++++++++++++++++++++++++++++++++
fs/btrfs/delalloc-space.h | 3 +++
fs/btrfs/inode.c | 2 ++
fs/btrfs/transaction.c | 36 +++++++++++----------------
4 files changed, 72 insertions(+), 21 deletions(-)
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index 0970799d0aa4..63e174cc9393 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -3,11 +3,13 @@
#include "messages.h"
#include "ctree.h"
#include "delalloc-space.h"
+#include "delayed-ref.h"
#include "block-rsv.h"
#include "btrfs_inode.h"
#include "space-info.h"
#include "qgroup.h"
#include "fs.h"
+#include "transaction.h"
/*
* HOW DOES THIS WORK
@@ -247,6 +249,35 @@ static void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
qgroup_to_release);
}
+/*
+ * Each delalloc extent could become an ordered_extent and end up inserting a
+ * new data extent and modifying a number of btrees. Each of those is associated
+ * with adding delayed refs, which need a corresponding delayed refs reservation.
+ *
+ * Each metadata cow operation results in an add and a drop delayed ref, both of
+ * which call add_delayed_ref() and ultimately btrfs_update_delayed_refs_rsv(),
+ * so each must account for 2 delayed refs.
+ */
+static u64 delalloc_calc_delayed_refs_rsv(const struct btrfs_inode *inode, u64 nr_extents)
+{
+ const struct btrfs_fs_info *fs_info = inode->root->fs_info;
+ /*
+ * Factor for how many delayed refs updates we will generate per extent.
+ * Non-optional: extent tree, subvolume tree
+ */
+ int factor = 4;
+
+ /* The remaining trees are only written to conditionally. */
+ if (!(inode->flags & BTRFS_INODE_NODATASUM))
+ factor += 2;
+ if (btrfs_test_opt(fs_info, FREE_SPACE_TREE))
+ factor += 2;
+ if (btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE))
+ factor += 2;
+
+ return btrfs_calc_insert_metadata_size(fs_info, nr_extents) * factor;
+}
+
static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
struct btrfs_inode *inode)
{
@@ -266,6 +297,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
reserve_size = btrfs_calc_insert_metadata_size(fs_info,
outstanding_extents);
reserve_size += btrfs_calc_metadata_size(fs_info, 1);
+ reserve_size += delalloc_calc_delayed_refs_rsv(inode, outstanding_extents);
}
if (!(inode->flags & BTRFS_INODE_NODATASUM)) {
u64 csum_leaves;
@@ -309,9 +341,29 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
* for an inode update.
*/
*meta_reserve += inode_update;
+
+ *meta_reserve += delalloc_calc_delayed_refs_rsv(inode, nr_extents);
+
*qgroup_reserve = nr_extents * fs_info->nodesize;
}
+void btrfs_delalloc_migrate_delayed_refs_rsv(struct btrfs_trans_handle *trans,
+ struct btrfs_inode *inode)
+{
+ struct btrfs_block_rsv *inode_rsv = &inode->block_rsv;
+ struct btrfs_block_rsv *trans_rsv = &trans->delayed_rsv;
+ u64 num_bytes = delalloc_calc_delayed_refs_rsv(inode, 1);
+
+ spin_lock(&inode_rsv->lock);
+ num_bytes = min(num_bytes, inode_rsv->reserved);
+ inode_rsv->reserved -= num_bytes;
+ inode_rsv->full = (inode_rsv->reserved >= inode_rsv->size);
+ spin_unlock(&inode_rsv->lock);
+
+ btrfs_block_rsv_add_bytes(trans_rsv, num_bytes, true);
+ trans->delayed_refs_bytes_reserved += num_bytes;
+}
+
int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
u64 disk_num_bytes, bool noflush)
{
diff --git a/fs/btrfs/delalloc-space.h b/fs/btrfs/delalloc-space.h
index 6119c0d3f883..bd7041166987 100644
--- a/fs/btrfs/delalloc-space.h
+++ b/fs/btrfs/delalloc-space.h
@@ -8,6 +8,7 @@
struct extent_changeset;
struct btrfs_inode;
struct btrfs_fs_info;
+struct btrfs_trans_handle;
int btrfs_alloc_data_chunk_ondemand(const struct btrfs_inode *inode, u64 bytes);
int btrfs_check_data_free_space(struct btrfs_inode *inode,
@@ -27,5 +28,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
u64 disk_num_bytes, bool noflush);
void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len);
+void btrfs_delalloc_migrate_delayed_refs_rsv(struct btrfs_trans_handle *trans,
+ struct btrfs_inode *inode);
#endif /* BTRFS_DELALLOC_SPACE_H */
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 40474014c03f..15945744a304 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -653,6 +653,7 @@ static noinline int __cow_file_range_inline(struct btrfs_inode *inode,
goto out;
}
trans->block_rsv = &inode->block_rsv;
+ btrfs_delalloc_migrate_delayed_refs_rsv(trans, inode);
drop_args.path = path;
drop_args.start = 0;
@@ -3259,6 +3260,7 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
}
trans->block_rsv = &inode->block_rsv;
+ btrfs_delalloc_migrate_delayed_refs_rsv(trans, inode);
ret = btrfs_insert_raid_extent(trans, ordered_extent);
if (unlikely(ret)) {
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 248adb785051..55791bb100a2 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1047,29 +1047,23 @@ static void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans)
return;
}
- if (!trans->bytes_reserved) {
- ASSERT(trans->delayed_refs_bytes_reserved == 0,
- "trans->delayed_refs_bytes_reserved=%llu",
- trans->delayed_refs_bytes_reserved);
- return;
+ if (trans->bytes_reserved) {
+ ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
+ trace_btrfs_space_reservation(fs_info, "transaction",
+ trans->transid, trans->bytes_reserved, 0);
+ btrfs_block_rsv_release(fs_info, trans->block_rsv,
+ trans->bytes_reserved, NULL);
+ trans->bytes_reserved = 0;
}
- ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
- trace_btrfs_space_reservation(fs_info, "transaction",
- trans->transid, trans->bytes_reserved, 0);
- btrfs_block_rsv_release(fs_info, trans->block_rsv,
- trans->bytes_reserved, NULL);
- trans->bytes_reserved = 0;
-
- if (!trans->delayed_refs_bytes_reserved)
- return;
-
- trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
- trans->transid,
- trans->delayed_refs_bytes_reserved, 0);
- btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
- trans->delayed_refs_bytes_reserved, NULL);
- trans->delayed_refs_bytes_reserved = 0;
+ if (trans->delayed_refs_bytes_reserved) {
+ trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
+ trans->transid,
+ trans->delayed_refs_bytes_reserved, 0);
+ btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
+ trans->delayed_refs_bytes_reserved, NULL);
+ trans->delayed_refs_bytes_reserved = 0;
+ }
}
static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
--
2.53.0
* Re: [PATCH v4 1/4] btrfs: reserve space for delayed_refs in delalloc
2026-04-09 17:48 ` [PATCH v4 1/4] btrfs: reserve space for delayed_refs in delalloc Boris Burkov
@ 2026-04-10 16:07 ` Filipe Manana
0 siblings, 0 replies; 18+ messages in thread
From: Filipe Manana @ 2026-04-10 16:07 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs, kernel-team
On Thu, Apr 9, 2026 at 6:49 PM Boris Burkov <boris@bur.io> wrote:
> [...]
> Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Thanks.
* [PATCH v4 2/4] btrfs: account for compression in delalloc extent reservation
2026-04-09 17:48 [PATCH v4 0/4] btrfs: improve stalls under sudden writeback Boris Burkov
2026-04-09 17:48 ` [PATCH v4 1/4] btrfs: reserve space for delayed_refs in delalloc Boris Burkov
@ 2026-04-09 17:48 ` Boris Burkov
2026-04-09 17:48 ` [PATCH v4 3/4] btrfs: make inode->outstanding_extents a u64 Boris Burkov
` (2 subsequent siblings)
4 siblings, 0 replies; 18+ messages in thread
From: Boris Burkov @ 2026-04-09 17:48 UTC (permalink / raw)
To: linux-btrfs, kernel-team
The btrfs maximum uncompressed extent size is 128MiB. The maximum
compressed extent size in file extent space is 128KiB. Therefore, the
estimate for outstanding_extents is off by 3 orders of magnitude when
COMPRESS_FORCE is set or the inode is set to always compress.
Because we use re-calculation when necessary, rather than super detailed
extent tracking, we don't grow this reservation as the true number of
extents is revealed. We don't want to be too clever with it, however, as
we don't want the calculation to change for a given inode between
reservation and release, so we only rely on the forcing type flags.
With this change, we no longer under-reserve delayed refs reservations
for delalloc writes, even with compress-force.
Because this would turn count_max_extents() into a named shim for
div_u64(size + max_extent_size - 1, max_extent_size);
we can just get rid of it.
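For example, dirtying a contiguous 1 GiB range with compress-force now
reserves for 1 GiB / 128 KiB = 8192 outstanding extents, where the old
estimate of 1 GiB / 128 MiB = 8 would have under-counted by the three
orders of magnitude described above.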
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/btrfs_inode.h | 2 +
fs/btrfs/delalloc-space.c | 12 +++---
fs/btrfs/fs.h | 13 -------
fs/btrfs/inode.c | 78 ++++++++++++++++++++++++++++++++-------
4 files changed, 72 insertions(+), 33 deletions(-)
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 55c272fe5d92..5368ef87b41a 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -510,6 +510,8 @@ static inline bool btrfs_inode_can_compress(const struct btrfs_inode *inode)
return true;
}
+u64 btrfs_inode_max_extents(const struct btrfs_inode *inode, u64 size);
+
static inline void btrfs_assert_inode_locked(struct btrfs_inode *inode)
{
/* Immediately trigger a crash if the inode is not locked. */
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index 63e174cc9393..609ec75884cd 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -65,7 +65,7 @@
* This is the number of file extent items we'll need to handle all of the
* outstanding DELALLOC space we have in this inode. We limit the maximum
* size of an extent, so a large contiguous dirty area may require more than
- * one outstanding_extent, which is why count_max_extents() is used to
+ * one outstanding_extent, which is why we use the max extent size to
* determine how many outstanding_extents get added.
*
* ->csum_bytes
@@ -324,7 +324,7 @@ static void calc_inode_reservations(struct btrfs_inode *inode,
u64 *meta_reserve, u64 *qgroup_reserve)
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
- u64 nr_extents = count_max_extents(fs_info, num_bytes);
+ u64 nr_extents = btrfs_inode_max_extents(inode, num_bytes);
u64 csum_leaves;
u64 inode_update = btrfs_calc_metadata_size(fs_info, 1);
@@ -423,7 +423,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
* racing with an ordered completion or some such that would think it
* needs to free the reservation we just made.
*/
- nr_extents = count_max_extents(fs_info, num_bytes);
+ nr_extents = btrfs_inode_max_extents(inode, num_bytes);
spin_lock(&inode->lock);
btrfs_mod_outstanding_extents(inode, nr_extents);
if (!(inode->flags & BTRFS_INODE_NODATASUM))
@@ -490,7 +490,7 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
unsigned num_extents;
spin_lock(&inode->lock);
- num_extents = count_max_extents(fs_info, num_bytes);
+ num_extents = btrfs_inode_max_extents(inode, num_bytes);
btrfs_mod_outstanding_extents(inode, -num_extents);
btrfs_calculate_inode_block_rsv_size(fs_info, inode);
spin_unlock(&inode->lock);
@@ -505,8 +505,8 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len)
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
- const u32 reserved_num_extents = count_max_extents(fs_info, reserved_len);
- const u32 new_num_extents = count_max_extents(fs_info, new_len);
+ const u32 reserved_num_extents = btrfs_inode_max_extents(inode, reserved_len);
+ const u32 new_num_extents = btrfs_inode_max_extents(inode, new_len);
const int diff_num_extents = new_num_extents - reserved_num_extents;
ASSERT(new_len <= reserved_len);
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index a4758d94b32e..2c1626155645 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -1051,19 +1051,6 @@ static inline bool btrfs_is_zoned(const struct btrfs_fs_info *fs_info)
return IS_ENABLED(CONFIG_BLK_DEV_ZONED) && fs_info->zone_size > 0;
}
-/*
- * Count how many fs_info->max_extent_size cover the @size
- */
-static inline u32 count_max_extents(const struct btrfs_fs_info *fs_info, u64 size)
-{
-#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
- if (!fs_info)
- return div_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, BTRFS_MAX_EXTENT_SIZE);
-#endif
-
- return div_u64(size + fs_info->max_extent_size - 1, fs_info->max_extent_size);
-}
-
static inline unsigned int btrfs_blocks_per_folio(const struct btrfs_fs_info *fs_info,
const struct folio *folio)
{
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 15945744a304..55255f3794c6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -747,6 +747,56 @@ static int add_async_extent(struct async_chunk *cow, u64 start, u64 ram_size,
return 0;
}
+/*
+ * Check if compression will definitely be attempted for this inode based on
+ * mount options and inode properties. Unlike inode_need_compress(), this does
+ * NOT run the compression heuristic or check range-specific conditions, so it
+ * is safe to call under locks (e.g. io_tree lock) and for reservation sizing.
+ *
+ * Only returns true for cases where BTRFS_INODE_NOCOMPRESS cannot be set at
+ * runtime (FORCE_COMPRESS and prop_compress), ensuring that the effective max
+ * extent size is stable across paired set/clear delalloc operations.
+ */
+static inline bool inode_may_compress(const struct btrfs_inode *inode)
+{
+ if (!btrfs_inode_can_compress(inode))
+ return false;
+
+ /* Force compress always attempts compression. */
+ if (btrfs_test_opt(inode->root->fs_info, FORCE_COMPRESS))
+ return true;
+
+ /* Per-inode property: NOCOMPRESS cannot override this. */
+ if (inode->prop_compress)
+ return true;
+
+ return false;
+}
+
+/*
+ * Return the effective maximum extent size for reservation accounting.
+ *
+ * When compression is guaranteed to be attempted (FORCE_COMPRESS or
+ * prop_compress), the compression path splits ranges into
+ * BTRFS_MAX_UNCOMPRESSED chunks, each producing an independent ordered
+ * extent. Use that as the divisor instead of fs_info->max_extent_size
+ * to avoid severely undercounting outstanding extents.
+ */
+static u64 btrfs_inode_max_extent_size(const struct btrfs_inode *inode)
+{
+ if (inode_may_compress(inode))
+ return BTRFS_MAX_UNCOMPRESSED;
+
+ return inode->root->fs_info->max_extent_size;
+}
+
+u64 btrfs_inode_max_extents(const struct btrfs_inode *inode, u64 size)
+{
+ u64 max_extent_size = btrfs_inode_max_extent_size(inode);
+
+ return div_u64(size + max_extent_size - 1, max_extent_size);
+}
+
/*
* Check if the inode needs to be submitted to compression, based on mount
* options, defragmentation, properties or heuristics.
@@ -2459,8 +2509,8 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct folio *locked_fol
void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
struct extent_state *orig, u64 split)
{
- struct btrfs_fs_info *fs_info = inode->root->fs_info;
u64 size;
+ u64 max_extent_size = btrfs_inode_max_extent_size(inode);
lockdep_assert_held(&inode->io_tree.lock);
@@ -2469,8 +2519,8 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
return;
size = orig->end - orig->start + 1;
- if (size > fs_info->max_extent_size) {
- u32 num_extents;
+ if (size > max_extent_size) {
+ u64 num_extents;
u64 new_size;
/*
@@ -2478,10 +2528,10 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
* applies here, just in reverse.
*/
new_size = orig->end - split + 1;
- num_extents = count_max_extents(fs_info, new_size);
+ num_extents = btrfs_inode_max_extents(inode, new_size);
new_size = split - orig->start;
- num_extents += count_max_extents(fs_info, new_size);
- if (count_max_extents(fs_info, size) >= num_extents)
+ num_extents += btrfs_inode_max_extents(inode, new_size);
+ if (btrfs_inode_max_extents(inode, size) >= num_extents)
return;
}
@@ -2498,9 +2548,9 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state *new,
struct extent_state *other)
{
- struct btrfs_fs_info *fs_info = inode->root->fs_info;
u64 new_size, old_size;
- u32 num_extents;
+ u64 max_extent_size = btrfs_inode_max_extent_size(inode);
+ u64 num_extents;
lockdep_assert_held(&inode->io_tree.lock);
@@ -2514,7 +2564,7 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
new_size = other->end - new->start + 1;
/* we're not bigger than the max, unreserve the space and go */
- if (new_size <= fs_info->max_extent_size) {
+ if (new_size <= max_extent_size) {
spin_lock(&inode->lock);
btrfs_mod_outstanding_extents(inode, -1);
spin_unlock(&inode->lock);
@@ -2540,10 +2590,10 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
* this case.
*/
old_size = other->end - other->start + 1;
- num_extents = count_max_extents(fs_info, old_size);
+ num_extents = btrfs_inode_max_extents(inode, old_size);
old_size = new->end - new->start + 1;
- num_extents += count_max_extents(fs_info, old_size);
- if (count_max_extents(fs_info, new_size) >= num_extents)
+ num_extents += btrfs_inode_max_extents(inode, old_size);
+ if (btrfs_inode_max_extents(inode, new_size) >= num_extents)
return;
spin_lock(&inode->lock);
@@ -2616,7 +2666,7 @@ void btrfs_set_delalloc_extent(struct btrfs_inode *inode, struct extent_state *s
if (!(state->state & EXTENT_DELALLOC) && (bits & EXTENT_DELALLOC)) {
u64 len = state->end + 1 - state->start;
u64 prev_delalloc_bytes;
- u32 num_extents = count_max_extents(fs_info, len);
+ u32 num_extents = btrfs_inode_max_extents(inode, len);
spin_lock(&inode->lock);
btrfs_mod_outstanding_extents(inode, num_extents);
@@ -2662,7 +2712,7 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
u64 len = state->end + 1 - state->start;
- u32 num_extents = count_max_extents(fs_info, len);
+ u32 num_extents = btrfs_inode_max_extents(inode, len);
lockdep_assert_held(&inode->io_tree.lock);
--
2.53.0
* [PATCH v4 3/4] btrfs: make inode->outstanding_extents a u64
2026-04-09 17:48 [PATCH v4 0/4] btrfs: improve stalls under sudden writeback Boris Burkov
2026-04-09 17:48 ` [PATCH v4 1/4] btrfs: reserve space for delayed_refs in delalloc Boris Burkov
2026-04-09 17:48 ` [PATCH v4 2/4] btrfs: account for compression in delalloc extent reservation Boris Burkov
@ 2026-04-09 17:48 ` Boris Burkov
2026-04-13 18:43 ` David Sterba
2026-04-09 17:48 ` [PATCH v4 4/4] btrfs: cap shrink_delalloc iterations to 128M Boris Burkov
2026-04-13 18:41 ` [PATCH v4 0/4] btrfs: improve stalls under sudden writeback David Sterba
4 siblings, 1 reply; 18+ messages in thread
From: Boris Burkov @ 2026-04-09 17:48 UTC (permalink / raw)
To: linux-btrfs, kernel-team
The maximum file size is MAX_LFS_FILESIZE = (loff_t)LLONG_MAX.
As a result, the maximum number of outstanding extents in btrfs has
always been bounded above by LLONG_MAX / 128MiB, which is ~ 2^63 / 2^27.
That has never fit in a u32. With the recent changes to also divide by
128KiB in compressed cases, the bound is even higher. Whether or not it
is likely to happen in practice, I think it is nice to capture the
intent in the types, so change outstanding_extents to u64, and make
btrfs_mod_outstanding_extents assert some expectations about the size of
its inputs.
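For illustration: with 128 MiB extents the bound is 2^63 / 2^27 = 2^36
extents, and with 128 KiB compressed extents it is 2^63 / 2^17 = 2^46,
both comfortably beyond the 2^32 - 1 that a u32 can represent.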
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/btrfs_inode.h | 18 ++++++++++++++----
fs/btrfs/delalloc-space.c | 19 +++++++++----------
fs/btrfs/inode.c | 14 +++++++-------
fs/btrfs/ordered-data.c | 4 ++--
fs/btrfs/tests/inode-tests.c | 18 +++++++++---------
include/trace/events/btrfs.h | 8 ++++----
6 files changed, 45 insertions(+), 36 deletions(-)
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 5368ef87b41a..0d48f67eb5c7 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -180,7 +180,7 @@ struct btrfs_inode {
* items we think we'll end up using, and reserved_extents is the number
* of extent items we've reserved metadata for. Protected by 'lock'.
*/
- unsigned outstanding_extents;
+ u64 outstanding_extents;
/* used to order data wrt metadata */
spinlock_t ordered_tree_lock;
@@ -429,14 +429,24 @@ static inline bool is_data_inode(const struct btrfs_inode *inode)
}
static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode,
- int mod)
+ int mod, u64 nr_extents)
{
+ s64 delta = mod * (s64)nr_extents;
+
lockdep_assert_held(&inode->lock);
- inode->outstanding_extents += mod;
+ ASSERT(mod == 1 || mod == -1, "mod=%d", mod);
+ ASSERT(nr_extents <= S64_MAX, "nr_extents=%llu", nr_extents);
+ ASSERT(mod == -1 || inode->outstanding_extents <= U64_MAX - nr_extents,
+ "nr_extents=%llu, inode->outstanding_extents=%llu",
+ nr_extents, inode->outstanding_extents);
+ ASSERT(mod == 1 || inode->outstanding_extents >= nr_extents,
+ "nr_extents=%llu, inode->outstanding_extents=%llu",
+ nr_extents, inode->outstanding_extents);
+ inode->outstanding_extents += delta;
if (btrfs_is_free_space_inode(inode))
return;
trace_btrfs_inode_mod_outstanding_extents(inode->root, btrfs_ino(inode),
- mod, inode->outstanding_extents);
+ delta, inode->outstanding_extents);
}
/*
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index 609ec75884cd..31b33db8afe5 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -284,7 +284,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
u64 reserve_size = 0;
u64 qgroup_rsv_size = 0;
- unsigned outstanding_extents;
+ u64 outstanding_extents;
lockdep_assert_held(&inode->lock);
outstanding_extents = inode->outstanding_extents;
@@ -311,7 +311,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
*
* This is overestimating in most cases.
*/
- qgroup_rsv_size = (u64)outstanding_extents * fs_info->nodesize;
+ qgroup_rsv_size = outstanding_extents * fs_info->nodesize;
spin_lock(&block_rsv->lock);
block_rsv->size = reserve_size;
@@ -371,7 +371,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
struct btrfs_fs_info *fs_info = root->fs_info;
struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
u64 meta_reserve, qgroup_reserve;
- unsigned nr_extents;
+ u64 nr_extents;
enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
int ret = 0;
@@ -425,7 +425,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
*/
nr_extents = btrfs_inode_max_extents(inode, num_bytes);
spin_lock(&inode->lock);
- btrfs_mod_outstanding_extents(inode, nr_extents);
+ btrfs_mod_outstanding_extents(inode, 1, nr_extents);
if (!(inode->flags & BTRFS_INODE_NODATASUM))
inode->csum_bytes += disk_num_bytes;
btrfs_calculate_inode_block_rsv_size(fs_info, inode);
@@ -487,11 +487,11 @@ void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes,
void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
- unsigned num_extents;
+ u64 num_extents;
spin_lock(&inode->lock);
num_extents = btrfs_inode_max_extents(inode, num_bytes);
- btrfs_mod_outstanding_extents(inode, -num_extents);
+ btrfs_mod_outstanding_extents(inode, -1, num_extents);
btrfs_calculate_inode_block_rsv_size(fs_info, inode);
spin_unlock(&inode->lock);
@@ -505,16 +505,15 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len, u64 new_len)
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
- const u32 reserved_num_extents = btrfs_inode_max_extents(inode, reserved_len);
- const u32 new_num_extents = btrfs_inode_max_extents(inode, new_len);
- const int diff_num_extents = new_num_extents - reserved_num_extents;
+ const u64 reserved_num_extents = btrfs_inode_max_extents(inode, reserved_len);
+ const u64 new_num_extents = btrfs_inode_max_extents(inode, new_len);
ASSERT(new_len <= reserved_len);
if (new_num_extents == reserved_num_extents)
return;
spin_lock(&inode->lock);
- btrfs_mod_outstanding_extents(inode, diff_num_extents);
+ btrfs_mod_outstanding_extents(inode, -1, reserved_num_extents - new_num_extents);
btrfs_calculate_inode_block_rsv_size(fs_info, inode);
spin_unlock(&inode->lock);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 55255f3794c6..b45a92cfe94e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2536,7 +2536,7 @@ void btrfs_split_delalloc_extent(struct btrfs_inode *inode,
}
spin_lock(&inode->lock);
- btrfs_mod_outstanding_extents(inode, 1);
+ btrfs_mod_outstanding_extents(inode, 1, 1);
spin_unlock(&inode->lock);
}
@@ -2566,7 +2566,7 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
/* we're not bigger than the max, unreserve the space and go */
if (new_size <= max_extent_size) {
spin_lock(&inode->lock);
- btrfs_mod_outstanding_extents(inode, -1);
+ btrfs_mod_outstanding_extents(inode, -1, 1);
spin_unlock(&inode->lock);
return;
}
@@ -2597,7 +2597,7 @@ void btrfs_merge_delalloc_extent(struct btrfs_inode *inode, struct extent_state
return;
spin_lock(&inode->lock);
- btrfs_mod_outstanding_extents(inode, -1);
+ btrfs_mod_outstanding_extents(inode, -1, 1);
spin_unlock(&inode->lock);
}
@@ -2666,10 +2666,10 @@ void btrfs_set_delalloc_extent(struct btrfs_inode *inode, struct extent_state *s
if (!(state->state & EXTENT_DELALLOC) && (bits & EXTENT_DELALLOC)) {
u64 len = state->end + 1 - state->start;
u64 prev_delalloc_bytes;
- u32 num_extents = btrfs_inode_max_extents(inode, len);
+ u64 num_extents = btrfs_inode_max_extents(inode, len);
spin_lock(&inode->lock);
- btrfs_mod_outstanding_extents(inode, num_extents);
+ btrfs_mod_outstanding_extents(inode, 1, num_extents);
spin_unlock(&inode->lock);
/* For sanity tests */
@@ -2712,7 +2712,7 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
u64 len = state->end + 1 - state->start;
- u32 num_extents = btrfs_inode_max_extents(inode, len);
+ u64 num_extents = btrfs_inode_max_extents(inode, len);
lockdep_assert_held(&inode->io_tree.lock);
@@ -2732,7 +2732,7 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
u64 new_delalloc_bytes;
spin_lock(&inode->lock);
- btrfs_mod_outstanding_extents(inode, -num_extents);
+ btrfs_mod_outstanding_extents(inode, -1, num_extents);
spin_unlock(&inode->lock);
/*
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index e5a24b3ff95e..96ee8ebfdb92 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -223,7 +223,7 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
* smallest the extent is going to get.
*/
spin_lock(&inode->lock);
- btrfs_mod_outstanding_extents(inode, 1);
+ btrfs_mod_outstanding_extents(inode, 1, 1);
spin_unlock(&inode->lock);
out:
@@ -655,7 +655,7 @@ void btrfs_remove_ordered_extent(struct btrfs_ordered_extent *entry)
btrfs_lockdep_acquire(fs_info, btrfs_trans_pending_ordered);
/* This is paired with alloc_ordered_extent(). */
spin_lock(&btrfs_inode->lock);
- btrfs_mod_outstanding_extents(btrfs_inode, -1);
+ btrfs_mod_outstanding_extents(btrfs_inode, -1, 1);
spin_unlock(&btrfs_inode->lock);
if (root != fs_info->tree_root) {
u64 release;
diff --git a/fs/btrfs/tests/inode-tests.c b/fs/btrfs/tests/inode-tests.c
index b04fbcaf0a1d..e63afbb9be2b 100644
--- a/fs/btrfs/tests/inode-tests.c
+++ b/fs/btrfs/tests/inode-tests.c
@@ -931,7 +931,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
}
if (BTRFS_I(inode)->outstanding_extents != 1) {
ret = -EINVAL;
- test_err("miscount, wanted 1, got %u",
+ test_err("miscount, wanted 1, got %llu",
BTRFS_I(inode)->outstanding_extents);
goto out;
}
@@ -946,7 +946,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
}
if (BTRFS_I(inode)->outstanding_extents != 2) {
ret = -EINVAL;
- test_err("miscount, wanted 2, got %u",
+ test_err("miscount, wanted 2, got %llu",
BTRFS_I(inode)->outstanding_extents);
goto out;
}
@@ -962,7 +962,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
}
if (BTRFS_I(inode)->outstanding_extents != 2) {
ret = -EINVAL;
- test_err("miscount, wanted 2, got %u",
+ test_err("miscount, wanted 2, got %llu",
BTRFS_I(inode)->outstanding_extents);
goto out;
}
@@ -978,7 +978,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
}
if (BTRFS_I(inode)->outstanding_extents != 2) {
ret = -EINVAL;
- test_err("miscount, wanted 2, got %u",
+ test_err("miscount, wanted 2, got %llu",
BTRFS_I(inode)->outstanding_extents);
goto out;
}
@@ -996,7 +996,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
}
if (BTRFS_I(inode)->outstanding_extents != 4) {
ret = -EINVAL;
- test_err("miscount, wanted 4, got %u",
+ test_err("miscount, wanted 4, got %llu",
BTRFS_I(inode)->outstanding_extents);
goto out;
}
@@ -1013,7 +1013,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
}
if (BTRFS_I(inode)->outstanding_extents != 3) {
ret = -EINVAL;
- test_err("miscount, wanted 3, got %u",
+ test_err("miscount, wanted 3, got %llu",
BTRFS_I(inode)->outstanding_extents);
goto out;
}
@@ -1029,7 +1029,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
}
if (BTRFS_I(inode)->outstanding_extents != 4) {
ret = -EINVAL;
- test_err("miscount, wanted 4, got %u",
+ test_err("miscount, wanted 4, got %llu",
BTRFS_I(inode)->outstanding_extents);
goto out;
}
@@ -1047,7 +1047,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
}
if (BTRFS_I(inode)->outstanding_extents != 3) {
ret = -EINVAL;
- test_err("miscount, wanted 3, got %u",
+ test_err("miscount, wanted 3, got %llu",
BTRFS_I(inode)->outstanding_extents);
goto out;
}
@@ -1061,7 +1061,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
}
if (BTRFS_I(inode)->outstanding_extents) {
ret = -EINVAL;
- test_err("miscount, wanted 0, got %u",
+ test_err("miscount, wanted 0, got %llu",
BTRFS_I(inode)->outstanding_extents);
goto out;
}
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 8ad7a2d76c1d..caabdc8d9eed 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2003,15 +2003,15 @@ DEFINE_EVENT(btrfs__prelim_ref, btrfs_prelim_ref_insert,
);
TRACE_EVENT(btrfs_inode_mod_outstanding_extents,
- TP_PROTO(const struct btrfs_root *root, u64 ino, int mod, unsigned outstanding),
+ TP_PROTO(const struct btrfs_root *root, u64 ino, s64 mod, u64 outstanding),
TP_ARGS(root, ino, mod, outstanding),
TP_STRUCT__entry_btrfs(
__field( u64, root_objectid )
__field( u64, ino )
- __field( int, mod )
- __field( unsigned, outstanding )
+ __field( s64, mod )
+ __field( u64, outstanding )
),
TP_fast_assign_btrfs(root->fs_info,
@@ -2021,7 +2021,7 @@ TRACE_EVENT(btrfs_inode_mod_outstanding_extents,
__entry->outstanding = outstanding;
),
- TP_printk_btrfs("root=%llu(%s) ino=%llu mod=%d outstanding=%u",
+ TP_printk_btrfs("root=%llu(%s) ino=%llu mod=%lld outstanding=%llu",
show_root_type(__entry->root_objectid),
__entry->ino, __entry->mod, __entry->outstanding)
);
--
2.53.0
* Re: [PATCH v4 3/4] btrfs: make inode->outstanding_extents a u64
2026-04-09 17:48 ` [PATCH v4 3/4] btrfs: make inode->outstanding_extents a u64 Boris Burkov
@ 2026-04-13 18:43 ` David Sterba
0 siblings, 0 replies; 18+ messages in thread
From: David Sterba @ 2026-04-13 18:43 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs, kernel-team
On Thu, Apr 09, 2026 at 10:48:50AM -0700, Boris Burkov wrote:
> - inode->outstanding_extents += mod;
> + ASSERT(mod == 1 || mod == -1, "mod=%d", mod);
> + ASSERT(nr_extents <= S64_MAX, "nr_extents=%llu", nr_extents);
> + ASSERT(mod == -1 || inode->outstanding_extents <= U64_MAX - nr_extents,
> + "nr_extents=%llu, inode->outstanding_extents=%llu",
Small note: please don't put "," between the key=values. It's just
stylistic, but it keeps things consistent with the rest of the ASSERTs.
Thanks.
* [PATCH v4 4/4] btrfs: cap shrink_delalloc iterations to 128M
2026-04-09 17:48 [PATCH v4 0/4] btrfs: improve stalls under sudden writeback Boris Burkov
` (2 preceding siblings ...)
2026-04-09 17:48 ` [PATCH v4 3/4] btrfs: make inode->outstanding_extents a u64 Boris Burkov
@ 2026-04-09 17:48 ` Boris Burkov
2026-04-24 6:38 ` Qu Wenruo
2026-04-13 18:41 ` [PATCH v4 0/4] btrfs: improve stalls under sudden writeback David Sterba
4 siblings, 1 reply; 18+ messages in thread
From: Boris Burkov @ 2026-04-09 17:48 UTC (permalink / raw)
To: linux-btrfs, kernel-team
Even with more accurate delayed_refs reservations, preemptive reclaim is
not perfect and we might generate tickets, especially in cases with a
very large flood of writeback outstanding.
Ultimately, if we do get into a situation with tickets pending and async
reclaim blocking the system, we want to try to make as much progress as
quickly as possible to unblock tasks. We want space reclaim to be
effective, and to have a good chance at making progress, but not to
block arbitrarily as this leads to untenable syscall latencies, long
commits, and even hung task warnings.
I traced such hung tasks in async reclaim under heavy writeback and
observed that we were blocking for long periods of time in
shrink_delalloc(). This was particularly bad when doing writeback of
incompressible data with the compress-force mount option.
e.g.
dd if=/dev/urandom of=urandom.seed bs=1G count=1
dd if=urandom.seed of=urandom.big bs=1G count=300
shrink_delalloc() raises to_reclaim to at least delalloc_bytes >> 3. With
hundreds of GiB of delalloc (again, imagine a large dirty_ratio and lots
of RAM), this is still 10-20+ GiB. Particularly in the wait phases, this
can be quite slow, and it generates even more delayed_refs as mentioned
in the previous patch, so it doesn't even help that much with the
immediate space shortfall.
We do satisfy some tickets, but we ultimately keep the system in
essentially the same state, with long, stalling reclaim calls into
shrink_delalloc().
It would be much better to start a good chunk of I/O and also to work
through the new delayed_refs, keeping things moving through the system
while releasing the conservative, over-estimated metadata reservations.
To achieve this, tighten up the delalloc work to be in units of the
maximum extent size. If we issue 128MiB of delalloc at a time, we don't
leave much (any?) extent merging on the table, but we never block on
pathological 10GiB+ chunks of delalloc. If we do detect that we
satisfied a ticket, break out of shrink_delalloc() and run some of the
new delayed_refs as well before going again. This way we strike a nice
balance of making delalloc progress, but not at the cost of every other
sort of reservation, as they all feed into each other.
This means iterating over to_reclaim by 128MiB at a time until it is
drained or we satisfy a ticket, rather than trying 3 times to do the
whole thing.
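To put numbers on it (illustrative): with 300 GiB of dirty delalloc,
delalloc_bytes >> 3 is roughly 37.5 GiB, so a single pass of the old
loop could issue and wait on tens of GiB of writeback. With the cap,
each iteration flushes at most 128 MiB and then rechecks whether a
ticket was satisfied before deciding to flush more.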
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/space-info.c | 31 ++++++++++++++++++++-----------
1 file changed, 20 insertions(+), 11 deletions(-)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index f0436eea1544..e931deb3d013 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -725,9 +725,8 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
struct btrfs_trans_handle *trans;
u64 delalloc_bytes;
u64 ordered_bytes;
- u64 items;
long time_left;
- int loops;
+ u64 orig_tickets_id;
delalloc_bytes = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);
@@ -735,9 +734,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
return;
/* Calc the number of the pages we need flush for space reservation */
- if (to_reclaim == U64_MAX) {
- items = U64_MAX;
- } else {
+ if (to_reclaim != U64_MAX) {
/*
* to_reclaim is set to however much metadata we need to
* reclaim, but reclaiming that much data doesn't really track
@@ -751,7 +748,6 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
* aggressive.
*/
to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
- items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
}
trans = current->journal_info;
@@ -764,10 +760,14 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
if (ordered_bytes > delalloc_bytes && !for_preempt)
wait_ordered = true;
- loops = 0;
- while ((delalloc_bytes || ordered_bytes) && loops < 3) {
- u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
- long nr_pages = min_t(u64, temp, LONG_MAX);
+ spin_lock(&space_info->lock);
+ orig_tickets_id = space_info->tickets_id;
+ spin_unlock(&space_info->lock);
+
+ while ((delalloc_bytes || ordered_bytes) && to_reclaim) {
+ u64 iter_reclaim = min_t(u64, to_reclaim, SZ_128M);
+ long nr_pages = min_t(u64, delalloc_bytes, iter_reclaim) >> PAGE_SHIFT;
+ u64 items = calc_reclaim_items_nr(fs_info, iter_reclaim) * 2;
int async_pages;
btrfs_start_delalloc_roots(fs_info, nr_pages, true);
@@ -811,7 +811,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
atomic_read(&fs_info->async_delalloc_pages) <=
async_pages);
skip_async:
- loops++;
+ to_reclaim -= iter_reclaim;
if (wait_ordered && !trans) {
btrfs_wait_ordered_roots(fs_info, items, NULL);
} else {
@@ -834,6 +834,15 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
spin_unlock(&space_info->lock);
break;
}
+ /*
+ * If a ticket was satisfied since we started, break out
+ * so the async reclaim state machine can process delayed
+ * refs before we flush more delalloc.
+ */
+ if (space_info->tickets_id != orig_tickets_id) {
+ spin_unlock(&space_info->lock);
+ break;
+ }
spin_unlock(&space_info->lock);
delalloc_bytes = percpu_counter_sum_positive(
--
2.53.0
* Re: [PATCH v4 4/4] btrfs: cap shrink_delalloc iterations to 128M
2026-04-09 17:48 ` [PATCH v4 4/4] btrfs: cap shrink_delalloc iterations to 128M Boris Burkov
@ 2026-04-24 6:38 ` Qu Wenruo
2026-04-24 9:48 ` Sun YangKai
2026-04-24 10:07 ` Qu Wenruo
0 siblings, 2 replies; 18+ messages in thread
From: Qu Wenruo @ 2026-04-24 6:38 UTC (permalink / raw)
To: Boris Burkov, linux-btrfs, kernel-team
On 2026/4/10 03:18, Boris Burkov wrote:
[...]
>
> This means iterating over to_reclaim by 128MiB at a time until it is
> drained or we satisfy a ticket, rather than trying 3 times to do the
> whole thing.
>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> Signed-off-by: Boris Burkov <boris@bur.io>
Hi Boris,
I'm testing the latest for-next base as the baseline for the incoming
huge folio support.
On arm64 with 64K page size and 4K fs block size, I'm seeing very weird
behavior on generic/027.
On 7.0-rc7, the test case takes less than 5 seconds and passes as expected.
But on for-next it never finishes; furthermore, there is always a kworker
taking a full core, dead-looping inside
btrfs_async_reclaim_metadata_space(), and you cannot unmount the fs.
Here is the "echo l > /proc/sysrq-trigger" stack dump for the involved
btrfs kworker:
[ 6616.093728] CPU: 0 UID: 0 PID: 501715 Comm: kworker/u33:0 Not tainted
7.0.0-rc7-custom-64k+ #9 PREEMPT(full)
[ 6616.093732] Hardware name: QEMU KVM Virtual Machine, BIOS unknown
2/2/2022
[ 6616.093734] Workqueue: events_unbound
btrfs_async_reclaim_metadata_space [btrfs]
[ 6616.093849] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT -SSBS
BTYPE=--)
[ 6616.093852] pc : btrfs_start_delalloc_roots+0xf0/0x268 [btrfs]
[ 6616.093923] lr : btrfs_start_delalloc_roots+0x88/0x268 [btrfs]
[ 6616.093987] sp : ffff80008af0fbd0
[...]
[ 6616.094008] Call trace:
[ 6616.094009] btrfs_start_delalloc_roots+0xf0/0x268 [btrfs] (P)
[ 6616.094073] flush_space+0x3d4/0x6b0 [btrfs]
[ 6616.094138] do_async_reclaim_metadata_space+0x88/0x1d8 [btrfs]
[ 6616.094201] btrfs_async_reclaim_metadata_space+0x50/0x80 [btrfs]
[ 6616.094263] process_one_work+0x174/0x540
[ 6616.094277] worker_thread+0x1a0/0x318
[ 6616.094279] kthread+0x140/0x158
[ 6616.094285] ret_from_fork+0x10/0x20
So it's a regression, and bisection points to this patch.
I tried the following steps to further confirm it's caused by this
commit:
- The test passes just before the commit
The previous commit is "btrfs: make inode->outstanding_extents a u64".
- The test fails at that commit
The test case never finishes and one kworker dead-loops.
- The test passes on for-next with this commit reverted
The test case finishes in seconds as usual.
Do you have any clue about what's going wrong? I guess it's pretty hard to
hit on x86_64.
I have a local btrfs branch with huge folio support; with that it's
pretty easy to hit similar problems on x86_64, but without that branch,
no hit has been observed so far on x86_64.
Thanks,
Qu
> ---
> fs/btrfs/space-info.c | 31 ++++++++++++++++++++-----------
> 1 file changed, 20 insertions(+), 11 deletions(-)
>
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index f0436eea1544..e931deb3d013 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -725,9 +725,8 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
> struct btrfs_trans_handle *trans;
> u64 delalloc_bytes;
> u64 ordered_bytes;
> - u64 items;
> long time_left;
> - int loops;
> + u64 orig_tickets_id;
>
> delalloc_bytes = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
> ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);
> @@ -735,9 +734,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
> return;
>
> /* Calc the number of the pages we need flush for space reservation */
> - if (to_reclaim == U64_MAX) {
> - items = U64_MAX;
> - } else {
> + if (to_reclaim != U64_MAX) {
> /*
> * to_reclaim is set to however much metadata we need to
> * reclaim, but reclaiming that much data doesn't really track
> @@ -751,7 +748,6 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
> * aggressive.
> */
> to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
> - items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
> }
>
> trans = current->journal_info;
> @@ -764,10 +760,14 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
> if (ordered_bytes > delalloc_bytes && !for_preempt)
> wait_ordered = true;
>
> - loops = 0;
> - while ((delalloc_bytes || ordered_bytes) && loops < 3) {
> - u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
> - long nr_pages = min_t(u64, temp, LONG_MAX);
> + spin_lock(&space_info->lock);
> + orig_tickets_id = space_info->tickets_id;
> + spin_unlock(&space_info->lock);
> +
> + while ((delalloc_bytes || ordered_bytes) && to_reclaim) {
> + u64 iter_reclaim = min_t(u64, to_reclaim, SZ_128M);
> + long nr_pages = min_t(u64, delalloc_bytes, iter_reclaim) >> PAGE_SHIFT;
> + u64 items = calc_reclaim_items_nr(fs_info, iter_reclaim) * 2;
> int async_pages;
>
> btrfs_start_delalloc_roots(fs_info, nr_pages, true);
> @@ -811,7 +811,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
> atomic_read(&fs_info->async_delalloc_pages) <=
> async_pages);
> skip_async:
> - loops++;
> + to_reclaim -= iter_reclaim;
> if (wait_ordered && !trans) {
> btrfs_wait_ordered_roots(fs_info, items, NULL);
> } else {
> @@ -834,6 +834,15 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
> spin_unlock(&space_info->lock);
> break;
> }
> + /*
> + * If a ticket was satisfied since we started, break out
> + * so the async reclaim state machine can process delayed
> + * refs before we flush more delalloc.
> + */
> + if (space_info->tickets_id != orig_tickets_id) {
> + spin_unlock(&space_info->lock);
> + break;
> + }
> spin_unlock(&space_info->lock);
>
> delalloc_bytes = percpu_counter_sum_positive(
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 4/4] btrfs: cap shrink_delalloc iterations to 128M
2026-04-24 6:38 ` Qu Wenruo
@ 2026-04-24 9:48 ` Sun YangKai
2026-04-24 10:07 ` Qu Wenruo
1 sibling, 0 replies; 18+ messages in thread
From: Sun YangKai @ 2026-04-24 9:48 UTC (permalink / raw)
To: Qu Wenruo, Boris Burkov, linux-btrfs, kernel-team
On 2026/4/24 14:38, Qu Wenruo wrote:
>
>
> On 2026/4/10 03:18, Boris Burkov wrote:
> [...]
>>
>> This means iterating over to_reclaim by 128MiB at a time until it is
>> drained or we satisfy a ticket, rather than trying 3 times to do the
>> whole thing.
>>
>> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>> Signed-off-by: Boris Burkov <boris@bur.io>
>
> Hi Boris,
>
> I'm testing the latest for-next base as the baseline for the incoming
> huge folio support.
>
> On arm64 64K page size, 4K fs block size, I'm seeing a very weird
> behavior on generic/027.
> On 7.0-rc7, the test case takes less than 5 seconds and passes as expected.
>
> But on for-next it never finished, furthermore there is always a kworker
> taking a full core, deadlooping inside
> btrfs_async_reclaim_metadata_space(), and you can not unmount the fs.
>
> Here is the "echo l > /proc/sysrq-trigger" stack dump for the involved
> btrfs kworker:
>
> [ 6616.093728] CPU: 0 UID: 0 PID: 501715 Comm: kworker/u33:0 Not tainted
> 7.0.0-rc7-custom-64k+ #9 PREEMPT(full)
> [ 6616.093732] Hardware name: QEMU KVM Virtual Machine, BIOS unknown
> 2/2/2022
> [ 6616.093734] Workqueue: events_unbound
> btrfs_async_reclaim_metadata_space [btrfs]
> [ 6616.093849] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT -SSBS
> BTYPE=--)
> [ 6616.093852] pc : btrfs_start_delalloc_roots+0xf0/0x268 [btrfs]
> [ 6616.093923] lr : btrfs_start_delalloc_roots+0x88/0x268 [btrfs]
> [ 6616.093987] sp : ffff80008af0fbd0
> [...]
> [ 6616.094008] Call trace:
> [ 6616.094009] btrfs_start_delalloc_roots+0xf0/0x268 [btrfs] (P)
> [ 6616.094073] flush_space+0x3d4/0x6b0 [btrfs]
> [ 6616.094138] do_async_reclaim_metadata_space+0x88/0x1d8 [btrfs]
> [ 6616.094201] btrfs_async_reclaim_metadata_space+0x50/0x80 [btrfs]
> [ 6616.094263] process_one_work+0x174/0x540
> [ 6616.094277] worker_thread+0x1a0/0x318
> [ 6616.094279] kthread+0x140/0x158
> [ 6616.094285] ret_from_fork+0x10/0x20
>
> So it's a regression, and bisection points to this patch.
>
> And I tried the following steps to further confirm it's caused by this
> commit:
>
> - The test passes just before the commit
> The previous commit is "btrfs: make inode->outstanding_extents a u64".
>
> - The test failed at that commit
> The test case never finish and one kworker dead looping.
>
> - The test case pass at for-next with this commit reverted
> The test case finishes in seconds as usual.
>
> Do you have any clue on what's going wrong? I guess it's pretty hard to
> hit on x86_64.
>
> I have a local btrfs branch with huge folios support, with that it's
> pretty easy to hit similar problems on x86_64, but without that branch,
> no hit is observed so far on x86_64.
>
> Thanks,
> Qu
>
>> ---
>> fs/btrfs/space-info.c | 31 ++++++++++++++++++++-----------
>> 1 file changed, 20 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
>> index f0436eea1544..e931deb3d013 100644
>> --- a/fs/btrfs/space-info.c
>> +++ b/fs/btrfs/space-info.c
>> @@ -725,9 +725,8 @@ static void shrink_delalloc(struct
>> btrfs_space_info *space_info,
>> struct btrfs_trans_handle *trans;
>> u64 delalloc_bytes;
>> u64 ordered_bytes;
>> - u64 items;
>> long time_left;
>> - int loops;
>> + u64 orig_tickets_id;
>> delalloc_bytes = percpu_counter_sum_positive(&fs_info-
>> >delalloc_bytes);
>> ordered_bytes = percpu_counter_sum_positive(&fs_info-
>> >ordered_bytes);
>> @@ -735,9 +734,7 @@ static void shrink_delalloc(struct
>> btrfs_space_info *space_info,
>> return;
>> /* Calc the number of the pages we need flush for space
>> reservation */
>> - if (to_reclaim == U64_MAX) {
>> - items = U64_MAX;
>> - } else {
>> + if (to_reclaim != U64_MAX) {
>> /*
>> * to_reclaim is set to however much metadata we need to
>> * reclaim, but reclaiming that much data doesn't really track
>> @@ -751,7 +748,6 @@ static void shrink_delalloc(struct
>> btrfs_space_info *space_info,
>> * aggressive.
>> */
>> to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
>> - items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
>> }
>> trans = current->journal_info;
>> @@ -764,10 +760,14 @@ static void shrink_delalloc(struct
>> btrfs_space_info *space_info,
>> if (ordered_bytes > delalloc_bytes && !for_preempt)
>> wait_ordered = true;
>> - loops = 0;
>> - while ((delalloc_bytes || ordered_bytes) && loops < 3) {
>> - u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
>> - long nr_pages = min_t(u64, temp, LONG_MAX);
>> + spin_lock(&space_info->lock);
>> + orig_tickets_id = space_info->tickets_id;
>> + spin_unlock(&space_info->lock);
>> +
>> + while ((delalloc_bytes || ordered_bytes) && to_reclaim) {
>> + u64 iter_reclaim = min_t(u64, to_reclaim, SZ_128M);
>> + long nr_pages = min_t(u64, delalloc_bytes, iter_reclaim) >>
>> PAGE_SHIFT;
I wonder if it's possible that delalloc_bytes < 64K while to_reclaim ==
U64_MAX, so we get nr_pages == 0 with a 64K page size and loop for a
very long time (seemingly forever).
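As a quick illustration of that arithmetic, a standalone user-space sketch
(not kernel code; the 48K delalloc figure and the page shift are example
values picked for this scenario):

/*
 * Mimic the nr_pages computation from the new shrink_delalloc() loop
 * with a 64K page size.  All numbers below are made up for illustration.
 */
#include <stdint.h>
#include <stdio.h>

#define EX_PAGE_SHIFT 16                 /* 64K pages, as on this arm64 config */
#define EX_SZ_128M    (128ULL << 20)

int main(void)
{
	uint64_t to_reclaim = UINT64_MAX;     /* the "flush everything" case */
	uint64_t delalloc_bytes = 48 * 1024;  /* less than one 64K page */

	/* iter_reclaim = min(to_reclaim, SZ_128M) */
	uint64_t iter_reclaim = to_reclaim < EX_SZ_128M ? to_reclaim : EX_SZ_128M;
	/* nr_pages = min(delalloc_bytes, iter_reclaim) >> PAGE_SHIFT */
	uint64_t smaller = delalloc_bytes < iter_reclaim ? delalloc_bytes : iter_reclaim;
	long nr_pages = (long)(smaller >> EX_PAGE_SHIFT);

	/*
	 * Prints "nr_pages = 0": each pass asks writeback to flush zero
	 * pages, so delalloc_bytes never shrinks, while to_reclaim only
	 * drops by 128M per pass from U64_MAX -- effectively forever.
	 */
	printf("nr_pages = %ld\n", nr_pages);
	return 0;
}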
>> + u64 items = calc_reclaim_items_nr(fs_info, iter_reclaim) * 2;
>> int async_pages;
>> btrfs_start_delalloc_roots(fs_info, nr_pages, true);
>> @@ -811,7 +811,7 @@ static void shrink_delalloc(struct
>> btrfs_space_info *space_info,
>> atomic_read(&fs_info->async_delalloc_pages) <=
>> async_pages);
>> skip_async:
>> - loops++;
>> + to_reclaim -= iter_reclaim;
>> if (wait_ordered && !trans) {
>> btrfs_wait_ordered_roots(fs_info, items, NULL);
>> } else {
>> @@ -834,6 +834,15 @@ static void shrink_delalloc(struct
>> btrfs_space_info *space_info,
>> spin_unlock(&space_info->lock);
>> break;
>> }
>> + /*
>> + * If a ticket was satisfied since we started, break out
>> + * so the async reclaim state machine can process delayed
>> + * refs before we flush more delalloc.
>> + */
>> + if (space_info->tickets_id != orig_tickets_id) {
>> + spin_unlock(&space_info->lock);
>> + break;
>> + }
>> spin_unlock(&space_info->lock);
>> delalloc_bytes = percpu_counter_sum_positive(
>
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 4/4] btrfs: cap shrink_delalloc iterations to 128M
2026-04-24 6:38 ` Qu Wenruo
2026-04-24 9:48 ` Sun YangKai
@ 2026-04-24 10:07 ` Qu Wenruo
2026-04-24 15:26 ` Boris Burkov
2026-04-24 20:11 ` Boris Burkov
1 sibling, 2 replies; 18+ messages in thread
From: Qu Wenruo @ 2026-04-24 10:07 UTC (permalink / raw)
To: Boris Burkov, linux-btrfs, kernel-team
On 2026/4/24 16:08, Qu Wenruo wrote:
>
>
> On 2026/4/10 03:18, Boris Burkov wrote:
> [...]
>>
>> This means iterating over to_reclaim by 128MiB at a time until it is
>> drained or we satisfy a ticket, rather than trying 3 times to do the
>> whole thing.
>>
>> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>> Signed-off-by: Boris Burkov <boris@bur.io>
>
> Hi Boris,
>
> I'm testing the latest for-next base as the baseline for the incoming
> huge folio support.
>
> On arm64 64K page size, 4K fs block size, I'm seeing a very weird
> behavior on generic/027.
> On 7.0-rc7, the test case takes less than 5 seconds and passes as expected.
>
> But on for-next it never finished, furthermore there is always a kworker
> taking a full core, deadlooping inside
> btrfs_async_reclaim_metadata_space(), and you can not unmount the fs.
>
> Here is the "echo l > /proc/sysrq-trigger" stack dump for the involved
> btrfs kworker:
>
> [ 6616.093728] CPU: 0 UID: 0 PID: 501715 Comm: kworker/u33:0 Not tainted
> 7.0.0-rc7-custom-64k+ #9 PREEMPT(full)
> [ 6616.093732] Hardware name: QEMU KVM Virtual Machine, BIOS unknown
> 2/2/2022
> [ 6616.093734] Workqueue: events_unbound
> btrfs_async_reclaim_metadata_space [btrfs]
> [ 6616.093849] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT -SSBS
> BTYPE=--)
> [ 6616.093852] pc : btrfs_start_delalloc_roots+0xf0/0x268 [btrfs]
> [ 6616.093923] lr : btrfs_start_delalloc_roots+0x88/0x268 [btrfs]
> [ 6616.093987] sp : ffff80008af0fbd0
> [...]
> [ 6616.094008] Call trace:
> [ 6616.094009] btrfs_start_delalloc_roots+0xf0/0x268 [btrfs] (P)
> [ 6616.094073] flush_space+0x3d4/0x6b0 [btrfs]
> [ 6616.094138] do_async_reclaim_metadata_space+0x88/0x1d8 [btrfs]
> [ 6616.094201] btrfs_async_reclaim_metadata_space+0x50/0x80 [btrfs]
> [ 6616.094263] process_one_work+0x174/0x540
> [ 6616.094277] worker_thread+0x1a0/0x318
> [ 6616.094279] kthread+0x140/0x158
> [ 6616.094285] ret_from_fork+0x10/0x20
>
> So it's a regression, and bisection points to this patch.
>
> And I tried the following steps to further confirm it's caused by this
> commit:
>
> - The test passes just before the commit
> The previous commit is "btrfs: make inode->outstanding_extents a u64".
>
> - The test failed at that commit
> The test case never finish and one kworker dead looping.
>
> - The test case pass at for-next with this commit reverted
> The test case finishes in seconds as usual.
Furthermore, even with this particular patch *reverted*, I'm still
seeing generic/224 hitting the same problem.
Currently I'm testing at the commit before the whole series, which is
"btrfs: abort transaction in do_remap_reloc_trans() on failure", and see no
generic/224 hang nor 100% kworker CPU usage.
Thus I'm afraid the whole series may be involved.
Thanks,
Qu
>
> Do you have any clue on what's going wrong? I guess it's pretty hard to
> hit on x86_64.
>
> I have a local btrfs branch with huge folios support, with that it's
> pretty easy to hit similar problems on x86_64, but without that branch,
> no hit is observed so far on x86_64.
>
> Thanks,
> Qu
>
>> ---
>> fs/btrfs/space-info.c | 31 ++++++++++++++++++++-----------
>> 1 file changed, 20 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
>> index f0436eea1544..e931deb3d013 100644
>> --- a/fs/btrfs/space-info.c
>> +++ b/fs/btrfs/space-info.c
>> @@ -725,9 +725,8 @@ static void shrink_delalloc(struct
>> btrfs_space_info *space_info,
>> struct btrfs_trans_handle *trans;
>> u64 delalloc_bytes;
>> u64 ordered_bytes;
>> - u64 items;
>> long time_left;
>> - int loops;
>> + u64 orig_tickets_id;
>> delalloc_bytes = percpu_counter_sum_positive(&fs_info-
>> >delalloc_bytes);
>> ordered_bytes = percpu_counter_sum_positive(&fs_info-
>> >ordered_bytes);
>> @@ -735,9 +734,7 @@ static void shrink_delalloc(struct
>> btrfs_space_info *space_info,
>> return;
>> /* Calc the number of the pages we need flush for space
>> reservation */
>> - if (to_reclaim == U64_MAX) {
>> - items = U64_MAX;
>> - } else {
>> + if (to_reclaim != U64_MAX) {
>> /*
>> * to_reclaim is set to however much metadata we need to
>> * reclaim, but reclaiming that much data doesn't really track
>> @@ -751,7 +748,6 @@ static void shrink_delalloc(struct
>> btrfs_space_info *space_info,
>> * aggressive.
>> */
>> to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
>> - items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
>> }
>> trans = current->journal_info;
>> @@ -764,10 +760,14 @@ static void shrink_delalloc(struct
>> btrfs_space_info *space_info,
>> if (ordered_bytes > delalloc_bytes && !for_preempt)
>> wait_ordered = true;
>> - loops = 0;
>> - while ((delalloc_bytes || ordered_bytes) && loops < 3) {
>> - u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
>> - long nr_pages = min_t(u64, temp, LONG_MAX);
>> + spin_lock(&space_info->lock);
>> + orig_tickets_id = space_info->tickets_id;
>> + spin_unlock(&space_info->lock);
>> +
>> + while ((delalloc_bytes || ordered_bytes) && to_reclaim) {
>> + u64 iter_reclaim = min_t(u64, to_reclaim, SZ_128M);
>> + long nr_pages = min_t(u64, delalloc_bytes, iter_reclaim) >>
>> PAGE_SHIFT;
>> + u64 items = calc_reclaim_items_nr(fs_info, iter_reclaim) * 2;
>> int async_pages;
>> btrfs_start_delalloc_roots(fs_info, nr_pages, true);
>> @@ -811,7 +811,7 @@ static void shrink_delalloc(struct
>> btrfs_space_info *space_info,
>> atomic_read(&fs_info->async_delalloc_pages) <=
>> async_pages);
>> skip_async:
>> - loops++;
>> + to_reclaim -= iter_reclaim;
>> if (wait_ordered && !trans) {
>> btrfs_wait_ordered_roots(fs_info, items, NULL);
>> } else {
>> @@ -834,6 +834,15 @@ static void shrink_delalloc(struct
>> btrfs_space_info *space_info,
>> spin_unlock(&space_info->lock);
>> break;
>> }
>> + /*
>> + * If a ticket was satisfied since we started, break out
>> + * so the async reclaim state machine can process delayed
>> + * refs before we flush more delalloc.
>> + */
>> + if (space_info->tickets_id != orig_tickets_id) {
>> + spin_unlock(&space_info->lock);
>> + break;
>> + }
>> spin_unlock(&space_info->lock);
>> delalloc_bytes = percpu_counter_sum_positive(
>
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 4/4] btrfs: cap shrink_delalloc iterations to 128M
2026-04-24 10:07 ` Qu Wenruo
@ 2026-04-24 15:26 ` Boris Burkov
2026-04-24 20:11 ` Boris Burkov
1 sibling, 0 replies; 18+ messages in thread
From: Boris Burkov @ 2026-04-24 15:26 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs, kernel-team
On Fri, Apr 24, 2026 at 07:37:38PM +0930, Qu Wenruo wrote:
>
>
> On 2026/4/24 16:08, Qu Wenruo wrote:
> >
> >
> > On 2026/4/10 03:18, Boris Burkov wrote:
> > [...]
> > >
> > > This means iterating over to_reclaim by 128MiB at a time until it is
> > > drained or we satisfy a ticket, rather than trying 3 times to do the
> > > whole thing.
> > >
> > > Reviewed-by: Filipe Manana <fdmanana@suse.com>
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> >
> > Hi Boris,
> >
> > I'm testing the latest for-next base as the baseline for the incoming
> > huge folio support.
> >
> > On arm64 64K page size, 4K fs block size, I'm seeing a very weird
> > behavior on generic/027.
> > On 7.0-rc7, the test case takes less than 5 seconds and passes as expected.
> >
> > But on for-next it never finished, furthermore there is always a kworker
> > taking a full core, deadlooping inside
> > btrfs_async_reclaim_metadata_space(), and you can not unmount the fs.
> >
> > Here is the "echo l > /proc/sysrq-trigger" stack dump for the involved
> > btrfs kworker:
> >
> > [ 6616.093728] CPU: 0 UID: 0 PID: 501715 Comm: kworker/u33:0 Not tainted
> > 7.0.0-rc7-custom-64k+ #9 PREEMPT(full)
> > [ 6616.093732] Hardware name: QEMU KVM Virtual Machine, BIOS unknown
> > 2/2/2022
> > [ 6616.093734] Workqueue: events_unbound
> > btrfs_async_reclaim_metadata_space [btrfs]
> > [ 6616.093849] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT -SSBS
> > BTYPE=--)
> > [ 6616.093852] pc : btrfs_start_delalloc_roots+0xf0/0x268 [btrfs]
> > [ 6616.093923] lr : btrfs_start_delalloc_roots+0x88/0x268 [btrfs]
> > [ 6616.093987] sp : ffff80008af0fbd0
> > [...]
> > [ 6616.094008] Call trace:
> > [ 6616.094009] btrfs_start_delalloc_roots+0xf0/0x268 [btrfs] (P)
> > [ 6616.094073] flush_space+0x3d4/0x6b0 [btrfs]
> > [ 6616.094138] do_async_reclaim_metadata_space+0x88/0x1d8 [btrfs]
> > [ 6616.094201] btrfs_async_reclaim_metadata_space+0x50/0x80 [btrfs]
> > [ 6616.094263] process_one_work+0x174/0x540
> > [ 6616.094277] worker_thread+0x1a0/0x318
> > [ 6616.094279] kthread+0x140/0x158
> > [ 6616.094285] ret_from_fork+0x10/0x20
> >
> > So it's a regression, and bisection points to this patch.
> >
> > And I tried the following steps to further confirm it's caused by this
> > commit:
> >
> > - The test passes just before the commit
> > The previous commit is "btrfs: make inode->outstanding_extents a u64".
> >
> > - The test failed at that commit
> > The test case never finish and one kworker dead looping.
> >
> > - The test case pass at for-next with this commit reverted
> > The test case finishes in seconds as usual.
>
> Furthermore, even with this particular patch *reverted*, I'm still seeing
> generic/224 hitting the same problem.
>
> Currently I'm testing at the commit before the whole series, which is
> "btrfs: abort transaction in do_remap_reloc_trans() on failure", and no
> generic/224 hang nor 100% kworker CPU usage.
>
> Thus I'm afraid the whole series may be involved.
>
> Thanks,
> Qu
>
Thank you very much for the thorough debugging, and sorry for the
disruption. I suspect Sun's idea points to at least one bug in the async
reclaim pacing patch. I think I need to move away from the infinite loop
entirely, or at least make sure it has other escape hatches.
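Something like the following could be one such escape hatch, as a rough,
untested sketch on top of this patch (the pass limit of 16 is an arbitrary
number chosen only to show the shape, and the loop body is elided):

	int passes = 0;

	while ((delalloc_bytes || ordered_bytes) && to_reclaim &&
	       passes < 16) {
		u64 iter_reclaim = min_t(u64, to_reclaim, SZ_128M);
		long nr_pages = min_t(u64, delalloc_bytes, iter_reclaim) >> PAGE_SHIFT;

		/* ...flush delalloc and wait on ordered extents as the patch does today... */

		to_reclaim -= iter_reclaim;
		passes++;
	}

That would keep the 128M chunking but guarantee shrink_delalloc() gives up
after a bounded number of passes, even if to_reclaim never drains.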
However, I am really struggling to conceptually see how changing the
reservation sizing at delalloc or fussing with the types of
outstanding_extents would make us spin forever in async metadata
reclaim. Are you 100% sure that the hang reproduces with this series
minus 'btrfs: cap shrink_delalloc iterations to 128M'?
I will try to hunt down an arm machine and reproduce this myself, and keep
thinking about what could be wrong with the new delalloc rsv numbers that
makes async reclaim loop forever.
Thanks again,
Boris
> >
> > Do you have any clue on what's going wrong? I guess it's pretty hard to
> > hit on x86_64.
> >
> > I have a local btrfs branch with huge folios support, with that it's
> > pretty easy to hit similar problems on x86_64, but without that branch,
> > no hit is observed so far on x86_64.
> >
> > Thanks,
> > Qu
> >
> > > ---
> > > fs/btrfs/space-info.c | 31 ++++++++++++++++++++-----------
> > > 1 file changed, 20 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> > > index f0436eea1544..e931deb3d013 100644
> > > --- a/fs/btrfs/space-info.c
> > > +++ b/fs/btrfs/space-info.c
> > > @@ -725,9 +725,8 @@ static void shrink_delalloc(struct
> > > btrfs_space_info *space_info,
> > > struct btrfs_trans_handle *trans;
> > > u64 delalloc_bytes;
> > > u64 ordered_bytes;
> > > - u64 items;
> > > long time_left;
> > > - int loops;
> > > + u64 orig_tickets_id;
> > > delalloc_bytes = percpu_counter_sum_positive(&fs_info-
> > > >delalloc_bytes);
> > > ordered_bytes = percpu_counter_sum_positive(&fs_info-
> > > >ordered_bytes);
> > > @@ -735,9 +734,7 @@ static void shrink_delalloc(struct
> > > btrfs_space_info *space_info,
> > > return;
> > > /* Calc the number of the pages we need flush for space
> > > reservation */
> > > - if (to_reclaim == U64_MAX) {
> > > - items = U64_MAX;
> > > - } else {
> > > + if (to_reclaim != U64_MAX) {
> > > /*
> > > * to_reclaim is set to however much metadata we need to
> > > * reclaim, but reclaiming that much data doesn't really track
> > > @@ -751,7 +748,6 @@ static void shrink_delalloc(struct
> > > btrfs_space_info *space_info,
> > > * aggressive.
> > > */
> > > to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
> > > - items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
> > > }
> > > trans = current->journal_info;
> > > @@ -764,10 +760,14 @@ static void shrink_delalloc(struct
> > > btrfs_space_info *space_info,
> > > if (ordered_bytes > delalloc_bytes && !for_preempt)
> > > wait_ordered = true;
> > > - loops = 0;
> > > - while ((delalloc_bytes || ordered_bytes) && loops < 3) {
> > > - u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
> > > - long nr_pages = min_t(u64, temp, LONG_MAX);
> > > + spin_lock(&space_info->lock);
> > > + orig_tickets_id = space_info->tickets_id;
> > > + spin_unlock(&space_info->lock);
> > > +
> > > + while ((delalloc_bytes || ordered_bytes) && to_reclaim) {
> > > + u64 iter_reclaim = min_t(u64, to_reclaim, SZ_128M);
> > > + long nr_pages = min_t(u64, delalloc_bytes, iter_reclaim) >>
> > > PAGE_SHIFT;
> > > + u64 items = calc_reclaim_items_nr(fs_info, iter_reclaim) * 2;
> > > int async_pages;
> > > btrfs_start_delalloc_roots(fs_info, nr_pages, true);
> > > @@ -811,7 +811,7 @@ static void shrink_delalloc(struct
> > > btrfs_space_info *space_info,
> > > atomic_read(&fs_info->async_delalloc_pages) <=
> > > async_pages);
> > > skip_async:
> > > - loops++;
> > > + to_reclaim -= iter_reclaim;
> > > if (wait_ordered && !trans) {
> > > btrfs_wait_ordered_roots(fs_info, items, NULL);
> > > } else {
> > > @@ -834,6 +834,15 @@ static void shrink_delalloc(struct
> > > btrfs_space_info *space_info,
> > > spin_unlock(&space_info->lock);
> > > break;
> > > }
> > > + /*
> > > + * If a ticket was satisfied since we started, break out
> > > + * so the async reclaim state machine can process delayed
> > > + * refs before we flush more delalloc.
> > > + */
> > > + if (space_info->tickets_id != orig_tickets_id) {
> > > + spin_unlock(&space_info->lock);
> > > + break;
> > > + }
> > > spin_unlock(&space_info->lock);
> > > delalloc_bytes = percpu_counter_sum_positive(
> >
> >
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 4/4] btrfs: cap shrink_delalloc iterations to 128M
2026-04-24 10:07 ` Qu Wenruo
2026-04-24 15:26 ` Boris Burkov
@ 2026-04-24 20:11 ` Boris Burkov
2026-04-24 22:06 ` Qu Wenruo
1 sibling, 1 reply; 18+ messages in thread
From: Boris Burkov @ 2026-04-24 20:11 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs, kernel-team
On Fri, Apr 24, 2026 at 07:37:38PM +0930, Qu Wenruo wrote:
>
>
> On 2026/4/24 16:08, Qu Wenruo wrote:
> >
> >
> > On 2026/4/10 03:18, Boris Burkov wrote:
> > [...]
> > >
> > > This means iterating over to_reclaim by 128MiB at a time until it is
> > > drained or we satisfy a ticket, rather than trying 3 times to do the
> > > whole thing.
> > >
> > > Reviewed-by: Filipe Manana <fdmanana@suse.com>
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> >
> > Hi Boris,
> >
> > I'm testing the latest for-next base as the baseline for the incoming
> > huge folio support.
> >
> > On arm64 64K page size, 4K fs block size, I'm seeing a very weird
> > behavior on generic/027.
> > On 7.0-rc7, the test case takes less than 5 seconds and passes as expected.
> >
> > But on for-next it never finished, furthermore there is always a kworker
> > taking a full core, deadlooping inside
> > btrfs_async_reclaim_metadata_space(), and you can not unmount the fs.
> >
> > Here is the "echo l > /proc/sysrq-trigger" stack dump for the involved
> > btrfs kworker:
> >
> > [ 6616.093728] CPU: 0 UID: 0 PID: 501715 Comm: kworker/u33:0 Not tainted
> > 7.0.0-rc7-custom-64k+ #9 PREEMPT(full)
> > [ 6616.093732] Hardware name: QEMU KVM Virtual Machine, BIOS unknown
> > 2/2/2022
> > [ 6616.093734] Workqueue: events_unbound
> > btrfs_async_reclaim_metadata_space [btrfs]
> > [ 6616.093849] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT -SSBS
> > BTYPE=--)
> > [ 6616.093852] pc : btrfs_start_delalloc_roots+0xf0/0x268 [btrfs]
> > [ 6616.093923] lr : btrfs_start_delalloc_roots+0x88/0x268 [btrfs]
> > [ 6616.093987] sp : ffff80008af0fbd0
> > [...]
> > [ 6616.094008] Call trace:
> > [ 6616.094009] btrfs_start_delalloc_roots+0xf0/0x268 [btrfs] (P)
> > [ 6616.094073] flush_space+0x3d4/0x6b0 [btrfs]
> > [ 6616.094138] do_async_reclaim_metadata_space+0x88/0x1d8 [btrfs]
> > [ 6616.094201] btrfs_async_reclaim_metadata_space+0x50/0x80 [btrfs]
> > [ 6616.094263] process_one_work+0x174/0x540
> > [ 6616.094277] worker_thread+0x1a0/0x318
> > [ 6616.094279] kthread+0x140/0x158
> > [ 6616.094285] ret_from_fork+0x10/0x20
> >
> > So it's a regression, and bisection points to this patch.
> >
> > And I tried the following steps to further confirm it's caused by this
> > commit:
> >
> > - The test passes just before the commit
> > The previous commit is "btrfs: make inode->outstanding_extents a u64".
> >
> > - The test failed at that commit
> > The test case never finish and one kworker dead looping.
> >
> > - The test case pass at for-next with this commit reverted
> > The test case finishes in seconds as usual.
>
> Furthermore, even with this particular patch *reverted*, I'm still seeing
> generic/224 hitting the same problem.
>
> Currently I'm testing at the commit before the whole series, which is
> "btrfs: abort transaction in do_remap_reloc_trans() on failure", and no
> generic/224 hang nor 100% kworker CPU usage.
>
> Thus I'm afraid the whole series may be involved.
>
> Thanks,
> Qu
>
Now that I have had a good chance to try and repro, here is what I have
seen so far on my desktop x86 machine and a cloud arm machine.
x86:
a41c84ba2f51 ("btrfs: abort transaction in do_remap_reloc_trans() on failure")
consistently done in 1 second
8099a837f487 ("btrfs: cap shrink_delalloc iterations to 128M")
finishes, but in ~500s
ea60045d9b1b ("btrfs: reserve space for delayed_refs in delalloc")
finishes, but in ~500s
arm:
a41c84ba2f51 ("btrfs: abort transaction in do_remap_reloc_trans() on failure")
consistently done in ~300 seconds
ea60045d9b1b ("btrfs: reserve space for delayed_refs in delalloc")
done in ~600s
The two inconsistencies are that I didn't see it go fast on g/027 with just
the shrink_delalloc iterations patch reverted, and I don't have a 2
second baseline on my arm setup.
So I agree that this patch series effectively breaks those tests, on x86
as well. I didn't notice the change in runtime, unfortunately, as I only
looked for success/failure.
As to the cause:
Both g/027 and g/224 are explicitly testing lots of writes to a small
filesystem.
I suspect that what is happening is what Filipe warned about:
excessive space reclaim/pinning reclaim/etc. choking the workload
due to excessive reservation. I have played around with reducing the
reservation sizes in various ways (setting it back to 0, setting the level
estimate to 4 as a test, etc.) and the result varies from back to full
speed to a 60s run. So in my setup, at least, it looks like the
performance of g/027 is very sensitive to how much we reserve.
Would you be willing to let it run for 5-10m to see if you also
reproduce this behavior?
I will try to instrument the reservations and reclaim codepaths and see
if I can think of a nice fix to reserve "enough but not too much".
I can also try to attack the "stuck big fs under big reclaim" more
directly by trying to make reclaim less stuck-prone, rather than messing
with reservations. Though it would be quite disappointing if we
practically cannot make the reservation choices more accurate..
Thanks,
Boris
> >
> > Do you have any clue on what's going wrong? I guess it's pretty hard to
> > hit on x86_64.
> >
> > I have a local btrfs branch with huge folios support, with that it's
> > pretty easy to hit similar problems on x86_64, but without that branch,
> > no hit is observed so far on x86_64.
> >
> > Thanks,
> > Qu
> >
> > > ---
> > > fs/btrfs/space-info.c | 31 ++++++++++++++++++++-----------
> > > 1 file changed, 20 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> > > index f0436eea1544..e931deb3d013 100644
> > > --- a/fs/btrfs/space-info.c
> > > +++ b/fs/btrfs/space-info.c
> > > @@ -725,9 +725,8 @@ static void shrink_delalloc(struct
> > > btrfs_space_info *space_info,
> > > struct btrfs_trans_handle *trans;
> > > u64 delalloc_bytes;
> > > u64 ordered_bytes;
> > > - u64 items;
> > > long time_left;
> > > - int loops;
> > > + u64 orig_tickets_id;
> > > delalloc_bytes = percpu_counter_sum_positive(&fs_info-
> > > >delalloc_bytes);
> > > ordered_bytes = percpu_counter_sum_positive(&fs_info-
> > > >ordered_bytes);
> > > @@ -735,9 +734,7 @@ static void shrink_delalloc(struct
> > > btrfs_space_info *space_info,
> > > return;
> > > /* Calc the number of the pages we need flush for space
> > > reservation */
> > > - if (to_reclaim == U64_MAX) {
> > > - items = U64_MAX;
> > > - } else {
> > > + if (to_reclaim != U64_MAX) {
> > > /*
> > > * to_reclaim is set to however much metadata we need to
> > > * reclaim, but reclaiming that much data doesn't really track
> > > @@ -751,7 +748,6 @@ static void shrink_delalloc(struct
> > > btrfs_space_info *space_info,
> > > * aggressive.
> > > */
> > > to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
> > > - items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
> > > }
> > > trans = current->journal_info;
> > > @@ -764,10 +760,14 @@ static void shrink_delalloc(struct
> > > btrfs_space_info *space_info,
> > > if (ordered_bytes > delalloc_bytes && !for_preempt)
> > > wait_ordered = true;
> > > - loops = 0;
> > > - while ((delalloc_bytes || ordered_bytes) && loops < 3) {
> > > - u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
> > > - long nr_pages = min_t(u64, temp, LONG_MAX);
> > > + spin_lock(&space_info->lock);
> > > + orig_tickets_id = space_info->tickets_id;
> > > + spin_unlock(&space_info->lock);
> > > +
> > > + while ((delalloc_bytes || ordered_bytes) && to_reclaim) {
> > > + u64 iter_reclaim = min_t(u64, to_reclaim, SZ_128M);
> > > + long nr_pages = min_t(u64, delalloc_bytes, iter_reclaim) >>
> > > PAGE_SHIFT;
> > > + u64 items = calc_reclaim_items_nr(fs_info, iter_reclaim) * 2;
> > > int async_pages;
> > > btrfs_start_delalloc_roots(fs_info, nr_pages, true);
> > > @@ -811,7 +811,7 @@ static void shrink_delalloc(struct
> > > btrfs_space_info *space_info,
> > > atomic_read(&fs_info->async_delalloc_pages) <=
> > > async_pages);
> > > skip_async:
> > > - loops++;
> > > + to_reclaim -= iter_reclaim;
> > > if (wait_ordered && !trans) {
> > > btrfs_wait_ordered_roots(fs_info, items, NULL);
> > > } else {
> > > @@ -834,6 +834,15 @@ static void shrink_delalloc(struct
> > > btrfs_space_info *space_info,
> > > spin_unlock(&space_info->lock);
> > > break;
> > > }
> > > + /*
> > > + * If a ticket was satisfied since we started, break out
> > > + * so the async reclaim state machine can process delayed
> > > + * refs before we flush more delalloc.
> > > + */
> > > + if (space_info->tickets_id != orig_tickets_id) {
> > > + spin_unlock(&space_info->lock);
> > > + break;
> > > + }
> > > spin_unlock(&space_info->lock);
> > > delalloc_bytes = percpu_counter_sum_positive(
> >
> >
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 4/4] btrfs: cap shrink_delalloc iterations to 128M
2026-04-24 20:11 ` Boris Burkov
@ 2026-04-24 22:06 ` Qu Wenruo
2026-04-24 22:10 ` Boris Burkov
0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2026-04-24 22:06 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs, kernel-team
On 2026/4/25 05:41, Boris Burkov wrote:
> On Fri, Apr 24, 2026 at 07:37:38PM +0930, Qu Wenruo wrote:
[...]
>>
>> Furthermore, even with this particular patch *reverted*, I'm still seeing
>> generic/224 hitting the same problem.
>>
>> Currently I'm testing at the commit before the whole series, which is
>> "btrfs: abort transaction in do_remap_reloc_trans() on failure", and no
>> generic/224 hang nor 100% kworker CPU usage.
>>
>> Thus I'm afraid the whole series may be involved.
Sorry, at least on my arm64 machine, the first 3 patches are not the
root cause.
In fact on v7.0-rc7, I can still hit the generic/224 hang, i.e. the kernel
detects a 120s timeout for hung processes, and my VM is configured to
reset after such a detection.
I'm going to slightly loosen the hung task detection time (120s->150s)
and check if it's just too slow in this particular case.
The last patch is still causing excessive CPU usage here though, and
very reliably.
>>
>> Thanks,
>> Qu
>>
>
> Now that I have had a good chance to try and repro, here is what I have
> seen so far on my desktop x86 machine and a cloud arm machine.
>
> x86:
> a41c84ba2f51 ("btrfs: abort transaction in do_remap_reloc_trans() on failure")
> consistently done in 1 second
> 8099a837f487 ("btrfs: cap shrink_delalloc iterations to 128M")
> finishes, but in ~500s
> ea60045d9b1b ("btrfs: reserve space for delayed_refs in delalloc")
> finishes, but in ~500s
>
> arm:
> a41c84ba2f51 ("btrfs: abort transaction in do_remap_reloc_trans() on failure")
> consistently done in ~300 seconds
> ea60045d9b1b ("btrfs: reserve space for delayed_refs in delalloc")
> done in ~600s
>
> The two inconsistencies are that I didn't see it go fast on g/027 with just
> the shrink_delalloc iterations patch reverted, and I don't have a 2
> second baseline on my arm setup.
At least we got something that both of us can reproduce.
Another thing is, for g/027 on arm64 I'm also actively monitoring the
CPU usage through top.
Have you experienced very high (~100%) CPU usage on a kworker during g/027?
That's the most reliable symptom on my arm64 systems, and that's the
criterion I used to bisect, as it takes less than 5 seconds to determine
if it's good or not.
>
> So I agree that this patch series effectively breaks those tests, on x86
> as well. I didn't notice the change in runtime, unfortunately, as I only
> looked for success/failure.
>
> As to the cause:
> Both g/027 and g/224 are explicitly testing lots of writes to a small
> filesystem.
>
> I suspect that what is happening is what Filipe warned about with
> excessive space reclaim/pinning reclaim/etc. choking the workload
> due to excessive reservation. I have played around with reducing the
> reservation sizes in various ways (set it back to 0, set the level
> estimate to 4 as test, etc.) and the result varies from back to full
> speed or a 60s run. So in my setup, at least, it looks like the
> performance of g/027 is very sensitive to how much we reserve.
At least to me, the biggest problem is the 100% CPU usage of the
kworker, which indicates a pretty bad dead loop.
>
> Would you be willing to let it run for 5-10m to see if you also
> reproduce this behavior?
Unfortunately it didn't even finish after 15m here.
Here is the dmesg with timestamps; the call trace is triggered by
"echo l > /proc/sysrq-trigger".
[ 30.140269] run fstests generic/027 at 2026-04-25 07:19:32
[ 30.392655] BTRFS: device fsid 85ba0f7c-dfed-4220-9d47-72b07a1c81d8
devid 1 transid 8 /dev/mapper/test-scratch1 (253:2) scanned by mount (1108)
[ 30.395605] BTRFS info (device dm-2): first mount of filesystem
85ba0f7c-dfed-4220-9d47-72b07a1c81d8
[ 30.395625] BTRFS info (device dm-2): using crc32c checksum algorithm
[ 30.398590] BTRFS info (device dm-2): checking UUID tree
[ 30.398734] BTRFS info (device dm-2): turning on async discard
[ 30.398737] BTRFS info (device dm-2): enabling free space tree
[ 33.294754] systemd-journald[360]: Time jumped backwards, rotating.
[ 993.736548] sysrq: Show backtrace of all active CPUs
[ 993.736581] NMI backtrace for cpu 0
[ 993.736608] CPU: 0 UID: 0 PID: 2410 Comm: bash Not tainted
7.0.0-rc7-custom-64k+ #10 PREEMPT(full)
[ 993.736613] Hardware name: QEMU KVM Virtual Machine, BIOS unknown
2/2/2022
[ 993.736616] Call trace:
[ 993.736618] show_stack+0x20/0x38 (C)
[ 993.736635] dump_stack_lvl+0x60/0x80
[ 993.736646] dump_stack+0x18/0x24
[ 993.736649] nmi_cpu_backtrace+0xf0/0x128
[ 993.736665] nmi_trigger_cpumask_backtrace+0x1c4/0x1f8
[ 993.736668] arch_trigger_cpumask_backtrace+0x20/0x40
[ 993.736675] sysrq_handle_showallcpus+0x24/0x38
[ 993.736686] __handle_sysrq+0x9c/0x1b8
[ 993.736689] write_sysrq_trigger+0xcc/0x100
[ 993.736692] proc_reg_write+0x7c/0xf0
[ 993.736701] vfs_write+0xd8/0x3a8
[ 993.736716] ksys_write+0x70/0x120
[ 993.736719] __arm64_sys_write+0x20/0x40
[ 993.736722] invoke_syscall.constprop.0+0x64/0xe8
[ 993.736726] el0_svc_common.constprop.0+0x40/0xe8
[ 993.736728] do_el0_svc+0x24/0x38
[ 993.736730] el0_svc+0x3c/0x198
[ 993.736733] el0t_64_sync_handler+0xa0/0xe8
[ 993.736735] el0t_64_sync+0x198/0x1a0
[ 993.736755] Sending NMI from CPU 0 to CPUs 1-7:
[ 993.736769] NMI backtrace for cpu 3
[ 993.736777] CPU: 3 UID: 0 PID: 212 Comm: kworker/u38:2 Not tainted
7.0.0-rc7-custom-64k+ #10 PREEMPT(full)
[ 993.736780] Hardware name: QEMU KVM Virtual Machine, BIOS unknown
2/2/2022
[ 993.736782] Workqueue: events_unbound
btrfs_async_reclaim_metadata_space [btrfs]
[ 993.736879] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT -SSBS
BTYPE=--)
[ 993.736882] pc : _raw_spin_unlock_irqrestore+0x10/0x60
[ 993.736899] lr : __percpu_counter_sum+0x94/0xc0
[ 993.736909] sp : ffff8000834cfc10
[ 993.736910] x29: ffff8000834cfc10 x28: 0000000000000400 x27:
0000000008000000
[ 993.736912] x26: ffff0000ccfef81c x25: 0000000000000000 x24:
ffffb6b45621ef98
[ 993.736914] x23: ffff0000d2e48698 x22: ffffb6b45621a000 x21:
ffffb6b456219080
[ 993.736916] x20: ffffb6b456219288 x19: 0000000000009000 x18:
ffff494da90b0000
[ 993.736917] x17: 0000000000000000 x16: ffffb6b45524e920 x15:
ffffb6b45621ef98
[ 993.736919] x14: ffffb6b456111740 x13: 0000000000000180 x12:
ffff0001ff1c1740
[ 993.736921] x11: 00000000000000c0 x10: 4eb904daffc7d416 x9 :
ffffb6b45524e9b4
[ 993.736922] x8 : ffff8000834cfab0 x7 : 0000000000000000 x6 :
ffffffffffffffff
[ 993.736924] x5 : 0000000000000000 x4 : 0000000000000000 x3 :
0000000000000008
[ 993.736926] x2 : 0000000000000008 x1 : 0000000000000000 x0 :
ffff0000d2e48698
[ 993.736928] Call trace:
[ 993.736929] _raw_spin_unlock_irqrestore+0x10/0x60 (P)
[ 993.736932] flush_space+0x45c/0x6b0 [btrfs]
[ 993.737001] do_async_reclaim_metadata_space+0x88/0x1d8 [btrfs]
[ 993.737064] btrfs_async_reclaim_metadata_space+0x50/0x80 [btrfs]
[ 993.737126] process_one_work+0x174/0x540
[ 993.737138] worker_thread+0x1a0/0x318
[ 993.737140] kthread+0x140/0x158
[ 993.737145] ret_from_fork+0x10/0x20
[ 993.737156] NMI backtrace for cpu 4
>
> I will try to instrument the reservations and reclaim codepaths and see
> if I can think of a nice fix to reserve "enough but not too much".
>
> I can also try to attack the "stuck big fs under big reclaim" more
> directly by trying to make reclaim less stuck-prone, rather than messing
> with reservations. Though it would be quite disappointing if we
> practically cannot make the reservation choices more accurate..
Totally understandable, ENOSPC in btrfs is always the biggest challenge,
and the trade-offs are always hard to balance.
Meanwhile I'd prefer to have the last commit reverted so that we can
continue our regular testing.
Thanks,
Qu
>
> Thanks,
> Boris
>
>>>
>>> Do you have any clue on what's going wrong? I guess it's pretty hard to
>>> hit on x86_64.
>>>
>>> I have a local btrfs branch with huge folios support, with that it's
>>> pretty easy to hit similar problems on x86_64, but without that branch,
>>> no hit is observed so far on x86_64.
>>>
>>> Thanks,
>>> Qu
>>>
>>>> ---
>>>> fs/btrfs/space-info.c | 31 ++++++++++++++++++++-----------
>>>> 1 file changed, 20 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
>>>> index f0436eea1544..e931deb3d013 100644
>>>> --- a/fs/btrfs/space-info.c
>>>> +++ b/fs/btrfs/space-info.c
>>>> @@ -725,9 +725,8 @@ static void shrink_delalloc(struct
>>>> btrfs_space_info *space_info,
>>>> struct btrfs_trans_handle *trans;
>>>> u64 delalloc_bytes;
>>>> u64 ordered_bytes;
>>>> - u64 items;
>>>> long time_left;
>>>> - int loops;
>>>> + u64 orig_tickets_id;
>>>> delalloc_bytes = percpu_counter_sum_positive(&fs_info-
>>>>> delalloc_bytes);
>>>> ordered_bytes = percpu_counter_sum_positive(&fs_info-
>>>>> ordered_bytes);
>>>> @@ -735,9 +734,7 @@ static void shrink_delalloc(struct
>>>> btrfs_space_info *space_info,
>>>> return;
>>>> /* Calc the number of the pages we need flush for space
>>>> reservation */
>>>> - if (to_reclaim == U64_MAX) {
>>>> - items = U64_MAX;
>>>> - } else {
>>>> + if (to_reclaim != U64_MAX) {
>>>> /*
>>>> * to_reclaim is set to however much metadata we need to
>>>> * reclaim, but reclaiming that much data doesn't really track
>>>> @@ -751,7 +748,6 @@ static void shrink_delalloc(struct
>>>> btrfs_space_info *space_info,
>>>> * aggressive.
>>>> */
>>>> to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
>>>> - items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
>>>> }
>>>> trans = current->journal_info;
>>>> @@ -764,10 +760,14 @@ static void shrink_delalloc(struct
>>>> btrfs_space_info *space_info,
>>>> if (ordered_bytes > delalloc_bytes && !for_preempt)
>>>> wait_ordered = true;
>>>> - loops = 0;
>>>> - while ((delalloc_bytes || ordered_bytes) && loops < 3) {
>>>> - u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
>>>> - long nr_pages = min_t(u64, temp, LONG_MAX);
>>>> + spin_lock(&space_info->lock);
>>>> + orig_tickets_id = space_info->tickets_id;
>>>> + spin_unlock(&space_info->lock);
>>>> +
>>>> + while ((delalloc_bytes || ordered_bytes) && to_reclaim) {
>>>> + u64 iter_reclaim = min_t(u64, to_reclaim, SZ_128M);
>>>> + long nr_pages = min_t(u64, delalloc_bytes, iter_reclaim) >>
>>>> PAGE_SHIFT;
>>>> + u64 items = calc_reclaim_items_nr(fs_info, iter_reclaim) * 2;
>>>> int async_pages;
>>>> btrfs_start_delalloc_roots(fs_info, nr_pages, true);
>>>> @@ -811,7 +811,7 @@ static void shrink_delalloc(struct
>>>> btrfs_space_info *space_info,
>>>> atomic_read(&fs_info->async_delalloc_pages) <=
>>>> async_pages);
>>>> skip_async:
>>>> - loops++;
>>>> + to_reclaim -= iter_reclaim;
>>>> if (wait_ordered && !trans) {
>>>> btrfs_wait_ordered_roots(fs_info, items, NULL);
>>>> } else {
>>>> @@ -834,6 +834,15 @@ static void shrink_delalloc(struct
>>>> btrfs_space_info *space_info,
>>>> spin_unlock(&space_info->lock);
>>>> break;
>>>> }
>>>> + /*
>>>> + * If a ticket was satisfied since we started, break out
>>>> + * so the async reclaim state machine can process delayed
>>>> + * refs before we flush more delalloc.
>>>> + */
>>>> + if (space_info->tickets_id != orig_tickets_id) {
>>>> + spin_unlock(&space_info->lock);
>>>> + break;
>>>> + }
>>>> spin_unlock(&space_info->lock);
>>>> delalloc_bytes = percpu_counter_sum_positive(
>>>
>>>
>>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 4/4] btrfs: cap shrink_delalloc iterations to 128M
2026-04-24 22:06 ` Qu Wenruo
@ 2026-04-24 22:10 ` Boris Burkov
2026-04-24 22:21 ` Qu Wenruo
0 siblings, 1 reply; 18+ messages in thread
From: Boris Burkov @ 2026-04-24 22:10 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs, kernel-team
On Sat, Apr 25, 2026 at 07:36:49AM +0930, Qu Wenruo wrote:
>
>
> On 2026/4/25 05:41, Boris Burkov wrote:
> > On Fri, Apr 24, 2026 at 07:37:38PM +0930, Qu Wenruo wrote:
> [...]
> > >
> > > Furthermore, even with this particular patch *reverted*, I'm still seeing
> > > generic/224 hitting the same problem.
> > >
> > > Currently I'm testing at the commit before the whole series, which is
> > > "btrfs: abort transaction in do_remap_reloc_trans() on failure", and no
> > > generic/224 hang nor 100% kworker CPU usage.
> > >
> > > Thus I'm afraid the whole series may be involved.
>
> Sorry, at least on my arm64 machine, the first 3 patches are not the root
> problem.
>
> In fact on v7.0-rc7, I can still hit generic/224 hang, aka kernel detects
> 120s time out for hung processes, and my VM is configured to reset after
> such detection.
>
> I'm going to slightly loose the hung task detection time (120s->150s) and
> check if it's just too slow in this particular case.
>
>
> Although the last patch is still causing excessive CPU usage here, and very
> reliably.
>
> > >
> > > Thanks,
> > > Qu
> > >
> >
> > Now that I have had a good chance to try and repro, here is what I have
> > seen so far on my desktop x86 machine and a cloud arm machine.
> >
> > x86:
> > a41c84ba2f51 ("btrfs: abort transaction in do_remap_reloc_trans() on failure")
> > consistently done in 1 second
> > 8099a837f487 ("btrfs: cap shrink_delalloc iterations to 128M")
> > finishes, but in ~500s
> > ea60045d9b1b ("btrfs: reserve space for delayed_refs in delalloc")
> > finishes, but in ~500s
> >
> > arm:
> > a41c84ba2f51 ("btrfs: abort transaction in do_remap_reloc_trans() on failure")
> > consistently done in ~300 seconds
> > ea60045d9b1b ("btrfs: reserve space for delayed_refs in delalloc")
> > done in ~600s
> >
> > The two inconsistencies are that I didn't see it go fast on g/027 with just
> > the shrink_delalloc iterations patch reverted, and I don't have a 2
> > second baseline on my arm setup.
>
> At least we got something that both of us can reproduce.
>
> Another thing is, for g/027 on arm64 I'm also actively monitoring the CPU
> usage through top.
>
> Have you experienced very high (~100%) CPU usage on a kworker during g/027?
>
No :(
As far as I can tell the system is stuck waiting on a commit. I'll keep
trying to repro your symptom.
I'm curious if it goes away for you with Sun's proposed fix, something
like setting nr_pages to at least 1 after those min() operations.
Such a patch had no impact on the behavior of either of my systems,
though.
> That's the most reliably symptom on my arm64 systems, and that's the
> criteria I used to bisect, as it takes less than 5 seconds to determine if
> it's good or not.
>
> >
> > So I agree that this patch series effectively breaks those tests, on x86
> > as well. I didn't notice the change in runtime, unfortunately, as I only
> > looked for success/failure.
> >
> > As to the cause:
> > Both g/027 and g/224 are explicitly testing lots of writes to a small
> > filesystem.
> >
> > I suspect that what is happening is what Filipe warned about with
> > excessive space reclaim/pinning reclaim/etc. choking the workload
> > due to excessive reservation. I have played around with reducing the
> > reservation sizes in various ways (set it back to 0, set the level
> > estimate to 4 as test, etc.) and the result varies from back to full
> > speed or a 60s run. So in my setup, at least, it looks like the
> > performance of g/027 is very sensitive to how much we reserve.
>
> At least to me, the biggest problem is the 100% CPU usage of the kworker,
> which indicates a pretty bad dead looping.
>
> >
> > Would you be willing to let it run for 5-10m to see if you also
> > reproduce this behavior?
>
> Unfortunately it didn't even finish after 15m here.
> And there is the dmesg with time stamps, the calltrace is triggered by
> "echo l > /proc/sysrq-trigger".
>
> [ 30.140269] run fstests generic/027 at 2026-04-25 07:19:32
> [ 30.392655] BTRFS: device fsid 85ba0f7c-dfed-4220-9d47-72b07a1c81d8 devid
> 1 transid 8 /dev/mapper/test-scratch1 (253:2) scanned by mount (1108)
> [ 30.395605] BTRFS info (device dm-2): first mount of filesystem
> 85ba0f7c-dfed-4220-9d47-72b07a1c81d8
> [ 30.395625] BTRFS info (device dm-2): using crc32c checksum algorithm
> [ 30.398590] BTRFS info (device dm-2): checking UUID tree
> [ 30.398734] BTRFS info (device dm-2): turning on async discard
> [ 30.398737] BTRFS info (device dm-2): enabling free space tree
> [ 33.294754] systemd-journald[360]: Time jumped backwards, rotating.
> [ 993.736548] sysrq: Show backtrace of all active CPUs
> [ 993.736581] NMI backtrace for cpu 0
> [ 993.736608] CPU: 0 UID: 0 PID: 2410 Comm: bash Not tainted
> 7.0.0-rc7-custom-64k+ #10 PREEMPT(full)
> [ 993.736613] Hardware name: QEMU KVM Virtual Machine, BIOS unknown
> 2/2/2022
> [ 993.736616] Call trace:
> [ 993.736618] show_stack+0x20/0x38 (C)
> [ 993.736635] dump_stack_lvl+0x60/0x80
> [ 993.736646] dump_stack+0x18/0x24
> [ 993.736649] nmi_cpu_backtrace+0xf0/0x128
> [ 993.736665] nmi_trigger_cpumask_backtrace+0x1c4/0x1f8
> [ 993.736668] arch_trigger_cpumask_backtrace+0x20/0x40
> [ 993.736675] sysrq_handle_showallcpus+0x24/0x38
> [ 993.736686] __handle_sysrq+0x9c/0x1b8
> [ 993.736689] write_sysrq_trigger+0xcc/0x100
> [ 993.736692] proc_reg_write+0x7c/0xf0
> [ 993.736701] vfs_write+0xd8/0x3a8
> [ 993.736716] ksys_write+0x70/0x120
> [ 993.736719] __arm64_sys_write+0x20/0x40
> [ 993.736722] invoke_syscall.constprop.0+0x64/0xe8
> [ 993.736726] el0_svc_common.constprop.0+0x40/0xe8
> [ 993.736728] do_el0_svc+0x24/0x38
> [ 993.736730] el0_svc+0x3c/0x198
> [ 993.736733] el0t_64_sync_handler+0xa0/0xe8
> [ 993.736735] el0t_64_sync+0x198/0x1a0
> [ 993.736755] Sending NMI from CPU 0 to CPUs 1-7:
> [ 993.736769] NMI backtrace for cpu 3
> [ 993.736777] CPU: 3 UID: 0 PID: 212 Comm: kworker/u38:2 Not tainted
> 7.0.0-rc7-custom-64k+ #10 PREEMPT(full)
> [ 993.736780] Hardware name: QEMU KVM Virtual Machine, BIOS unknown
> 2/2/2022
> [ 993.736782] Workqueue: events_unbound btrfs_async_reclaim_metadata_space
> [btrfs]
> [ 993.736879] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT -SSBS
> BTYPE=--)
> [ 993.736882] pc : _raw_spin_unlock_irqrestore+0x10/0x60
> [ 993.736899] lr : __percpu_counter_sum+0x94/0xc0
> [ 993.736909] sp : ffff8000834cfc10
> [ 993.736910] x29: ffff8000834cfc10 x28: 0000000000000400 x27:
> 0000000008000000
> [ 993.736912] x26: ffff0000ccfef81c x25: 0000000000000000 x24:
> ffffb6b45621ef98
> [ 993.736914] x23: ffff0000d2e48698 x22: ffffb6b45621a000 x21:
> ffffb6b456219080
> [ 993.736916] x20: ffffb6b456219288 x19: 0000000000009000 x18:
> ffff494da90b0000
> [ 993.736917] x17: 0000000000000000 x16: ffffb6b45524e920 x15:
> ffffb6b45621ef98
> [ 993.736919] x14: ffffb6b456111740 x13: 0000000000000180 x12:
> ffff0001ff1c1740
> [ 993.736921] x11: 00000000000000c0 x10: 4eb904daffc7d416 x9 :
> ffffb6b45524e9b4
> [ 993.736922] x8 : ffff8000834cfab0 x7 : 0000000000000000 x6 :
> ffffffffffffffff
> [ 993.736924] x5 : 0000000000000000 x4 : 0000000000000000 x3 :
> 0000000000000008
> [ 993.736926] x2 : 0000000000000008 x1 : 0000000000000000 x0 :
> ffff0000d2e48698
> [ 993.736928] Call trace:
> [ 993.736929] _raw_spin_unlock_irqrestore+0x10/0x60 (P)
> [ 993.736932] flush_space+0x45c/0x6b0 [btrfs]
> [ 993.737001] do_async_reclaim_metadata_space+0x88/0x1d8 [btrfs]
> [ 993.737064] btrfs_async_reclaim_metadata_space+0x50/0x80 [btrfs]
> [ 993.737126] process_one_work+0x174/0x540
> [ 993.737138] worker_thread+0x1a0/0x318
> [ 993.737140] kthread+0x140/0x158
> [ 993.737145] ret_from_fork+0x10/0x20
> [ 993.737156] NMI backtrace for cpu 4
>
> >
> > I will try to instrument the reservations and reclaim codepaths and see
> > if I can think of a nice fix to reserve "enough but not too much".
> >
> > I can also try to attack the "stuck big fs under big reclaim" more
> > directly by trying to make reclaim less stuck-prone, rather than messing
> > with reservations. Though it would be quite disappointing if we
> > practically cannot make the reservation choices more accurate..
>
> Totally understandable, ENOSPC in btrfs is always the biggest challenge, and
> the trade-offs are always hard to balance.
>
> Meanwhile I'd prefer to have the last commit reverted so that we can
> continue our regular testing.
>
> Thanks,
> Qu
>
Totally agreed. I'm going to revert the whole series till I understand
what happened here.
Thanks,
Boris
> >
> > Thanks,
> > Boris
> >
> > > >
> > > > Do you have any clue on what's going wrong? I guess it's pretty hard to
> > > > hit on x86_64.
> > > >
> > > > I have a local btrfs branch with huge folios support, with that it's
> > > > pretty easy to hit similar problems on x86_64, but without that branch,
> > > > no hit is observed so far on x86_64.
> > > >
> > > > Thanks,
> > > > Qu
> > > >
> > > > > ---
> > > > > fs/btrfs/space-info.c | 31 ++++++++++++++++++++-----------
> > > > > 1 file changed, 20 insertions(+), 11 deletions(-)
> > > > >
> > > > > diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> > > > > index f0436eea1544..e931deb3d013 100644
> > > > > --- a/fs/btrfs/space-info.c
> > > > > +++ b/fs/btrfs/space-info.c
> > > > > @@ -725,9 +725,8 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
> > > > >  	struct btrfs_trans_handle *trans;
> > > > >  	u64 delalloc_bytes;
> > > > >  	u64 ordered_bytes;
> > > > > -	u64 items;
> > > > >  	long time_left;
> > > > > -	int loops;
> > > > > +	u64 orig_tickets_id;
> > > > >  	delalloc_bytes = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
> > > > >  	ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);
> > > > > @@ -735,9 +734,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
> > > > >  		return;
> > > > >  	/* Calc the number of the pages we need flush for space reservation */
> > > > > -	if (to_reclaim == U64_MAX) {
> > > > > -		items = U64_MAX;
> > > > > -	} else {
> > > > > +	if (to_reclaim != U64_MAX) {
> > > > >  		/*
> > > > >  		 * to_reclaim is set to however much metadata we need to
> > > > >  		 * reclaim, but reclaiming that much data doesn't really track
> > > > > @@ -751,7 +748,6 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
> > > > >  		 * aggressive.
> > > > >  		 */
> > > > >  		to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
> > > > > -		items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
> > > > >  	}
> > > > >  	trans = current->journal_info;
> > > > > @@ -764,10 +760,14 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
> > > > >  	if (ordered_bytes > delalloc_bytes && !for_preempt)
> > > > >  		wait_ordered = true;
> > > > > -	loops = 0;
> > > > > -	while ((delalloc_bytes || ordered_bytes) && loops < 3) {
> > > > > -		u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
> > > > > -		long nr_pages = min_t(u64, temp, LONG_MAX);
> > > > > +	spin_lock(&space_info->lock);
> > > > > +	orig_tickets_id = space_info->tickets_id;
> > > > > +	spin_unlock(&space_info->lock);
> > > > > +
> > > > > +	while ((delalloc_bytes || ordered_bytes) && to_reclaim) {
> > > > > +		u64 iter_reclaim = min_t(u64, to_reclaim, SZ_128M);
> > > > > +		long nr_pages = min_t(u64, delalloc_bytes, iter_reclaim) >> PAGE_SHIFT;
> > > > > +		u64 items = calc_reclaim_items_nr(fs_info, iter_reclaim) * 2;
> > > > >  		int async_pages;
> > > > >  		btrfs_start_delalloc_roots(fs_info, nr_pages, true);
> > > > > @@ -811,7 +811,7 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
> > > > >  			   atomic_read(&fs_info->async_delalloc_pages) <= async_pages);
> > > > >  skip_async:
> > > > > -		loops++;
> > > > > +		to_reclaim -= iter_reclaim;
> > > > >  		if (wait_ordered && !trans) {
> > > > >  			btrfs_wait_ordered_roots(fs_info, items, NULL);
> > > > >  		} else {
> > > > > @@ -834,6 +834,15 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
> > > > >  			spin_unlock(&space_info->lock);
> > > > >  			break;
> > > > >  		}
> > > > > +		/*
> > > > > +		 * If a ticket was satisfied since we started, break out
> > > > > +		 * so the async reclaim state machine can process delayed
> > > > > +		 * refs before we flush more delalloc.
> > > > > +		 */
> > > > > +		if (space_info->tickets_id != orig_tickets_id) {
> > > > > +			spin_unlock(&space_info->lock);
> > > > > +			break;
> > > > > +		}
> > > > >  		spin_unlock(&space_info->lock);
> > > > >  		delalloc_bytes = percpu_counter_sum_positive(
> > > >
> > > >
> > >
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 4/4] btrfs: cap shrink_delalloc iterations to 128M
2026-04-24 22:10 ` Boris Burkov
@ 2026-04-24 22:21 ` Qu Wenruo
2026-04-24 22:23 ` Boris Burkov
2026-04-24 22:59 ` Qu Wenruo
0 siblings, 2 replies; 18+ messages in thread
From: Qu Wenruo @ 2026-04-24 22:21 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs, kernel-team
On 2026/4/25 07:40, Boris Burkov wrote:
> On Sat, Apr 25, 2026 at 07:36:49AM +0930, Qu Wenruo wrote:
[...]
>> At least we got something that both of us can reproduce.
>>
>> Another thing is, for g/027 on arm64 I'm also actively monitoring the CPU
>> usage through top.
>>
>> Have you experienced very high (~100%) CPU usage on a kworker during g/027?
>>
>
> No :(
> As far as I can tell the system is stuck waiting on a commit. I'll keep
> trying to repro your symptom.
>
> I'm curious if it goes away for you with Sun's proposed fix, something
> like setting pages to at least 1 after those min() operations.
I go with the following diff:
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index e931deb3d013..2c5214b24239 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -770,6 +770,9 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
 		u64 items = calc_reclaim_items_nr(fs_info, iter_reclaim) * 2;
 		int async_pages;
 
+		if (nr_pages == 0)
+			nr_pages = 1;
+
 		btrfs_start_delalloc_roots(fs_info, nr_pages, true);
 
 		/*
It solves the dead looping kworker on arm64; now it's several different
kworkers each taking around 5~15% CPU, along with 027 itself.
But unfortunately the test case itself still seems not to end any time soon.
I believe the old 3 loops limit is really what makes the difference.
It's ugly, but at least it seems to work.
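If we do want that kind of hard bound back on top of the 128M
per-iteration cap, a minimal sketch (untested, reusing the names from the
v4 patch plus the old loops counter, only to illustrate the idea) could be:

	int loops = 0;

	while ((delalloc_bytes || ordered_bytes) && to_reclaim && loops < 3) {
		u64 iter_reclaim = min_t(u64, to_reclaim, SZ_128M);
		long nr_pages = min_t(u64, delalloc_bytes, iter_reclaim) >> PAGE_SHIFT;

		/* Never ask for zero pages, so every iteration flushes something. */
		if (nr_pages == 0)
			nr_pages = 1;

		/* ... start delalloc and wait on async pages as in v4 ... */

		to_reclaim -= iter_reclaim;
		loops++;
	}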
Thanks,
Qu
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v4 4/4] btrfs: cap shrink_delalloc iterations to 128M
2026-04-24 22:21 ` Qu Wenruo
@ 2026-04-24 22:23 ` Boris Burkov
2026-04-24 22:59 ` Qu Wenruo
1 sibling, 0 replies; 18+ messages in thread
From: Boris Burkov @ 2026-04-24 22:23 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs, kernel-team
On Sat, Apr 25, 2026 at 07:51:01AM +0930, Qu Wenruo wrote:
>
>
> On 2026/4/25 07:40, Boris Burkov wrote:
> > On Sat, Apr 25, 2026 at 07:36:49AM +0930, Qu Wenruo wrote:
> [...]
> > > At least we got something that both of us can reproduce.
> > >
> > > Another thing is, for g/027 on arm64 I'm also actively monitoring the CPU
> > > usage through top.
> > >
> > > Have you experienced very high (~100%) CPU usage on a kworker during g/027?
> > >
> >
> > No :(
> > As far as I can tell the system is stuck waiting on a commit. I'll keep
> > trying to repro your symptom.
> >
> > I'm curious if it goes away for you with Sun's proposed fix, something
> > like setting pages to at least 1 after those min() operations.
>
> I go with the following diff:
>
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index e931deb3d013..2c5214b24239 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -770,6 +770,9 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
>  		u64 items = calc_reclaim_items_nr(fs_info, iter_reclaim) * 2;
>  		int async_pages;
> 
> +		if (nr_pages == 0)
> +			nr_pages = 1;
> +
>  		btrfs_start_delalloc_roots(fs_info, nr_pages, true);
> 
>  		/*
>
>
> It solves the dead looping kworker on arm64; now it's several different
> kworkers each taking around 5~15% CPU, along with 027 itself.
>
> But unfortunately the test case itself still seems not to end any time soon.
>
> I believe the old 3 loops limit is really what makes the difference.
> It's ugly, but at least it seems to work.
>
> Thanks,
> Qu
Thanks for re-testing. Glad to hear that at least fixes the dead loop.
Honestly, I should have known better than to include an unbounded loop;
that was stupid. I even kind of thought it was dumb while doing it, but
convinced myself it "must make progress one extent at a time" or
whatever... Obviously I overlooked the min() bug too.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 4/4] btrfs: cap shrink_delalloc iterations to 128M
2026-04-24 22:21 ` Qu Wenruo
2026-04-24 22:23 ` Boris Burkov
@ 2026-04-24 22:59 ` Qu Wenruo
1 sibling, 0 replies; 18+ messages in thread
From: Qu Wenruo @ 2026-04-24 22:59 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs, kernel-team
On 2026/4/25 07:51, Qu Wenruo wrote:
>
>
> On 2026/4/25 07:40, Boris Burkov wrote:
>> On Sat, Apr 25, 2026 at 07:36:49AM +0930, Qu Wenruo wrote:
> [...]
>>> At least we got something that both of us can reproduce.
>>>
>>> Another thing is, for g/027 on arm64 I'm also actively monitoring the
>>> CPU
>>> usage through top.
>>>
>>> Have you experienced very high (~100%) CPU usage on a kworker during
>>> g/027?
>>>
>>
>> No :(
>> As far as I can tell the system is stuck waiting on a commit. I'll keep
>> trying to repro your symptom.
>>
>> I'm curious if it goes away for you with Sun's proposed fix, something
>> like setting pages to at least 1 after those min() operations.
>
> I go with the following diff:
>
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index e931deb3d013..2c5214b24239 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -770,6 +770,9 @@ static void shrink_delalloc(struct btrfs_space_info *space_info,
>  		u64 items = calc_reclaim_items_nr(fs_info, iter_reclaim) * 2;
>  		int async_pages;
> 
> +		if (nr_pages == 0)
> +			nr_pages = 1;
> +
>  		btrfs_start_delalloc_roots(fs_info, nr_pages, true);
> 
>  		/*
>
>
> It solves the dead looping kworker on arm64; now it's several different
> kworkers each taking around 5~15% CPU, along with 027 itself.
>
> But unfortunately the test case itself still seems not to end any time
> soon.
>
> I believe the old 3 loops limit is really what makes the difference.
> It's ugly, but at least it seems to work.
I'm not an expert on the space reclaim part, but as far as I understand,
there can be cases where the only dirty folios left to flush are the
ones that are already under writeback.
E.g. we have one large dirty folio (256KiB), and it's the only dirty
data folio in the whole fs.
Then, while reserving metadata for delalloc, we end up waiting for a
ticket that tries to shrink delalloc.
Although the ticket is trying to flush delalloc, the 256KiB dirty folio
is already under writeback, so flushing it reclaims no space and the
loop never progresses.
This causes a semi-deadlock: we want to reserve metadata to finish the
folio's writeback, but the metadata reservation is trying to flush that
same folio in order to make progress.
And the recent huge folio support makes the problem more obvious: as
folios grow larger, we have a higher chance of hitting this situation.
With that said, delalloc shrinking is never guaranteed to make any
progress under extreme ENOSPC conditions.
Thus a proper way to break out of the loop when no progress is being
made is highly recommended.
Hope this educated guess helps explain the problem.
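For what it's worth, the kind of bail-out I have in mind would be roughly
like this (completely untested; prev_total is just a local introduced for
illustration, everything else is from the v4 patch):

	u64 prev_total = delalloc_bytes + ordered_bytes;

	while ((delalloc_bytes || ordered_bytes) && to_reclaim) {
		/* ... flush a chunk of delalloc / wait on ordered as before ... */

		delalloc_bytes = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
		ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);

		/*
		 * Nothing went down since the last iteration, so flushing
		 * and waiting reclaimed no space; bail out instead of
		 * spinning on folios that are already under writeback.
		 */
		if (delalloc_bytes + ordered_bytes >= prev_total)
			break;
		prev_total = delalloc_bytes + ordered_bytes;
	}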
Thanks,
Qu
>
> Thanks,
> Qu
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 0/4] btrfs: improve stalls under sudden writeback
2026-04-09 17:48 [PATCH v4 0/4] btrfs: improve stalls under sudden writeback Boris Burkov
` (3 preceding siblings ...)
2026-04-09 17:48 ` [PATCH v4 4/4] btrfs: cap shrink_delalloc iterations to 128M Boris Burkov
@ 2026-04-13 18:41 ` David Sterba
4 siblings, 0 replies; 18+ messages in thread
From: David Sterba @ 2026-04-13 18:41 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs, kernel-team
On Thu, Apr 09, 2026 at 10:48:47AM -0700, Boris Burkov wrote:
> If you have a system with very large memory (TiBs) and a normal
> percentage based dirty_ratio/dirty_background_ratio like the defaults of
> 20%/10%, then we can theoretically rack up 100s of GiB of dirty pages
> before doing any writeback. This is further exacerbated if we also see a
> sudden drop in the free memory due to a large allocation. If we
> (relatively likely for a large ram system) also have a large disk, we are
> unlikely to trigger much preemptive metadata reclaim either.
>
> Once we do start doing writeback with such a large supply, the results
> are somewhat ugly. The delalloc work generates a huge amount of delayed
> refs without proper reservations which sends the metadata space system
> into a tailspin trying to run yet more delalloc to free space.
> Ultimately, the system stalls waiting for huge amounts of ordered
> extents and delayed refs blocking all users in start_transaction() on
> tickets in reserve_space().
>
> This patch series aims to address these issues in a relatively targeted
> way by improving our reservations for delalloc delayed refs and by doing
> some very basic smoothing of the work in flush_space(). Further work
> could be done to improve flush_space() heuristics and latency but this
> is already a big help on my observed workloads.
>
> I was able to reproduce stalls on a more "modest" system with 264GiB of
> ram by using a somewhat silly 80% dirty_ratio.
>
> I was unfortunately unable to reproduce any stalls on a yet smaller
> system with only 32GiB of ram.
>
> The first 2 patches do the delayed_ref rsv accounting on btrfs_inode,
> mirroring inode->block_rsv.
> The 3rd patch is a cleanup of the types used for counting max extents.
> The 4th patch reduces the size of the unit of work in shrink_delalloc()
> to further reduce stalls.
> ---
> Changelog:
> v4:
> - Treat the extent tree data delayed ref as needing reservation for two cow
> operations.
As this has been reviewed by Filipe, please add it to for-next. Thanks.
^ permalink raw reply [flat|nested] 18+ messages in thread