public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/7] btrfs: fix periodic reclaim condition with some cleanup
@ 2026-01-03 12:19 Sun YangKai
  2026-01-03 12:19 ` [PATCH v2 1/7] btrfs: fix periodic reclaim condition Sun YangKai
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Sun YangKai @ 2026-01-03 12:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Sun YangKai

This series eliminates wasteful periodic reclaim operations that kept
occurring even after previous passes had already failed to reclaim any
space, and includes several preparatory cleanups.

Patch 1 fixes the core issue.
Patches 2-7 are cleanups with no functional change.
Details are in commit messages.

Changes from v1:

- Do the fix first, then the cleanups, as advised by Qu Wenruo <quwenruo.btrfs@gmx.com>

Sun YangKai (7):
  btrfs: fix periodic reclaim condition
  btrfs: use u8 for reclaim threshold type
  btrfs: clarify reclaim sweep control flow
  btrfs: change block group reclaim_mark to bool
  btrfs: reorder btrfs_block_group members to reduce struct size
  btrfs: use proper types for btrfs_block_group fields
  btrfs: consolidate reclaim readiness checks in btrfs_should_reclaim()

 fs/btrfs/block-group.c | 29 ++++++++++-----------
 fs/btrfs/block-group.h | 22 ++++++++++------
 fs/btrfs/space-info.c  | 58 ++++++++++++++++++++----------------------
 fs/btrfs/space-info.h  | 40 ++++++++++++++++++-----------
 fs/btrfs/sysfs.c       |  3 ++-
 5 files changed, 82 insertions(+), 70 deletions(-)

-- 
2.51.2


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2 1/7] btrfs: fix periodic reclaim condition
  2026-01-03 12:19 [PATCH v2 0/7] btrfs: fix periodic reclaim condition with some cleanup Sun YangKai
@ 2026-01-03 12:19 ` Sun YangKai
  2026-01-04 19:40   ` Boris Burkov
  2026-01-03 12:19 ` [PATCH v2 2/7] btrfs: use u8 for reclaim threshold type Sun YangKai
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Sun YangKai @ 2026-01-03 12:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Sun YangKai, Boris Burkov

Problems with current implementation:
1. reclaimable_bytes is signed while chunk_sz is unsigned, so a
   negative reclaimable_bytes is promoted to a huge unsigned value in
   the comparison and triggers reclaim unexpectedly
2. The "space must be freed between scans" assumption breaks the
   two-scan requirement: the first scan marks block groups, the second
   scan reclaims them. If no space is freed in between, the second scan
   never runs and no reclamation occurs.

Instead, track actual reclaim progress: pause reclaim when block groups
will be reclaimed, and resume only when progress is made. This ensures
reclaim continues until no further progress can be made, then resumes when
space_info changes or new reclaimable groups appear.

CC: Boris Burkov <boris@bur.io>
Fixes: 813d4c6422516 ("btrfs: prevent pathological periodic reclaim loops")
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
---
 fs/btrfs/block-group.c | 15 +++++++-------
 fs/btrfs/space-info.c  | 44 +++++++++++++++++++-----------------------
 fs/btrfs/space-info.h  | 28 ++++++++++++++++++---------
 3 files changed, 46 insertions(+), 41 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index e417aba4c4c7..94a4068cd42a 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1871,6 +1871,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
 	while (!list_empty(&fs_info->reclaim_bgs)) {
 		u64 used;
 		u64 reserved;
+		u64 old_total;
 		int ret = 0;
 
 		bg = list_first_entry(&fs_info->reclaim_bgs,
@@ -1936,6 +1937,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
 		}
 
 		spin_unlock(&bg->lock);
+		old_total = space_info->total_bytes;
 		spin_unlock(&space_info->lock);
 
 		/*
@@ -1988,14 +1990,14 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
 			reserved = 0;
 			spin_lock(&space_info->lock);
 			space_info->reclaim_errors++;
-			if (READ_ONCE(space_info->periodic_reclaim))
-				space_info->periodic_reclaim_ready = false;
 			spin_unlock(&space_info->lock);
 		}
 		spin_lock(&space_info->lock);
 		space_info->reclaim_count++;
 		space_info->reclaim_bytes += used;
 		space_info->reclaim_bytes += reserved;
+		if (space_info->total_bytes < old_total)
+			btrfs_resume_periodic_reclaim(space_info);
 		spin_unlock(&space_info->lock);
 
 next:
@@ -3730,8 +3732,6 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
 		space_info->bytes_reserved -= num_bytes;
 		space_info->bytes_used += num_bytes;
 		space_info->disk_used += num_bytes * factor;
-		if (READ_ONCE(space_info->periodic_reclaim))
-			btrfs_space_info_update_reclaimable(space_info, -num_bytes);
 		spin_unlock(&cache->lock);
 		spin_unlock(&space_info->lock);
 	} else {
@@ -3741,12 +3741,11 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
 		btrfs_space_info_update_bytes_pinned(space_info, num_bytes);
 		space_info->bytes_used -= num_bytes;
 		space_info->disk_used -= num_bytes * factor;
-		if (READ_ONCE(space_info->periodic_reclaim))
-			btrfs_space_info_update_reclaimable(space_info, num_bytes);
-		else
-			reclaim = should_reclaim_block_group(cache, num_bytes);
+		reclaim = should_reclaim_block_group(cache, num_bytes);
 
 		spin_unlock(&cache->lock);
+		if (reclaim)
+			btrfs_resume_periodic_reclaim(space_info);
 		spin_unlock(&space_info->lock);
 
 		btrfs_set_extent_bit(&trans->transaction->pinned_extents, bytenr,
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 7b7b7255f7d8..de8bde1081be 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -2119,48 +2119,44 @@ static void do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
 	 * really need a block group, do take a fresh one.
 	 */
 	if (try_again && urgent) {
-		try_again = false;
+		urgent = false;
 		goto again;
 	}
 
 	up_read(&space_info->groups_sem);
-}
-
-void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes)
-{
-	u64 chunk_sz = calc_effective_data_chunk_size(space_info->fs_info);
-
-	lockdep_assert_held(&space_info->lock);
-	space_info->reclaimable_bytes += bytes;
 
-	if (space_info->reclaimable_bytes >= chunk_sz)
-		btrfs_set_periodic_reclaim_ready(space_info, true);
-}
-
-void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready)
-{
-	lockdep_assert_held(&space_info->lock);
-	if (!READ_ONCE(space_info->periodic_reclaim))
-		return;
-	if (ready != space_info->periodic_reclaim_ready) {
-		space_info->periodic_reclaim_ready = ready;
-		if (!ready)
-			space_info->reclaimable_bytes = 0;
+	/*
+	 * Temporarily pause periodic reclaim until reclaim makes progress.
+	 * This prevents periodic reclaim from repeating without making any progress.
+	 */
+	if (!try_again) {
+		spin_lock(&space_info->lock);
+		btrfs_pause_periodic_reclaim(space_info);
+		spin_unlock(&space_info->lock);
 	}
 }
 
 static bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info)
 {
 	bool ret;
+	u64 chunk_sz;
+	u64 unused;
 
 	if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
 		return false;
 	if (!READ_ONCE(space_info->periodic_reclaim))
 		return false;
+	if (!READ_ONCE(space_info->periodic_reclaim_paused))
+		return true;
+
+	chunk_sz = calc_effective_data_chunk_size(space_info->fs_info);
 
 	spin_lock(&space_info->lock);
-	ret = space_info->periodic_reclaim_ready;
-	btrfs_set_periodic_reclaim_ready(space_info, false);
+	unused = space_info->total_bytes - space_info->bytes_used;
+	ret = (unused >= space_info->last_reclaim_unused + chunk_sz ||
+	       btrfs_calc_reclaim_threshold(space_info) != space_info->last_reclaim_threshold);
+	if (ret)
+		btrfs_resume_periodic_reclaim(space_info);
 	spin_unlock(&space_info->lock);
 
 	return ret;
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index 0703f24b23f7..a49a4c7b0a68 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -214,14 +214,11 @@ struct btrfs_space_info {
 
 	/*
 	 * Periodic reclaim should be a no-op if a space_info hasn't
-	 * freed any space since the last time we tried.
+	 * freed any space since the last time we made no progress.
 	 */
-	bool periodic_reclaim_ready;
-
-	/*
-	 * Net bytes freed or allocated since the last reclaim pass.
-	 */
-	s64 reclaimable_bytes;
+	bool periodic_reclaim_paused;
+	int last_reclaim_threshold;
+	u64 last_reclaim_unused;
 };
 
 static inline bool btrfs_mixed_space_info(const struct btrfs_space_info *space_info)
@@ -301,9 +298,22 @@ void btrfs_dump_space_info_for_trans_abort(struct btrfs_fs_info *fs_info);
 void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info);
 u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
 
-void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes);
-void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready);
 int btrfs_calc_reclaim_threshold(const struct btrfs_space_info *space_info);
+static inline void btrfs_resume_periodic_reclaim(struct btrfs_space_info *space_info)
+{
+	lockdep_assert_held(&space_info->lock);
+	if (space_info->periodic_reclaim_paused)
+		space_info->periodic_reclaim_paused = false;
+}
+static inline void btrfs_pause_periodic_reclaim(struct btrfs_space_info *space_info)
+{
+	lockdep_assert_held(&space_info->lock);
+	if (!space_info->periodic_reclaim_paused) {
+		space_info->periodic_reclaim_paused = true;
+		space_info->last_reclaim_threshold = btrfs_calc_reclaim_threshold(space_info);
+		space_info->last_reclaim_unused = space_info->total_bytes - space_info->bytes_used;
+	}
+}
 void btrfs_reclaim_sweep(const struct btrfs_fs_info *fs_info);
 void btrfs_return_free_space(struct btrfs_space_info *space_info, u64 len);
 
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 2/7] btrfs: use u8 for reclaim threshold type
  2026-01-03 12:19 [PATCH v2 0/7] btrfs: fix periodic reclaim condition with some cleanup Sun YangKai
  2026-01-03 12:19 ` [PATCH v2 1/7] btrfs: fix periodic reclaim condition Sun YangKai
@ 2026-01-03 12:19 ` Sun YangKai
  2026-01-03 12:19 ` [PATCH v2 3/7] btrfs: clarify reclaim sweep control flow Sun YangKai
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Sun YangKai @ 2026-01-03 12:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Sun YangKai

The bg_reclaim_threshold field stores a percentage value (0-100), making
u8 a more appropriate type than int. Update the field and related
function return types:

- struct btrfs_space_info::bg_reclaim_threshold
- calc_dynamic_reclaim_threshold()
- btrfs_calc_reclaim_threshold()

The sysfs store function already validates the range is 0-100, ensuring
the cast to u8 is safe.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
---
 fs/btrfs/space-info.c |  6 +++---
 fs/btrfs/space-info.h | 14 +++++++-------
 fs/btrfs/sysfs.c      |  3 ++-
 3 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index de8bde1081be..29dffbd9aadd 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -2037,7 +2037,7 @@ static u64 calc_unalloc_target(struct btrfs_fs_info *fs_info)
  * Typically with 10 block groups as the target, the discrete values this comes
  * out to are 0, 10, 20, ... , 80, 90, and 99.
  */
-static int calc_dynamic_reclaim_threshold(const struct btrfs_space_info *space_info)
+static u8 calc_dynamic_reclaim_threshold(const struct btrfs_space_info *space_info)
 {
 	struct btrfs_fs_info *fs_info = space_info->fs_info;
 	u64 unalloc = atomic64_read(&fs_info->free_chunk_space);
@@ -2052,11 +2052,11 @@ static int calc_dynamic_reclaim_threshold(const struct btrfs_space_info *space_i
 	if (unused < data_chunk_size)
 		return 0;
 
-	/* Cast to int is OK because want <= target. */
+	/* Cast to u8 is OK because want <= target. */
 	return calc_pct_ratio(want, target);
 }
 
-int btrfs_calc_reclaim_threshold(const struct btrfs_space_info *space_info)
+u8 btrfs_calc_reclaim_threshold(const struct btrfs_space_info *space_info)
 {
 	lockdep_assert_held(&space_info->lock);
 
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index a49a4c7b0a68..3cf22d7fad20 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -132,15 +132,15 @@ struct btrfs_space_info {
 	/* Chunk size in bytes */
 	u64 chunk_size;
 
+	int clamp;		/* Used to scale our threshold for preemptive
+				   flushing. The value is >> clamp, so turns
+				   out to be a 2^clamp divisor. */
+
 	/*
 	 * Once a block group drops below this threshold (percents) we'll
 	 * schedule it for reclaim.
 	 */
-	int bg_reclaim_threshold;
-
-	int clamp;		/* Used to scale our threshold for preemptive
-				   flushing. The value is >> clamp, so turns
-				   out to be a 2^clamp divisor. */
+	u8 bg_reclaim_threshold;
 
 	bool full;		/* indicates that we cannot allocate any more
 				   chunks for this space */
@@ -217,7 +217,7 @@ struct btrfs_space_info {
 	 * freed any space since the last time we made no progress.
 	 */
 	bool periodic_reclaim_paused;
-	int last_reclaim_threshold;
+	u8 last_reclaim_threshold;
 	u64 last_reclaim_unused;
 };
 
@@ -298,7 +298,7 @@ void btrfs_dump_space_info_for_trans_abort(struct btrfs_fs_info *fs_info);
 void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info);
 u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
 
-int btrfs_calc_reclaim_threshold(const struct btrfs_space_info *space_info);
+u8 btrfs_calc_reclaim_threshold(const struct btrfs_space_info *space_info);
 static inline void btrfs_resume_periodic_reclaim(struct btrfs_space_info *space_info)
 {
 	lockdep_assert_held(&space_info->lock);
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index f0974f4c0ae4..468c6a9acd3b 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -937,7 +937,8 @@ static ssize_t btrfs_sinfo_bg_reclaim_threshold_store(struct kobject *kobj,
 	if (thresh < 0 || thresh > 100)
 		return -EINVAL;
 
-	WRITE_ONCE(space_info->bg_reclaim_threshold, thresh);
+	/* Safe to cast to u8 after checking that thresh is between 0 and 100 */
+	WRITE_ONCE(space_info->bg_reclaim_threshold, (u8)thresh);
 
 	return len;
 }
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 3/7] btrfs: clarify reclaim sweep control flow
  2026-01-03 12:19 [PATCH v2 0/7] btrfs: fix periodic reclaim condition with some cleanup Sun YangKai
  2026-01-03 12:19 ` [PATCH v2 1/7] btrfs: fix periodic reclaim condition Sun YangKai
  2026-01-03 12:19 ` [PATCH v2 2/7] btrfs: use u8 for reclaim threshold type Sun YangKai
@ 2026-01-03 12:19 ` Sun YangKai
  2026-01-03 12:19 ` [PATCH v2 4/7] btrfs: change block group reclaim_mark to bool Sun YangKai
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Sun YangKai @ 2026-01-03 12:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Sun YangKai

Replace the try_again flag with will_reclaim and adjust the logic to
better reflect its intent. This makes the reclaim decision easier to
follow without changing behavior.

Also prepare for the next patch.

No functional change.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
---
 fs/btrfs/space-info.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 29dffbd9aadd..3ccb365549ff 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -2083,7 +2083,7 @@ static void do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
 {
 	struct btrfs_block_group *bg;
 	int thresh_pct;
-	bool try_again = true;
+	bool will_reclaim = false;
 	bool urgent;
 
 	spin_lock(&space_info->lock);
@@ -2101,7 +2101,7 @@ static void do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
 		spin_lock(&bg->lock);
 		thresh = mult_perc(bg->length, thresh_pct);
 		if (bg->used < thresh && bg->reclaim_mark) {
-			try_again = false;
+			will_reclaim = true;
 			reclaim = true;
 		}
 		bg->reclaim_mark++;
@@ -2118,7 +2118,7 @@ static void do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
 	 * If we have any staler groups, we don't touch the fresher ones, but if we
 	 * really need a block group, do take a fresh one.
 	 */
-	if (try_again && urgent) {
+	if (!will_reclaim && urgent) {
 		urgent = false;
 		goto again;
 	}
@@ -2129,7 +2129,7 @@ static void do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
 	 * Temporarily pause periodic reclaim until reclaim makes progress.
 	 * This prevents periodic reclaim from repeating without making any progress.
 	 */
-	if (!try_again) {
+	if (will_reclaim) {
 		spin_lock(&space_info->lock);
 		btrfs_pause_periodic_reclaim(space_info);
 		spin_unlock(&space_info->lock);
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 4/7] btrfs: change block group reclaim_mark to bool
  2026-01-03 12:19 [PATCH v2 0/7] btrfs: fix periodic reclaim condition with some cleanup Sun YangKai
                   ` (2 preceding siblings ...)
  2026-01-03 12:19 ` [PATCH v2 3/7] btrfs: clarify reclaim sweep control flow Sun YangKai
@ 2026-01-03 12:19 ` Sun YangKai
  2026-01-03 12:19 ` [PATCH v2 5/7] btrfs: reorder btrfs_block_group members to reduce struct size Sun YangKai
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Sun YangKai @ 2026-01-03 12:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Sun YangKai

The reclaim_mark field in struct btrfs_block_group was a u64 that was
incremented when marking block groups for reclaim during sweeping, but
the actual counter value was never used - only the zero/non-zero state
mattered for determining if a block group needed reclaim.

Convert it to a bool to properly reflect its usage and reduce memory
footprint by 8 bytes. Update assignments to use true/false instead of
increment and zero.

This reduces sizeof(struct btrfs_block_group) from 632 to 624 bytes.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
---
 fs/btrfs/block-group.c | 2 +-
 fs/btrfs/block-group.h | 7 ++++++-
 fs/btrfs/space-info.c  | 2 +-
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 94a4068cd42a..f29553fa204a 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -3728,7 +3728,7 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
 		old_val += num_bytes;
 		cache->used = old_val;
 		cache->reserved -= num_bytes;
-		cache->reclaim_mark = 0;
+		cache->reclaim_mark = false;
 		space_info->bytes_reserved -= num_bytes;
 		space_info->bytes_used += num_bytes;
 		space_info->disk_used += num_bytes * factor;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 5f933455118c..3b3c61b3af2c 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -251,6 +251,12 @@ struct btrfs_block_group {
 	/* Protected by @free_space_lock. */
 	bool using_free_space_bitmaps_cached;
 
+	/*
+	 * Set when this block group has not been used for allocation
+	 * between two reclaim sweeps.
+	 */
+	bool reclaim_mark;
+
 	/*
 	 * Number of extents in this block group used for swap files.
 	 * All accesses protected by the spinlock 'lock'.
@@ -270,7 +276,6 @@ struct btrfs_block_group {
 	struct work_struct zone_finish_work;
 	struct extent_buffer *last_eb;
 	enum btrfs_block_group_size_class size_class;
-	u64 reclaim_mark;
 };
 
 static inline u64 btrfs_block_group_end(const struct btrfs_block_group *block_group)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 3ccb365549ff..dce21809e640 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -2104,7 +2104,7 @@ static void do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
 			will_reclaim = true;
 			reclaim = true;
 		}
-		bg->reclaim_mark++;
+		bg->reclaim_mark = true;
 		spin_unlock(&bg->lock);
 		if (reclaim)
 			btrfs_mark_bg_to_reclaim(bg);
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 5/7] btrfs: reorder btrfs_block_group members to reduce struct size
  2026-01-03 12:19 [PATCH v2 0/7] btrfs: fix periodic reclaim condition with some cleanup Sun YangKai
                   ` (3 preceding siblings ...)
  2026-01-03 12:19 ` [PATCH v2 4/7] btrfs: change block group reclaim_mark to bool Sun YangKai
@ 2026-01-03 12:19 ` Sun YangKai
  2026-01-05 15:07   ` Filipe Manana
  2026-01-03 12:19 ` [PATCH v2 6/7] btrfs: use proper types for btrfs_block_group fields Sun YangKai
  2026-01-03 12:19 ` [PATCH v2 7/7] btrfs: consolidate reclaim readiness checks in btrfs_should_reclaim() Sun YangKai
  6 siblings, 1 reply; 16+ messages in thread
From: Sun YangKai @ 2026-01-03 12:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Sun YangKai

Reorder struct btrfs_block_group fields to improve packing and reduce
memory footprint from 624 to 600 bytes (24 bytes saved per instance).

Here's pahole output after this change:

struct btrfs_block_group {
	struct btrfs_fs_info *     fs_info;              /*     0     8 */
	struct btrfs_inode *       inode;                /*     8     8 */
	u64                        start;                /*    16     8 */
	u64                        length;               /*    24     8 */
	u64                        pinned;               /*    32     8 */
	u64                        reserved;             /*    40     8 */
	u64                        used;                 /*    48     8 */
	u64                        delalloc_bytes;       /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	u64                        bytes_super;          /*    64     8 */
	u64                        flags;                /*    72     8 */
	u64                        cache_generation;     /*    80     8 */
	u64                        global_root_id;       /*    88     8 */
	u64                        commit_used;          /*    96     8 */
	u32                        bitmap_high_thresh;   /*   104     4 */
	u32                        bitmap_low_thresh;    /*   108     4 */
	struct rw_semaphore        data_rwsem;           /*   112    40 */
	/* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
	unsigned long              full_stripe_len;      /*   152     8 */
	unsigned long              runtime_flags;        /*   160     8 */
	spinlock_t                 lock;                 /*   168     4 */
	unsigned int               ro;                   /*   172     4 */
	enum btrfs_disk_cache_state disk_cache_state;    /*   176     4 */
	enum btrfs_caching_type    cached;               /*   180     4 */
	struct btrfs_caching_control * caching_ctl;      /*   184     8 */
	/* --- cacheline 3 boundary (192 bytes) --- */
	struct btrfs_space_info *  space_info;           /*   192     8 */
	struct btrfs_free_space_ctl * free_space_ctl;    /*   200     8 */
	struct rb_node             cache_node;           /*   208    24 */
	struct list_head           list;                 /*   232    16 */
	struct list_head           cluster_list;         /*   248    16 */
	/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
	struct list_head           bg_list;              /*   264    16 */
	struct list_head           ro_list;              /*   280    16 */
	refcount_t                 refs;                 /*   296     4 */
	atomic_t                   frozen;               /*   300     4 */
	struct list_head           discard_list;         /*   304    16 */
	/* --- cacheline 5 boundary (320 bytes) --- */
	enum btrfs_discard_state   discard_state;        /*   320     4 */
	int                        discard_index;        /*   324     4 */
	u64                        discard_eligible_time; /*   328     8 */
	u64                        discard_cursor;       /*   336     8 */
	struct list_head           dirty_list;           /*   344    16 */
	struct list_head           io_list;              /*   360    16 */
	struct btrfs_io_ctl        io_ctl;               /*   376    72 */
	/* --- cacheline 7 boundary (448 bytes) --- */
	atomic_t                   reservations;         /*   448     4 */
	atomic_t                   nocow_writers;        /*   452     4 */
	struct mutex               free_space_lock;      /*   456    32 */
	bool                       using_free_space_bitmaps; /*   488     1 */
	bool                       using_free_space_bitmaps_cached; /*   489     1 */
	bool                       reclaim_mark;         /*   490     1 */

	/* XXX 1 byte hole, try to pack */

	int                        swap_extents;         /*   492     4 */
	u64                        alloc_offset;         /*   496     8 */
	u64                        zone_unusable;        /*   504     8 */
	/* --- cacheline 8 boundary (512 bytes) --- */
	u64                        zone_capacity;        /*   512     8 */
	u64                        meta_write_pointer;   /*   520     8 */
	struct btrfs_chunk_map *   physical_map;         /*   528     8 */
	struct list_head           active_bg_list;       /*   536    16 */
	struct work_struct         zone_finish_work;     /*   552    32 */
	/* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
	struct extent_buffer *     last_eb;              /*   584     8 */
	enum btrfs_block_group_size_class size_class;    /*   592     4 */

	/* size: 600, cachelines: 10, members: 56 */
	/* sum members: 595, holes: 1, sum holes: 1 */
	/* padding: 4 */
	/* last cacheline: 24 bytes */
};

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
---
 fs/btrfs/block-group.h | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 3b3c61b3af2c..88c2e3a0a5a0 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -118,7 +118,6 @@ struct btrfs_caching_control {
 struct btrfs_block_group {
 	struct btrfs_fs_info *fs_info;
 	struct btrfs_inode *inode;
-	spinlock_t lock;
 	u64 start;
 	u64 length;
 	u64 pinned;
@@ -159,6 +158,8 @@ struct btrfs_block_group {
 	unsigned long full_stripe_len;
 	unsigned long runtime_flags;
 
+	spinlock_t lock;
+
 	unsigned int ro;
 
 	int disk_cache_state;
@@ -178,8 +179,6 @@ struct btrfs_block_group {
 	/* For block groups in the same raid type */
 	struct list_head list;
 
-	refcount_t refs;
-
 	/*
 	 * List of struct btrfs_free_clusters for this block group.
 	 * Today it will only have one thing on it, but that may change
@@ -199,6 +198,8 @@ struct btrfs_block_group {
 	/* For read-only block groups */
 	struct list_head ro_list;
 
+	refcount_t refs;
+
 	/*
 	 * When non-zero it means the block group's logical address and its
 	 * device extents can not be reused for future block group allocations
@@ -211,10 +212,10 @@ struct btrfs_block_group {
 
 	/* For discard operations */
 	struct list_head discard_list;
+	enum btrfs_discard_state discard_state;
 	int discard_index;
 	u64 discard_eligible_time;
 	u64 discard_cursor;
-	enum btrfs_discard_state discard_state;
 
 	/* For dirty block groups */
 	struct list_head dirty_list;
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 6/7] btrfs: use proper types for btrfs_block_group fields
  2026-01-03 12:19 [PATCH v2 0/7] btrfs: fix periodic reclaim condition with some cleanup Sun YangKai
                   ` (4 preceding siblings ...)
  2026-01-03 12:19 ` [PATCH v2 5/7] btrfs: reorder btrfs_block_group members to reduce struct size Sun YangKai
@ 2026-01-03 12:19 ` Sun YangKai
  2026-01-03 12:19 ` [PATCH v2 7/7] btrfs: consolidate reclaim readiness checks in btrfs_should_reclaim() Sun YangKai
  6 siblings, 0 replies; 16+ messages in thread
From: Sun YangKai @ 2026-01-03 12:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Sun YangKai

Convert disk_cache_state and cached fields in struct btrfs_block_group
from int to their respective enum types (enum btrfs_disk_cache_state
and enum btrfs_caching_type). Also change btrfs_block_group_done() to
return bool instead of int since it returns a boolean comparison.

No functional changes intended.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
---
 fs/btrfs/block-group.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 88c2e3a0a5a0..bbd9371e33fd 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -162,10 +162,10 @@ struct btrfs_block_group {
 
 	unsigned int ro;
 
-	int disk_cache_state;
+	enum btrfs_disk_cache_state disk_cache_state;
 
 	/* Cache tracking stuff */
-	int cached;
+	enum btrfs_caching_type cached;
 	struct btrfs_caching_control *caching_ctl;
 
 	struct btrfs_space_info *space_info;
@@ -383,7 +383,7 @@ static inline u64 btrfs_system_alloc_profile(struct btrfs_fs_info *fs_info)
 	return btrfs_get_alloc_profile(fs_info, BTRFS_BLOCK_GROUP_SYSTEM);
 }
 
-static inline int btrfs_block_group_done(const struct btrfs_block_group *cache)
+static inline bool btrfs_block_group_done(const struct btrfs_block_group *cache)
 {
 	smp_mb();
 	return cache->cached == BTRFS_CACHE_FINISHED ||
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 7/7] btrfs: consolidate reclaim readiness checks in btrfs_should_reclaim()
  2026-01-03 12:19 [PATCH v2 0/7] btrfs: fix periodic reclaim condition with some cleanup Sun YangKai
                   ` (5 preceding siblings ...)
  2026-01-03 12:19 ` [PATCH v2 6/7] btrfs: use proper types for btrfs_block_group fields Sun YangKai
@ 2026-01-03 12:19 ` Sun YangKai
  6 siblings, 0 replies; 16+ messages in thread
From: Sun YangKai @ 2026-01-03 12:19 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Sun YangKai

Move the filesystem state validation from btrfs_reclaim_bgs_work() into
btrfs_should_reclaim() to centralize the reclaim eligibility logic.
Since btrfs_reclaim_bgs_work() is the only caller of
btrfs_should_reclaim(), there is no functional change.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
---
 fs/btrfs/block-group.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index f29553fa204a..39e6f1bf3506 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1804,6 +1804,12 @@ static int reclaim_bgs_cmp(void *unused, const struct list_head *a,
 
 static inline bool btrfs_should_reclaim(const struct btrfs_fs_info *fs_info)
 {
+	if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
+		return false;
+
+	if (btrfs_fs_closing(fs_info))
+		return false;
+
 	if (btrfs_is_zoned(fs_info))
 		return btrfs_zoned_should_reclaim(fs_info);
 	return true;
@@ -1838,12 +1844,6 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
 	struct btrfs_space_info *space_info;
 	LIST_HEAD(retry_list);
 
-	if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
-		return;
-
-	if (btrfs_fs_closing(fs_info))
-		return;
-
 	if (!btrfs_should_reclaim(fs_info))
 		return;
 
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/7] btrfs: fix periodic reclaim condition
  2026-01-03 12:19 ` [PATCH v2 1/7] btrfs: fix periodic reclaim condition Sun YangKai
@ 2026-01-04 19:40   ` Boris Burkov
  2026-01-05 13:00     ` Sun Yangkai
  0 siblings, 1 reply; 16+ messages in thread
From: Boris Burkov @ 2026-01-04 19:40 UTC (permalink / raw)
  To: Sun YangKai; +Cc: linux-btrfs

On Sat, Jan 03, 2026 at 08:19:48PM +0800, Sun YangKai wrote:

First off, thank you very much for your attention to this and for
your in-depth fixes.

> Problems with current implementation:
> 1. reclaimable_bytes is signed while chunk_sz is unsigned, causing
>    negative reclaimable_bytes to trigger reclaim unexpectedly

Agreed, completely. IMO, the message here could use a bit more clarity
that negative reclaimable_bytes is an intentionally designed condition
so this is a fundamental bug, not a weird edge case.

> 2. The "space must be freed between scans" assumption breaks the
>    two-scan requirement: first scan marks block groups, second scan
>    reclaims them. Without the second scan, no reclamation occurs.

I think I understand what you are saying, but let me try to restate it
myself and confirm:

Previously, due to bug 1, we were calling
btrfs_set_periodic_reclaim_ready() too frequently (after an allocation
that made the reclaimable_bytes negative). With bug 1 fixed, you really
have to free a chunk_sz to call btrfs_set_periodic_reclaim_ready(). As a
result, periodic_reclaim became way too conservative. Now the
multi-sweep marking thing AND reclaim_ready mean we basically never
reclaim, as the second sweep doesn't get triggered later.

Is that about right? I agree that this needs some re-thinking / fixing,
if so.

> 
> Instead, track actual reclaim progress: pause reclaim when block groups
> will be reclaimed, and resume only when progress is made. This ensures
> reclaim continues until no further progress can be made, then resumes when
> space_info changes or new reclaimable groups appear.

I think you are making a much bigger change than merely fixing the bugs
in the original design, so it requires deeper explanation of the new
invariants you are introducing. Clearly, I am one to talk about it, as I
messed up with my attempt, but I still think it is very valuable for
future readers. Thanks for bearing with me :)

I think the simplest pure "fix" is to change the current "ready = false"
setting from unconditional on every run of periodic reclaim to only when
a sweep actually queues a block group for reclaim. I have compiled but
not yet tested the code I included in this mail :) We can do that more
carefully once we agree how we want to move forward.

If you still want to re-design things on top of a fix like this one, I
am totally open to that.

Thanks again for all your work on this, and happy new year,
Boris

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 6babbe333741..aad402485763 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -2095,12 +2095,13 @@ static bool is_reclaim_urgent(struct btrfs_space_info *space_info)
 	return unalloc < data_chunk_size;
 }

-static void do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
+static bool do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
 {
 	struct btrfs_block_group *bg;
 	int thresh_pct;
 	bool try_again = true;
 	bool urgent;
+	bool ret = false;

 	spin_lock(&space_info->lock);
 	urgent = is_reclaim_urgent(space_info);
@@ -2122,8 +2123,10 @@ static void do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
 		}
 		bg->reclaim_mark++;
 		spin_unlock(&bg->lock);
-		if (reclaim)
+		if (reclaim) {
 			btrfs_mark_bg_to_reclaim(bg);
+			ret = true;
+		}
 		btrfs_put_block_group(bg);
 	}

@@ -2140,6 +2143,7 @@ static void do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
 	}

 	up_read(&space_info->groups_sem);
+	return ret;
 }

 void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes)
@@ -2149,7 +2153,8 @@ void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s6
 	lockdep_assert_held(&space_info->lock);
 	space_info->reclaimable_bytes += bytes;

-	if (space_info->reclaimable_bytes >= chunk_sz)
+	if (space_info->reclaimable_bytes > 0 &&
+	    space_info->reclaimable_bytes >= chunk_sz)
 		btrfs_set_periodic_reclaim_ready(space_info, true);
 }

@@ -2176,7 +2181,6 @@ static bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info)

 	spin_lock(&space_info->lock);
 	ret = space_info->periodic_reclaim_ready;
-	btrfs_set_periodic_reclaim_ready(space_info, false);
 	spin_unlock(&space_info->lock);

 	return ret;
@@ -2190,8 +2194,10 @@ void btrfs_reclaim_sweep(const struct btrfs_fs_info *fs_info)
 	list_for_each_entry(space_info, &fs_info->space_info, list) {
 		if (!btrfs_should_periodic_reclaim(space_info))
 			continue;
-		for (raid = 0; raid < BTRFS_NR_RAID_TYPES; raid++)
-			do_reclaim_sweep(space_info, raid);
+		for (raid = 0; raid < BTRFS_NR_RAID_TYPES; raid++) {
+			if (do_reclaim_sweep(space_info, raid))
+				btrfs_set_periodic_reclaim_ready(space_info, false);
+		}
 	}
 }

> 
> CC: Boris Burkov <boris@bur.io>
> Fixes: 813d4c6422516 ("btrfs: prevent pathological periodic reclaim loops")
> Signed-off-by: Sun YangKai <sunk67188@gmail.com>
> ---
>  fs/btrfs/block-group.c | 15 +++++++-------
>  fs/btrfs/space-info.c  | 44 +++++++++++++++++++-----------------------
>  fs/btrfs/space-info.h  | 28 ++++++++++++++++++---------
>  3 files changed, 46 insertions(+), 41 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index e417aba4c4c7..94a4068cd42a 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1871,6 +1871,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
>  	while (!list_empty(&fs_info->reclaim_bgs)) {
>  		u64 used;
>  		u64 reserved;
> +		u64 old_total;
>  		int ret = 0;
>  
>  		bg = list_first_entry(&fs_info->reclaim_bgs,
> @@ -1936,6 +1937,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
>  		}
>  
>  		spin_unlock(&bg->lock);
> +		old_total = space_info->total_bytes;
>  		spin_unlock(&space_info->lock);
>  
>  		/*
> @@ -1988,14 +1990,14 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
>  			reserved = 0;
>  			spin_lock(&space_info->lock);
>  			space_info->reclaim_errors++;
> -			if (READ_ONCE(space_info->periodic_reclaim))
> -				space_info->periodic_reclaim_ready = false;
>  			spin_unlock(&space_info->lock);
>  		}
>  		spin_lock(&space_info->lock);
>  		space_info->reclaim_count++;
>  		space_info->reclaim_bytes += used;
>  		space_info->reclaim_bytes += reserved;
> +		if (space_info->total_bytes < old_total)
> +			btrfs_resume_periodic_reclaim(space_info);

Why is this here? We've just completed a reclaim, which I would expect
to be neutral to the space_info->total_bytes (just moving them around).
So if (any) unrelated freeing happens to happen while we are reclaiming,
we resume? Doesn't seem wrong, but also seems a little specific and
random. I am probably missing some aspect of your new design.

Put a different way, what is special about frees that happen *while* we
are reclaiming?

>  		spin_unlock(&space_info->lock);
>  
>  next:
> @@ -3730,8 +3732,6 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
>  		space_info->bytes_reserved -= num_bytes;
>  		space_info->bytes_used += num_bytes;
>  		space_info->disk_used += num_bytes * factor;
> -		if (READ_ONCE(space_info->periodic_reclaim))
> -			btrfs_space_info_update_reclaimable(space_info, -num_bytes);
>  		spin_unlock(&cache->lock);
>  		spin_unlock(&space_info->lock);
>  	} else {
> @@ -3741,12 +3741,11 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
>  		btrfs_space_info_update_bytes_pinned(space_info, num_bytes);
>  		space_info->bytes_used -= num_bytes;
>  		space_info->disk_used -= num_bytes * factor;
> -		if (READ_ONCE(space_info->periodic_reclaim))
> -			btrfs_space_info_update_reclaimable(space_info, num_bytes);
> -		else
> -			reclaim = should_reclaim_block_group(cache, num_bytes);
> +		reclaim = should_reclaim_block_group(cache, num_bytes);

I think this is a bug with periodic_reclaim == 1

In that case, if should_reclaim_block_group() returns true (could be a
fixed or dynamic threshold), we will put that block group directly on
the reclaim list, which is a complete contradiction of the point of
periodic_reclaim.

>  
>  		spin_unlock(&cache->lock);
> +		if (reclaim)
> +			btrfs_resume_periodic_reclaim(space_info);

This also makes me wonder about the idea behind your change. If you want
periodic reclaim to "pause" until a block group meets the condition and
then we "resume", that's not exactly in the spirit of checking at a
periodic cadence, rather than as block groups update.

>  		spin_unlock(&space_info->lock);
>  
>  		btrfs_set_extent_bit(&trans->transaction->pinned_extents, bytenr,
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 7b7b7255f7d8..de8bde1081be 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -2119,48 +2119,44 @@ static void do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
>  	 * really need a block group, do take a fresh one.
>  	 */
>  	if (try_again && urgent) {
> -		try_again = false;
> +		urgent = false;
>  		goto again;
>  	}
>  
>  	up_read(&space_info->groups_sem);
> -}
> -
> -void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes)
> -{
> -	u64 chunk_sz = calc_effective_data_chunk_size(space_info->fs_info);
> -
> -	lockdep_assert_held(&space_info->lock);
> -	space_info->reclaimable_bytes += bytes;
>  
> -	if (space_info->reclaimable_bytes >= chunk_sz)
> -		btrfs_set_periodic_reclaim_ready(space_info, true);
> -}
> -
> -void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready)
> -{
> -	lockdep_assert_held(&space_info->lock);
> -	if (!READ_ONCE(space_info->periodic_reclaim))
> -		return;
> -	if (ready != space_info->periodic_reclaim_ready) {
> -		space_info->periodic_reclaim_ready = ready;
> -		if (!ready)
> -			space_info->reclaimable_bytes = 0;
> +	/*
> +	 * Temporary pause periodic reclaim until reclaim make some progress.
> +	 * This can prevent periodic reclaim keep happening but make no progress.
> +	 */
> +	if (!try_again) {
> +		spin_lock(&space_info->lock);
> +		btrfs_pause_periodic_reclaim(space_info);
> +		spin_unlock(&space_info->lock);
>  	}
>  }
>  
>  static bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info)
>  {
>  	bool ret;
> +	u64 chunk_sz;
> +	u64 unused;
>  
>  	if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
>  		return false;
>  	if (!READ_ONCE(space_info->periodic_reclaim))
>  		return false;
> +	if (!READ_ONCE(space_info->periodic_reclaim_paused))
> +		return true;

This condition doesn't feel like a "pause". If the "pause" is set to
false, then we proceed no matter what, otherwise we check conditions? It
should be something like:

if (READ_ONCE(space_info->periodic_reclaim_force))
        return true;

as it is acting like a "force" not a "not paused".

Hope that makes sense, it's obviously a bit tied up in language semantics..

> +
> +	chunk_sz = calc_effective_data_chunk_size(space_info->fs_info);
>  
>  	spin_lock(&space_info->lock);
> -	ret = space_info->periodic_reclaim_ready;
> -	btrfs_set_periodic_reclaim_ready(space_info, false);
> +	unused = space_info->total_bytes - space_info->bytes_used;

Probably makes sense to use a zoned aware wrapper for this, just in
case we make this friendly with zones eventually.

> +	ret = (unused >= space_info->last_reclaim_unused + chunk_sz ||
> +	       btrfs_calc_reclaim_threshold(space_info) != space_info->last_reclaim_threshold);

This second condition is quite interesting to me, I hadn't thought of
it. I think it makes some sense to care again if the threshold changed,
but it is a new behavior, rather than a fix, per-se.

What made you want to add this?

> +	if (ret)
> +		btrfs_resume_periodic_reclaim(space_info);
>  	spin_unlock(&space_info->lock);
>  
>  	return ret;
> diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
> index 0703f24b23f7..a49a4c7b0a68 100644
> --- a/fs/btrfs/space-info.h
> +++ b/fs/btrfs/space-info.h
> @@ -214,14 +214,11 @@ struct btrfs_space_info {
>  
>  	/*
>  	 * Periodic reclaim should be a no-op if a space_info hasn't
> -	 * freed any space since the last time we tried.
> +	 * freed any space since the last time we made no progress.
>  	 */
> -	bool periodic_reclaim_ready;
> -
> -	/*
> -	 * Net bytes freed or allocated since the last reclaim pass.
> -	 */
> -	s64 reclaimable_bytes;
> +	bool periodic_reclaim_paused;
> +	int last_reclaim_threshold;
> +	u64 last_reclaim_unused;
>  };
>  
>  static inline bool btrfs_mixed_space_info(const struct btrfs_space_info *space_info)
> @@ -301,9 +298,22 @@ void btrfs_dump_space_info_for_trans_abort(struct btrfs_fs_info *fs_info);
>  void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info);
>  u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
>  
> -void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes);
> -void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready);
>  int btrfs_calc_reclaim_threshold(const struct btrfs_space_info *space_info);
> +static inline void btrfs_resume_periodic_reclaim(struct btrfs_space_info *space_info)
> +{
> +	lockdep_assert_held(&space_info->lock);
> +	if (space_info->periodic_reclaim_paused)
> +		space_info->periodic_reclaim_paused = false;
> +}
> +static inline void btrfs_pause_periodic_reclaim(struct btrfs_space_info *space_info)
> +{
> +	lockdep_assert_held(&space_info->lock);
> +	if (!space_info->periodic_reclaim_paused) {
> +		space_info->periodic_reclaim_paused = true;
> +		space_info->last_reclaim_threshold = btrfs_calc_reclaim_threshold(space_info);
> +		space_info->last_reclaim_unused = space_info->total_bytes - space_info->bytes_used;
> +	}
> +}
>  void btrfs_reclaim_sweep(const struct btrfs_fs_info *fs_info);
>  void btrfs_return_free_space(struct btrfs_space_info *space_info, u64 len);
>  
> -- 
> 2.51.2
> 


* Re: [PATCH v2 1/7] btrfs: fix periodic reclaim condition
  2026-01-04 19:40   ` Boris Burkov
@ 2026-01-05 13:00     ` Sun Yangkai
  2026-01-05 18:21       ` Boris Burkov
  0 siblings, 1 reply; 16+ messages in thread
From: Sun Yangkai @ 2026-01-05 13:00 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs



On 2026/1/5 03:40, Boris Burkov wrote:
> On Sat, Jan 03, 2026 at 08:19:48PM +0800, Sun YangKai wrote:
> 
> First off, thank you very much for your attention to this and for
> your in depth fixes.

Glad to hear from you and Happy New Year! :)

>> Problems with current implementation:
>> 1. reclaimable_bytes is signed while chunk_sz is unsigned, causing
>>    negative reclaimable_bytes to trigger reclaim unexpectedly
> 
> Agreed, completely. IMO, the message here could use a bit more clarity
> that negative reclaimable_bytes is an intentionally designed condition
> so this is a fundamental bug, not a weird edge case.
> 
>> 2. The "space must be freed between scans" assumption breaks the
>>    two-scan requirement: first scan marks block groups, second scan
>>    reclaims them. Without the second scan, no reclamation occurs.
> 
> I think I understand what you are saying, but let me try to restate it
> myself and confirm:
> 
> Previously, due to bug 1, we were calling
> btrfs_set_periodic_reclaim_ready() too frequently (after an allocation
> that made the reclaimable_bytes negative). With bug 1 fixed, you really
> have to free a chunk_sz to call btrfs_set_periodic_reclaim_ready(). As a
> result, periodic_reclaim became way too conservative. Now the
> multi-sweep marking thing AND reclaim_ready mean we basically never
> reclaim, as the second sweep doesn't get triggered later.
> 
> Is that about right? I agree that this needs some re-thinking / fixing,
> if so.

Yes, that is the case. Your statement is much better than mine. A sweep always
reclaims the block groups marked by the previous sweep and marks block groups
for the next sweep. That explains why the assumption does not hold and why we
need a new assumption to work with.

>> Instead, track actual reclaim progress: pause reclaim when block groups
>> will be reclaimed, and resume only when progress is made. This ensures
>> reclaim continues until no further progress can be made, then resumes when
>> space_info changes or new reclaimable groups appear.
> 
> I think you are making a much bigger change than merely fixing the bugs
> in the original design, so it requires deeper explanation of the new
> invariants you are introducing. Clearly, I am one to talk about it, as I
> messed up with my attempt, but I still think it is very valuable for
> future readers. Thanks for bearing with me :)

Thank you for bringing us such a cool feature so I could try it out, learn the
details of the implementation and share some of my opinions. I'm quite happy
with the experience. I started with a simple fix, but the change kept getting
bigger to reduce regressions, and in the end it's more of a redesign than a
fix.

Sorry for the previously short but unclear statement; let me try to make it
clear in the new year :)

Let's start with the new assumptions:
1. Reclaim is for getting more unallocated block groups/space.
2. Periodic reclaim should happen periodically :)

I think both users and developers will agree with these, but I want to explain
more about 2. Currently, the bg->reclaim_mark related logic and the dynamic
periodic reclaim logic both depend on 2. Your previous statement explains the
former very well, so I'll focus on the latter here.

When we have less than 10GiB of unallocated space, dynamic periodic reclaim
kicks in. As the filesystem gets fuller, bg_reclaim_threshold gets larger,
making periodic reclaim more aggressive so it can reclaim some unallocated
space. However, if we paused periodic reclaim while the threshold was small,
like 19, then we'll never resume it even if we want to reclaim more
aggressively with a larger threshold later, and dynamic periodic reclaim will
not work as expected. This may happen when we first try reclaiming with
threshold 19 while having some quite large extents in 10% used block groups
and not enough contiguous free space in the other 70+% full block groups.
We'll fail to get more unallocated space, pause periodic reclaim, and then not
try reclaiming when the threshold grows past 79 later. This is why I resume
periodic reclaim when the threshold changes.

So the core issue in this part of the periodic reclaim logic is when to pause
and when to resume.

If we cannot free more unallocated block groups in periodic reclaim, we should
pause it, because we don't want to periodically just move things from one
block group to a newly allocated one and then remove the old one.

And we should resume periodic reclaim when we may be able to free more block
groups.

For pausing, it's hard to tell directly whether we're freeing more unallocated
space without changing the current code. But in btrfs_reclaim_bgs_work() we have

btrfs_relocate_chunk()
  ->btrfs_remove_chunk()
    ->btrfs_remove_block_group()

So the old block group is removed, and we can see that in
space_info->total_bytes.

Let's start with all block groups having the same size and reclaiming a single
block group. If we succeed in moving its extents into existing block groups,
we won't allocate a new one, and space_info->total_bytes gets smaller because
we've freed the old one. If we fail to move things into existing block groups
and allocate a new one, then space_info->total_bytes will not change.

When reclaiming n block groups, as long as we end up using less space, for
example only n-1 new block groups are allocated, or n new smaller ones, we're
making some progress and can detect that from space_info->total_bytes.

Reclaim may race with other work that frees/allocates block groups, which will
lead to either an unnecessary extra periodic reclaim or an unexpected pause.
An extra periodic reclaim is not a big deal, but the unexpected pause matters.
This is the only regression currently, and I think it will not happen
frequently.

For resuming, rather than tracking used_bytes, I think we should care more
about unused_bytes. This is inspired by calc_dynamic_reclaim_threshold(),
which returns 0 if there is not enough unused space. We should also take
reclaim_threshold into account to make dynamic periodic reclaim work.


> I think the simplest pure "fix" is to change the current "ready = false"
> setting from unconditional on every run of periodic reclaim to only when
> a sweep actually queues a block group for reclaim. I have compiled but
> not yet tested the code I included in this mail :) We can do that more
> carefully once we agree how we want to move forward.
> 
> If you still want to re-design things on top of a fix like this one, I
> am totally open to that.
> 
> Thanks again for all your work on this, and happy new year,
> Boris
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 6babbe333741..aad402485763 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -2095,12 +2095,13 @@ static bool is_reclaim_urgent(struct btrfs_space_info *space_info)
>  	return unalloc < data_chunk_size;
>  }
> 
> -static void do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
> +static bool do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
>  {
>  	struct btrfs_block_group *bg;
>  	int thresh_pct;
>  	bool try_again = true;
>  	bool urgent;
> +	bool ret = false;
> 
>  	spin_lock(&space_info->lock);
>  	urgent = is_reclaim_urgent(space_info);
> @@ -2122,8 +2123,10 @@ static void do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
>  		}
>  		bg->reclaim_mark++;
>  		spin_unlock(&bg->lock);
> -		if (reclaim)
> +		if (reclaim) {
>  			btrfs_mark_bg_to_reclaim(bg);
> +			ret = true;
> +		}
>  		btrfs_put_block_group(bg);
>  	}
> 
> @@ -2140,6 +2143,7 @@ static void do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
>  	}
> 
>  	up_read(&space_info->groups_sem);
> +	return ret;
>  }

It's a good idea to return bool here! I hadn't thought of it.

>  void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes)
> @@ -2149,7 +2153,8 @@ void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s6
>  	lockdep_assert_held(&space_info->lock);
>  	space_info->reclaimable_bytes += bytes;
> 
> -	if (space_info->reclaimable_bytes >= chunk_sz)
> +	if (space_info->reclaimable_bytes > 0 &&
> +	    space_info->reclaimable_bytes >= chunk_sz)
>  		btrfs_set_periodic_reclaim_ready(space_info, true);
>  }
> 
> @@ -2176,7 +2181,6 @@ static bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info)
> 
>  	spin_lock(&space_info->lock);
>  	ret = space_info->periodic_reclaim_ready;
> -	btrfs_set_periodic_reclaim_ready(space_info, false);
>  	spin_unlock(&space_info->lock);
> 
>  	return ret;
> @@ -2190,8 +2194,10 @@ void btrfs_reclaim_sweep(const struct btrfs_fs_info *fs_info)
>  	list_for_each_entry(space_info, &fs_info->space_info, list) {
>  		if (!btrfs_should_periodic_reclaim(space_info))
>  			continue;
> -		for (raid = 0; raid < BTRFS_NR_RAID_TYPES; raid++)
> -			do_reclaim_sweep(space_info, raid);
> +		for (raid = 0; raid < BTRFS_NR_RAID_TYPES; raid++) {
> +			if (do_reclaim_sweep(space_info, raid))
> +				btrfs_set_periodic_reclaim_ready(space_info, false);
> +		}

Even if we have block groups to reclaim in this sweep, we still set some
bg->reclaim_mark and expect the next periodic reclaim, which will not run, to
reclaim them.

>  	}
>  }

So I'm afraid this patch will introduce an obvious regression in periodic
reclaim. Adding the "changed threshold" condition can alleviate this when
dynamic reclaim is also enabled, but it will not work for a fixed threshold.
So I failed to come up with a simple fix patch with no obvious regression :(

>> CC: Boris Burkov <boris@bur.io>
>> Fixes: 813d4c6422516 ("btrfs: prevent pathological periodic reclaim loops")
>> Signed-off-by: Sun YangKai <sunk67188@gmail.com>
>> ---
>>  fs/btrfs/block-group.c | 15 +++++++-------
>>  fs/btrfs/space-info.c  | 44 +++++++++++++++++++-----------------------
>>  fs/btrfs/space-info.h  | 28 ++++++++++++++++++---------
>>  3 files changed, 46 insertions(+), 41 deletions(-)
>>
>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>> index e417aba4c4c7..94a4068cd42a 100644
>> --- a/fs/btrfs/block-group.c
>> +++ b/fs/btrfs/block-group.c
>> @@ -1871,6 +1871,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
>>  	while (!list_empty(&fs_info->reclaim_bgs)) {
>>  		u64 used;
>>  		u64 reserved;
>> +		u64 old_total;
>>  		int ret = 0;
>>  
>>  		bg = list_first_entry(&fs_info->reclaim_bgs,
>> @@ -1936,6 +1937,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
>>  		}
>>  
>>  		spin_unlock(&bg->lock);
>> +		old_total = space_info->total_bytes;
>>  		spin_unlock(&space_info->lock);
>>  
>>  		/*
>> @@ -1988,14 +1990,14 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
>>  			reserved = 0;
>>  			spin_lock(&space_info->lock);
>>  			space_info->reclaim_errors++;
>> -			if (READ_ONCE(space_info->periodic_reclaim))
>> -				space_info->periodic_reclaim_ready = false;
>>  			spin_unlock(&space_info->lock);
>>  		}
>>  		spin_lock(&space_info->lock);
>>  		space_info->reclaim_count++;
>>  		space_info->reclaim_bytes += used;
>>  		space_info->reclaim_bytes += reserved;
>> +		if (space_info->total_bytes < old_total)
>> +			btrfs_resume_periodic_reclaim(space_info);
> 
> Why is this here? We've just completed a reclaim, which I would expect
> to be neutral to the space_info->total_bytes (just moving them around).
> So if (any) unrelated freeing happens to happen while we are reclaiming,
> we resume? Doesn't seem wrong, but also seems a little specific and
> random. I am probably missing some aspect of your new design.
> 
> Put a different way, what is special about frees that happen *while* we
> are reclaiming?

As explained, I want to use this to detect if the reclaim freed some unallocated
space.

>>  		spin_unlock(&space_info->lock);
>>  
>>  next:
>> @@ -3730,8 +3732,6 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
>>  		space_info->bytes_reserved -= num_bytes;
>>  		space_info->bytes_used += num_bytes;
>>  		space_info->disk_used += num_bytes * factor;
>> -		if (READ_ONCE(space_info->periodic_reclaim))
>> -			btrfs_space_info_update_reclaimable(space_info, -num_bytes);
>>  		spin_unlock(&cache->lock);
>>  		spin_unlock(&space_info->lock);
>>  	} else {
>> @@ -3741,12 +3741,11 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
>>  		btrfs_space_info_update_bytes_pinned(space_info, num_bytes);
>>  		space_info->bytes_used -= num_bytes;
>>  		space_info->disk_used -= num_bytes * factor;
>> -		if (READ_ONCE(space_info->periodic_reclaim))
>> -			btrfs_space_info_update_reclaimable(space_info, num_bytes);
>> -		else
>> -			reclaim = should_reclaim_block_group(cache, num_bytes);
>> +		reclaim = should_reclaim_block_group(cache, num_bytes);
> 
> I think this is a bug with periodic_reclaim == 1
> 
> In that case, if should_reclaim_block_group() returns true (could be a
> fixed or dynamic threshold), we will put that block group directly on
> the reclaim list, which is a complete contradiction of the point of
> periodic_reclaim.

I guess you mean that just resuming periodic reclaim is enough and putting it
on the reclaim list is not necessary. I agree.

>>  
>>  		spin_unlock(&cache->lock);
>> +		if (reclaim)
>> +			btrfs_resume_periodic_reclaim(space_info);
> 
> This also makes me wonder about the idea behind your change. If you want
> periodic reclaim to "pause" until a block group meets the condition and
> then we "resume", that's not exactly in the spirit of checking at a
> periodic cadence, rather than as block groups update.

Because IMO we need periodic reclaim to reclaim this block group. So if it's
paused, resume it here so the next periodic reclaim will handle this block
group.

>>  		spin_unlock(&space_info->lock);
>>  
>>  		btrfs_set_extent_bit(&trans->transaction->pinned_extents, bytenr,
>> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
>> index 7b7b7255f7d8..de8bde1081be 100644
>> --- a/fs/btrfs/space-info.c
>> +++ b/fs/btrfs/space-info.c
>> @@ -2119,48 +2119,44 @@ static void do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
>>  	 * really need a block group, do take a fresh one.
>>  	 */
>>  	if (try_again && urgent) {
>> -		try_again = false;
>> +		urgent = false;
>>  		goto again;
>>  	}
>>  
>>  	up_read(&space_info->groups_sem);
>> -}
>> -
>> -void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes)
>> -{
>> -	u64 chunk_sz = calc_effective_data_chunk_size(space_info->fs_info);
>> -
>> -	lockdep_assert_held(&space_info->lock);
>> -	space_info->reclaimable_bytes += bytes;
>>  
>> -	if (space_info->reclaimable_bytes >= chunk_sz)
>> -		btrfs_set_periodic_reclaim_ready(space_info, true);
>> -}
>> -
>> -void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready)
>> -{
>> -	lockdep_assert_held(&space_info->lock);
>> -	if (!READ_ONCE(space_info->periodic_reclaim))
>> -		return;
>> -	if (ready != space_info->periodic_reclaim_ready) {
>> -		space_info->periodic_reclaim_ready = ready;
>> -		if (!ready)
>> -			space_info->reclaimable_bytes = 0;
>> +	/*
>> +	 * Temporary pause periodic reclaim until reclaim make some progress.
>> +	 * This can prevent periodic reclaim keep happening but make no progress.
>> +	 */
>> +	if (!try_again) {
>> +		spin_lock(&space_info->lock);
>> +		btrfs_pause_periodic_reclaim(space_info);
>> +		spin_unlock(&space_info->lock);
>>  	}
>>  }
>>  
>>  static bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info)
>>  {
>>  	bool ret;
>> +	u64 chunk_sz;
>> +	u64 unused;
>>  
>>  	if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
>>  		return false;
>>  	if (!READ_ONCE(space_info->periodic_reclaim))
>>  		return false;
>> +	if (!READ_ONCE(space_info->periodic_reclaim_paused))
>> +		return true;
> 
> This condition doesn't feel like a "pause". If the "pause" is set to
> false, then we proceed no matter what, otherwise we check conditions? It
> should be something like:

> if (READ_ONCE(space_info->periodic_reclaim_force))
>         return true;
> 
> as it is acting like a "force" not a "not paused".
> 
> Hope that makes sense, it's obviously a bit tied up in language semantics..

Because of assumption 2: if it's not paused, periodic reclaim should happen
periodically. I use "pause" here because I expect periodic reclaim to happen
periodically, and setting "pause" to true expresses that the periodic reclaim
should not keep ticking.

>> +
>> +	chunk_sz = calc_effective_data_chunk_size(space_info->fs_info);
>>  
>>  	spin_lock(&space_info->lock);
>> -	ret = space_info->periodic_reclaim_ready;
>> -	btrfs_set_periodic_reclaim_ready(space_info, false);
>> +	unused = space_info->total_bytes - space_info->bytes_used;
> 
> Probably makes sense to use a zoned aware wrapper for this, just in
> case we make this friendly with zones eventually.
> 
>> +	ret = (unused >= space_info->last_reclaim_unused + chunk_sz ||
>> +	       btrfs_calc_reclaim_threshold(space_info) != space_info->last_reclaim_threshold);
> 
> This second condition is quite interesting to me, I hadn't thought of
> it. I think it makes some sense to care again if the threshold changed,
> but it is a new behavior, rather than a fix, per-se.
> 
> What made you want to add this?

To make dynamic periodic reclaim work, but it will also help with a fixed
threshold when it's changed :)

>> +	if (ret)
>> +		btrfs_resume_periodic_reclaim(space_info);
>>  	spin_unlock(&space_info->lock);
>>  
>>  	return ret;
>> diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
>> index 0703f24b23f7..a49a4c7b0a68 100644
>> --- a/fs/btrfs/space-info.h
>> +++ b/fs/btrfs/space-info.h
>> @@ -214,14 +214,11 @@ struct btrfs_space_info {
>>  
>>  	/*
>>  	 * Periodic reclaim should be a no-op if a space_info hasn't
>> -	 * freed any space since the last time we tried.
>> +	 * freed any space since the last time we made no progress.
>>  	 */
>> -	bool periodic_reclaim_ready;
>> -
>> -	/*
>> -	 * Net bytes freed or allocated since the last reclaim pass.
>> -	 */
>> -	s64 reclaimable_bytes;
>> +	bool periodic_reclaim_paused;
>> +	int last_reclaim_threshold;
>> +	u64 last_reclaim_unused;
>>  };
>>  
>>  static inline bool btrfs_mixed_space_info(const struct btrfs_space_info *space_info)
>> @@ -301,9 +298,22 @@ void btrfs_dump_space_info_for_trans_abort(struct btrfs_fs_info *fs_info);
>>  void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info);
>>  u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
>>  
>> -void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes);
>> -void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready);
>>  int btrfs_calc_reclaim_threshold(const struct btrfs_space_info *space_info);
>> +static inline void btrfs_resume_periodic_reclaim(struct btrfs_space_info *space_info)
>> +{
>> +	lockdep_assert_held(&space_info->lock);
>> +	if (space_info->periodic_reclaim_paused)
>> +		space_info->periodic_reclaim_paused = false;
>> +}
>> +static inline void btrfs_pause_periodic_reclaim(struct btrfs_space_info *space_info)
>> +{
>> +	lockdep_assert_held(&space_info->lock);
>> +	if (!space_info->periodic_reclaim_paused) {
>> +		space_info->periodic_reclaim_paused = true;
>> +		space_info->last_reclaim_threshold = btrfs_calc_reclaim_threshold(space_info);
>> +		space_info->last_reclaim_unused = space_info->total_bytes - space_info->bytes_used;
>> +	}
>> +}
>>  void btrfs_reclaim_sweep(const struct btrfs_fs_info *fs_info);
>>  void btrfs_return_free_space(struct btrfs_space_info *space_info, u64 len);
>>  
>> -- 
>> 2.51.2
>>

Thanks,
Sun Yangkai


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 5/7] btrfs: reorder btrfs_block_group members to reduce struct size
  2026-01-03 12:19 ` [PATCH v2 5/7] btrfs: reorder btrfs_block_group members to reduce struct size Sun YangKai
@ 2026-01-05 15:07   ` Filipe Manana
  2026-01-05 15:26     ` Sun Yangkai
  0 siblings, 1 reply; 16+ messages in thread
From: Filipe Manana @ 2026-01-05 15:07 UTC (permalink / raw)
  To: Sun YangKai; +Cc: linux-btrfs

On Sat, Jan 3, 2026 at 1:06 PM Sun YangKai <sunk67188@gmail.com> wrote:
>
> Reorder struct btrfs_block_group fields to improve packing and reduce
> memory footprint from 624 to 600 bytes (24 bytes saved per instance).

The memory footprint is not reduced.

The structure's size is reduced, yes, but we are allocating block
groups using kzalloc, so we end up still using the kmalloc-1k slab.
Unless we could reduce the structure's size to 512 bytes, and
therefore use the kmalloc-512 slab, we are not saving any memory.

The number of cache lines also remains the same, so no improvements there.
Changing the order of the fields could also have an impact in the
caching (either for the best or the worst), but I don't think it will
be significantly visible for any realistic workload.
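
To illustrate the slab point: kmalloc rounds allocations up to the next
generic size class. The rounding helper below is my own sketch (not the
kernel's implementation), but it shows why both 624 and 600 bytes end up
backed by the same kmalloc-1k slab:

```c
#include <assert.h>
#include <stddef.h>

/* A subset of the generic kmalloc size classes used for objects > 256B. */
static const size_t kmalloc_caches[] = { 512, 1024, 2048, 4096 };

/* Sketch: which slab actually backs an allocation of `size` bytes. */
static size_t kmalloc_slab_size(size_t size)
{
	size_t i;

	for (i = 0; i < sizeof(kmalloc_caches) / sizeof(kmalloc_caches[0]); i++)
		if (size <= kmalloc_caches[i])
			return kmalloc_caches[i];
	return 0; /* larger sizes would come from other caches/pages */
}
```

Only shrinking the struct to 512 bytes or less would change which slab backs
it.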

Thanks.


>
> Here's pahole output after this change:
>
> struct btrfs_block_group {
>         struct btrfs_fs_info *     fs_info;              /*     0     8 */
>         struct btrfs_inode *       inode;                /*     8     8 */
>         u64                        start;                /*    16     8 */
>         u64                        length;               /*    24     8 */
>         u64                        pinned;               /*    32     8 */
>         u64                        reserved;             /*    40     8 */
>         u64                        used;                 /*    48     8 */
>         u64                        delalloc_bytes;       /*    56     8 */
>         /* --- cacheline 1 boundary (64 bytes) --- */
>         u64                        bytes_super;          /*    64     8 */
>         u64                        flags;                /*    72     8 */
>         u64                        cache_generation;     /*    80     8 */
>         u64                        global_root_id;       /*    88     8 */
>         u64                        commit_used;          /*    96     8 */
>         u32                        bitmap_high_thresh;   /*   104     4 */
>         u32                        bitmap_low_thresh;    /*   108     4 */
>         struct rw_semaphore        data_rwsem;           /*   112    40 */
>         /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
>         unsigned long              full_stripe_len;      /*   152     8 */
>         unsigned long              runtime_flags;        /*   160     8 */
>         spinlock_t                 lock;                 /*   168     4 */
>         unsigned int               ro;                   /*   172     4 */
>         enum btrfs_disk_cache_state disk_cache_state;    /*   176     4 */
>         enum btrfs_caching_type    cached;               /*   180     4 */
>         struct btrfs_caching_control * caching_ctl;      /*   184     8 */
>         /* --- cacheline 3 boundary (192 bytes) --- */
>         struct btrfs_space_info *  space_info;           /*   192     8 */
>         struct btrfs_free_space_ctl * free_space_ctl;    /*   200     8 */
>         struct rb_node             cache_node;           /*   208    24 */
>         struct list_head           list;                 /*   232    16 */
>         struct list_head           cluster_list;         /*   248    16 */
>         /* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
>         struct list_head           bg_list;              /*   264    16 */
>         struct list_head           ro_list;              /*   280    16 */
>         refcount_t                 refs;                 /*   296     4 */
>         atomic_t                   frozen;               /*   300     4 */
>         struct list_head           discard_list;         /*   304    16 */
>         /* --- cacheline 5 boundary (320 bytes) --- */
>         enum btrfs_discard_state   discard_state;        /*   320     4 */
>         int                        discard_index;        /*   324     4 */
>         u64                        discard_eligible_time; /*   328     8 */
>         u64                        discard_cursor;       /*   336     8 */
>         struct list_head           dirty_list;           /*   344    16 */
>         struct list_head           io_list;              /*   360    16 */
>         struct btrfs_io_ctl        io_ctl;               /*   376    72 */
>         /* --- cacheline 7 boundary (448 bytes) --- */
>         atomic_t                   reservations;         /*   448     4 */
>         atomic_t                   nocow_writers;        /*   452     4 */
>         struct mutex               free_space_lock;      /*   456    32 */
>         bool                       using_free_space_bitmaps; /*   488     1 */
>         bool                       using_free_space_bitmaps_cached; /*   489     1 */
>         bool                       reclaim_mark;         /*   490     1 */
>
>         /* XXX 1 byte hole, try to pack */
>
>         int                        swap_extents;         /*   492     4 */
>         u64                        alloc_offset;         /*   496     8 */
>         u64                        zone_unusable;        /*   504     8 */
>         /* --- cacheline 8 boundary (512 bytes) --- */
>         u64                        zone_capacity;        /*   512     8 */
>         u64                        meta_write_pointer;   /*   520     8 */
>         struct btrfs_chunk_map *   physical_map;         /*   528     8 */
>         struct list_head           active_bg_list;       /*   536    16 */
>         struct work_struct         zone_finish_work;     /*   552    32 */
>         /* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
>         struct extent_buffer *     last_eb;              /*   584     8 */
>         enum btrfs_block_group_size_class size_class;    /*   592     4 */
>
>         /* size: 600, cachelines: 10, members: 56 */
>         /* sum members: 595, holes: 1, sum holes: 1 */
>         /* padding: 4 */
>         /* last cacheline: 24 bytes */
> };
>
> Signed-off-by: Sun YangKai <sunk67188@gmail.com>
> ---
>  fs/btrfs/block-group.h | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> index 3b3c61b3af2c..88c2e3a0a5a0 100644
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -118,7 +118,6 @@ struct btrfs_caching_control {
>  struct btrfs_block_group {
>         struct btrfs_fs_info *fs_info;
>         struct btrfs_inode *inode;
> -       spinlock_t lock;
>         u64 start;
>         u64 length;
>         u64 pinned;
> @@ -159,6 +158,8 @@ struct btrfs_block_group {
>         unsigned long full_stripe_len;
>         unsigned long runtime_flags;
>
> +       spinlock_t lock;
> +
>         unsigned int ro;
>
>         int disk_cache_state;
> @@ -178,8 +179,6 @@ struct btrfs_block_group {
>         /* For block groups in the same raid type */
>         struct list_head list;
>
> -       refcount_t refs;
> -
>         /*
>          * List of struct btrfs_free_clusters for this block group.
>          * Today it will only have one thing on it, but that may change
> @@ -199,6 +198,8 @@ struct btrfs_block_group {
>         /* For read-only block groups */
>         struct list_head ro_list;
>
> +       refcount_t refs;
> +
>         /*
>          * When non-zero it means the block group's logical address and its
>          * device extents can not be reused for future block group allocations
> @@ -211,10 +212,10 @@ struct btrfs_block_group {
>
>         /* For discard operations */
>         struct list_head discard_list;
> +       enum btrfs_discard_state discard_state;
>         int discard_index;
>         u64 discard_eligible_time;
>         u64 discard_cursor;
> -       enum btrfs_discard_state discard_state;
>
>         /* For dirty block groups */
>         struct list_head dirty_list;
> --
> 2.51.2
>
>


* Re: [PATCH v2 5/7] btrfs: reorder btrfs_block_group members to reduce struct size
  2026-01-05 15:07   ` Filipe Manana
@ 2026-01-05 15:26     ` Sun Yangkai
  0 siblings, 0 replies; 16+ messages in thread
From: Sun Yangkai @ 2026-01-05 15:26 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs



在 2026/1/5 23:07, Filipe Manana 写道:
> On Sat, Jan 3, 2026 at 1:06 PM Sun YangKai <sunk67188@gmail.com> wrote:
>>
>> Reorder struct btrfs_block_group fields to improve packing and reduce
>> memory footprint from 624 to 600 bytes (24 bytes saved per instance).
> 
> The memory footprint is not reduced.
> 
> The structure's size is reduced, yes, but we are allocating block
> groups using kzalloc, so we end up still using the kmalloc-1k slab.
> Unless we could reduce the structure's size to 512 bytes, and
> therefore use the kmalloc-512 slab, we are not saving any memory.

Thank you for your review opinions.

Currently we have a size of ~600 bytes either with or without this patch, and
I'm wondering if we can use a dedicated kmem_cache for btrfs_block_group. I'm
not sure, but it should also make it easier to trace the allocations.

> The number of cache lines also remains the same, so no improvements there.
> Changing the order of the fields could also have an impact in the
> caching (either for the best or the worst), but I don't think it will
> be significantly visible for any realistic workload.

I didn't take this into consideration because the fields of btrfs_block_group
don't seem to be ordered for cache locality, but thanks a lot for telling me
this :)

Thanks.

> Thanks.
> 
> 
>>
>> Here's pahole output after this change:
>>
>> struct btrfs_block_group {
>>         struct btrfs_fs_info *     fs_info;              /*     0     8 */
>>         struct btrfs_inode *       inode;                /*     8     8 */
>>         u64                        start;                /*    16     8 */
>>         u64                        length;               /*    24     8 */
>>         u64                        pinned;               /*    32     8 */
>>         u64                        reserved;             /*    40     8 */
>>         u64                        used;                 /*    48     8 */
>>         u64                        delalloc_bytes;       /*    56     8 */
>>         /* --- cacheline 1 boundary (64 bytes) --- */
>>         u64                        bytes_super;          /*    64     8 */
>>         u64                        flags;                /*    72     8 */
>>         u64                        cache_generation;     /*    80     8 */
>>         u64                        global_root_id;       /*    88     8 */
>>         u64                        commit_used;          /*    96     8 */
>>         u32                        bitmap_high_thresh;   /*   104     4 */
>>         u32                        bitmap_low_thresh;    /*   108     4 */
>>         struct rw_semaphore        data_rwsem;           /*   112    40 */
>>         /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
>>         unsigned long              full_stripe_len;      /*   152     8 */
>>         unsigned long              runtime_flags;        /*   160     8 */
>>         spinlock_t                 lock;                 /*   168     4 */
>>         unsigned int               ro;                   /*   172     4 */
>>         enum btrfs_disk_cache_state disk_cache_state;    /*   176     4 */
>>         enum btrfs_caching_type    cached;               /*   180     4 */
>>         struct btrfs_caching_control * caching_ctl;      /*   184     8 */
>>         /* --- cacheline 3 boundary (192 bytes) --- */
>>         struct btrfs_space_info *  space_info;           /*   192     8 */
>>         struct btrfs_free_space_ctl * free_space_ctl;    /*   200     8 */
>>         struct rb_node             cache_node;           /*   208    24 */
>>         struct list_head           list;                 /*   232    16 */
>>         struct list_head           cluster_list;         /*   248    16 */
>>         /* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
>>         struct list_head           bg_list;              /*   264    16 */
>>         struct list_head           ro_list;              /*   280    16 */
>>         refcount_t                 refs;                 /*   296     4 */
>>         atomic_t                   frozen;               /*   300     4 */
>>         struct list_head           discard_list;         /*   304    16 */
>>         /* --- cacheline 5 boundary (320 bytes) --- */
>>         enum btrfs_discard_state   discard_state;        /*   320     4 */
>>         int                        discard_index;        /*   324     4 */
>>         u64                        discard_eligible_time; /*   328     8 */
>>         u64                        discard_cursor;       /*   336     8 */
>>         struct list_head           dirty_list;           /*   344    16 */
>>         struct list_head           io_list;              /*   360    16 */
>>         struct btrfs_io_ctl        io_ctl;               /*   376    72 */
>>         /* --- cacheline 7 boundary (448 bytes) --- */
>>         atomic_t                   reservations;         /*   448     4 */
>>         atomic_t                   nocow_writers;        /*   452     4 */
>>         struct mutex               free_space_lock;      /*   456    32 */
>>         bool                       using_free_space_bitmaps; /*   488     1 */
>>         bool                       using_free_space_bitmaps_cached; /*   489     1 */
>>         bool                       reclaim_mark;         /*   490     1 */
>>
>>         /* XXX 1 byte hole, try to pack */
>>
>>         int                        swap_extents;         /*   492     4 */
>>         u64                        alloc_offset;         /*   496     8 */
>>         u64                        zone_unusable;        /*   504     8 */
>>         /* --- cacheline 8 boundary (512 bytes) --- */
>>         u64                        zone_capacity;        /*   512     8 */
>>         u64                        meta_write_pointer;   /*   520     8 */
>>         struct btrfs_chunk_map *   physical_map;         /*   528     8 */
>>         struct list_head           active_bg_list;       /*   536    16 */
>>         struct work_struct         zone_finish_work;     /*   552    32 */
>>         /* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
>>         struct extent_buffer *     last_eb;              /*   584     8 */
>>         enum btrfs_block_group_size_class size_class;    /*   592     4 */
>>
>>         /* size: 600, cachelines: 10, members: 56 */
>>         /* sum members: 595, holes: 1, sum holes: 1 */
>>         /* padding: 4 */
>>         /* last cacheline: 24 bytes */
>> };
>>
>> Signed-off-by: Sun YangKai <sunk67188@gmail.com>
>> ---
>>  fs/btrfs/block-group.h | 9 +++++----
>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
>> index 3b3c61b3af2c..88c2e3a0a5a0 100644
>> --- a/fs/btrfs/block-group.h
>> +++ b/fs/btrfs/block-group.h
>> @@ -118,7 +118,6 @@ struct btrfs_caching_control {
>>  struct btrfs_block_group {
>>         struct btrfs_fs_info *fs_info;
>>         struct btrfs_inode *inode;
>> -       spinlock_t lock;
>>         u64 start;
>>         u64 length;
>>         u64 pinned;
>> @@ -159,6 +158,8 @@ struct btrfs_block_group {
>>         unsigned long full_stripe_len;
>>         unsigned long runtime_flags;
>>
>> +       spinlock_t lock;
>> +
>>         unsigned int ro;
>>
>>         int disk_cache_state;
>> @@ -178,8 +179,6 @@ struct btrfs_block_group {
>>         /* For block groups in the same raid type */
>>         struct list_head list;
>>
>> -       refcount_t refs;
>> -
>>         /*
>>          * List of struct btrfs_free_clusters for this block group.
>>          * Today it will only have one thing on it, but that may change
>> @@ -199,6 +198,8 @@ struct btrfs_block_group {
>>         /* For read-only block groups */
>>         struct list_head ro_list;
>>
>> +       refcount_t refs;
>> +
>>         /*
>>          * When non-zero it means the block group's logical address and its
>>          * device extents can not be reused for future block group allocations
>> @@ -211,10 +212,10 @@ struct btrfs_block_group {
>>
>>         /* For discard operations */
>>         struct list_head discard_list;
>> +       enum btrfs_discard_state discard_state;
>>         int discard_index;
>>         u64 discard_eligible_time;
>>         u64 discard_cursor;
>> -       enum btrfs_discard_state discard_state;
>>
>>         /* For dirty block groups */
>>         struct list_head dirty_list;
>> --
>> 2.51.2
>>
>>



* Re: [PATCH v2 1/7] btrfs: fix periodic reclaim condition
  2026-01-05 13:00     ` Sun Yangkai
@ 2026-01-05 18:21       ` Boris Burkov
  2026-01-07 14:09         ` Sun Yangkai
  0 siblings, 1 reply; 16+ messages in thread
From: Boris Burkov @ 2026-01-05 18:21 UTC (permalink / raw)
  To: Sun Yangkai; +Cc: linux-btrfs

On Mon, Jan 05, 2026 at 09:00:15PM +0800, Sun Yangkai wrote:
> 
> 
> 在 2026/1/5 03:40, Boris Burkov 写道:
> > On Sat, Jan 03, 2026 at 08:19:48PM +0800, Sun YangKai wrote:
> > 
> > First off, thank you very much for your attention to this and for
> > your in depth fixes.
> 
> Glad to hear from you and Happy New Year! :)
> 
> >> Problems with current implementation:
> >> 1. reclaimable_bytes is signed while chunk_sz is unsigned, causing
> >>    negative reclaimable_bytes to trigger reclaim unexpectedly
> > 
> > Agreed, completely. IMO, the message here could use a bit more clarity
> > that negative reclaimable_bytes is an intentionally designed condition
> > so this is a fundamental bug, not a weird edge case.
> > 
> >> 2. The "space must be freed between scans" assumption breaks the
> >>    two-scan requirement: first scan marks block groups, second scan
> >>    reclaims them. Without the second scan, no reclamation occurs.
> > 
> > I think I understand what you are saying, but let me try to restate it
> > myself and confirm:
> > 
> > Previously, due to bug 1, we were calling
> > btrfs_set_periodic_reclaim_ready() too frequently (after an allocation
> > that made the reclaimable_bytes negative). With bug 1 fixed, you really
> > have to free a chunk_sz to call btrfs_set_periodic_reclaim_ready(). As a
> > result, periodic_reclaim became way too conservative. Now the
> > multi-sweep marking thing AND reclaim_ready mean we basically never
> > reclaim, as the second sweep doesn't get triggered later.
> > 
> > Is that about right? I agree that this needs some re-thinking / fixing,
> > if so.
> 
> Yes, that is the case. Your statement is much better than mine. A sweep always
> reclaims block groups marked by the previous sweep and marks block groups for
> the next sweep. That explains why the assumption is wrong and why we need a new
> assumption to work from.
> 
> >> Instead, track actual reclaim progress: pause reclaim when block groups
> >> will be reclaimed, and resume only when progress is made. This ensures
> >> reclaim continues until no further progress can be made, then resumes when
> >> space_info changes or new reclaimable groups appear.
> > 
> > I think you are making a much bigger change than merely fixing the bugs
> > in the original design, so it requires deeper explanation of the new
> > invariants you are introducing. Clearly, I am one to talk about it, as I
> > messed up with my attempt, but I still think it is very valuable for
> > future readers. Thanks for bearing with me :)
> 
> Thank you for bringing us such a cool feature so I could try it out, learn the
> details of the implementation, and offer some of my opinions. I'm quite happy
> with this experience. I started with a simple fix, but the change kept getting
> bigger in order to reduce regressions, and in the end it's more of a redesign
> than a fix.
> 
> Sorry for the previous short but unclear statement and let me try to make it
> clear in the new year :)
> 
> Let's start with the new assumptions:
> 1. Reclaim is for getting more unallocated block groups/space.
> 2. Periodic reclaim should happen periodically :)
> 
> I think both users and developers will agree with this but I want to explain
> more about 2. Currently, the bg->reclaim_mark related logic and dynamic periodic
> reclaim logic both depend on 2. Your previous statement explains the former one
> very well and I'll focus on the latter one here.
> 
> When we have less than 10GiB unallocated space, dynamic periodic reclaim will
> kick in. With the filesystem getting fuller, the bg_reclaim_threshold is getting
> larger to make periodic reclaim more aggressive to reclaim some unallocated
> space. However, if we paused periodic reclaim when the threshold is small, like
> 19, then we'll never resume it even if we want to reclaim more aggressively with
> a larger threshold later. And dynamic periodic reclaim will not work as
> expected. This may happen when we first try reclaiming with threshold 19, while
> having some quite large extents in 10% used blockgroups and having no enough
> continuous free space in other 70+% full blockgroups. We'll fail to get more
> unallocated space, pause periodic reclaim, and not try reclaiming when the
> threshold gets larger than 79 later. This is why I resume periodic reclaim when
> threshold changes.
> 
> So the core issue in this part of the periodic reclaim logic is when to pause
> and when to resume.

Agreed. And I think the larger point is that the signedness bug
invalidates just about everything to do with the existing "reclaim
ready" heuristics.

> 
> If we cannot free more unallocated block groups in periodic reclaim, we should
> pause it, because we don't want to just periodically move things from a block
> group into a newly allocated block group and then remove the old one.

I observed exactly this, as well as reclaim attempts that would make a
bunch of allocations, then fail on a later allocation attempt relocating
a larger extent, which is a bunch of pointless CPU and lock churn in the
allocation algorithm.

> 
> And we should resume periodic reclaim when we may be able to free more
> block groups.
> 

This assumption I would push back on. This was my initial assumption as
well, but it is worth challenging. At Meta, we used to use the automatic
threshold based reclaim that reclaims every bg that goes over threshold
(25% for us) then back under. This actually creates a huge amount of
reclaim and was the reason for even coming up with dynamic thresholds.
In practice, switching to the buggy dynamic+periodic drastically reduced
this extra reclaim.

Suppose we allocate some block extents / block groups then free half the
space and fragment them. You might think it's best to rapidly compact
everything. But as long as we have unallocated and the holes are big
enough for allocations to come through and re-fill them, it's mostly OK.

We much more want to do a "minimal" amount of reclaim until we truly
need it.

With all that said, your points above about the threshold going up but
reclaim staying asleep make a lot of sense to me. That is a signal that
the "need" is increasing.

> For pausing, it's hard to tell directly whether we're freeing more unallocated
> space without changing the current code. But in btrfs_reclaim_bgs_work() we have
> 
> btrfs_relocate_chunk()
>   ->btrfs_remove_chunk()
>     ->btrfs_remove_block_group()
> 
> So the old blockgroup is removed and we can see that in space_info->total_bytes.
> 
> Let's start with all the block groups having the same size and reclaiming a
> single block group. If we succeed in moving extents into existing block groups,
> we won't allocate a new one, and space_info->total_bytes gets smaller
> because we've freed the old one. If we fail to move things into existing bgs
> and allocate a new one, then space_info->total_bytes will not change.

Some of my comments on this point were silly, I was thinking about
used_bytes which would not change, not total_bytes. Thanks for the
detailed description to knock me out of that mistake.

Also, I'm not sure the logic about total_bytes is 100% ironclad, as
I haven't carefully checked what block group allocation/freeing is
possible concurrently over the span where you track the sizes.

> 
> When reclaiming n block groups, as long as we end up using less space, e.g.
> only n-1 new block groups are allocated, or n new smaller ones, we're making
> some progress and can detect that from space_info->total_bytes.
> 
> Reclaim may race with other work that frees/allocates block groups, which
> will lead to either an unnecessary extra periodic reclaim or an unexpected
> pause. An extra periodic reclaim is not a big deal, but the unexpected pause
> matters. This is the only regression currently, and I think it will not happen
> frequently.

OK, we're on the same page with the possibility of some racing :)
Your explanation makes sense and I don't think it's a huge deal or
anything.
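
Just to restate your progress check in a form I can poke at, here is a toy
model (equal-sized block groups, invented sizes; obviously not the kernel
code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model of "did reclaim make progress?": reclaiming n block groups
 * counts as progress only if fewer new block groups were allocated to
 * hold the relocated extents, i.e. total_bytes shrank. All sizes are
 * invented for illustration.
 */
struct toy_space_info {
	uint64_t total_bytes;
};

/* Returns the total_bytes snapshot taken before the relocation. */
static uint64_t relocate(struct toy_space_info *si, uint64_t bg_sz,
			 unsigned int reclaimed, unsigned int newly_alloc)
{
	uint64_t before = si->total_bytes;

	si->total_bytes -= (uint64_t)reclaimed * bg_sz;   /* old bgs removed */
	si->total_bytes += (uint64_t)newly_alloc * bg_sz; /* new homes added */
	return before;
}

static bool made_progress(const struct toy_space_info *si, uint64_t before)
{
	return si->total_bytes < before;
}
```

Reclaiming two bgs into one new one shrinks total_bytes; shuffling one bg into
one fresh bg leaves it unchanged, which is exactly the "no progress" case.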

> 
> For resuming, rather than tracking used_bytes, I think we should care more
> about unused_bytes. This is inspired by calc_dynamic_reclaim_threshold(), which
> will return 0 if there's not enough unused space. We should also care about
> the reclaim threshold to make dynamic periodic reclaim work.
> 
> 

First, I think let's agree what we want and don't want to pause:
I don't think we care about running the reclaim sweep and checking
thresholds, updating marks, etc. That is relatively cheap. So if we
change the pausing logic to still allow that work more often, it's OK.
We really want to prevent triggering reclaims in tight fail loops.

With that said, let's consider all the cases again.

1. We scan all the bgs and don't attempt any reclaims.
2. We attempt a reclaim of BG X at threshold T and it completely fails to
allocate a new home for extent E of X.
3. We attempt a reclaim of BG X at threshold T and it "succeeds" by
allocating a new BG Y and moving all the extents from X to Y.
4. We attempt a reclaim of BG X at threshold T and it makes proper
progress and puts some of the extents of X into the existing bgs.

After case 1, I think we agree pausing isn't needed, and is in fact bad.

After cases 2 and 3, we want to avoid triggering more reclaim until we
have evidence that it is likely a reclaim could possibly succeed. If
unused gets worse since then, and the threshold goes up, it is
*possible* a reclaim will succeed (we might pick a more full block group
that happens to be full of smaller easy to allocate extents) but it is
much more likely it won't. As far as I can tell, the best signal for
unpausing reclaim for the "we have seen a reclaim fail" case is still
"relative freeing" of extents.

After case 4, we don't *need* to pause, as it is going to increase
unallocated and reduce the "pressure" on the reclaim system, but it
doesn't fundamentally hurt either. We made some progress and we want to
be conservative and not be too greedy. Let allocation breathe and do
some work to fill the gaps up. However, this case is quite different
from 2 and 3 in that a subsequent relocation seems likely enough to also
succeed.

My original (designed) logic was "pause on 2,3,4; unpause on the
conditions that unstick 2,3" which I think you have correctly argued is
suboptimal. 4 can get stuck never reclaiming even as things get worse,
even if we might succeed (we've never failed..)

So I think the best (short term) solution is:
"detect failure modes 2 and 3 and unpause on release"

I believe your patches already enact "detect failure modes 2 and 3" and
we just disagree on the unpausing mechanism.
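
To make sure we agree on the policy, here is how I would write it down as a
sketch (the enum names are invented for this illustration, not kernel
identifiers):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the pausing policy discussed above. The input describes what
 * a sweep's reclaim attempt did; cases 2 and 3 are the failure modes
 * worth pausing on, while cases 1 and 4 should keep sweeping.
 */
enum sweep_outcome {
	SWEEP_NO_ATTEMPT,       /* case 1: nothing crossed the threshold */
	SWEEP_ALLOC_FAILED,     /* case 2: no new home for some extent */
	SWEEP_MOVED_TO_NEW_BG,  /* case 3: only shuffled into a fresh bg */
	SWEEP_MOVED_TO_EXISTING /* case 4: real compaction progress */
};

static bool should_pause(enum sweep_outcome outcome)
{
	/* Pause only on the failure modes; otherwise stay awake. */
	return outcome == SWEEP_ALLOC_FAILED ||
	       outcome == SWEEP_MOVED_TO_NEW_BG;
}
```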

> > I think the simplest pure "fix" is to change the current "ready = false"
> > setting from unconditional on every run of periodic reclaim to only when
> > a sweep actually queues a block group for reclaim. I have compiled but
> > not yet tested the code I included in this mail :) We can do that more
> > carefully once we agree how we want to move forward.
> > 
> > If you still want to re-design things on top of a fix like this one, I
> > am totally open to that.
> > 
> > Thanks again for all your work on this, and happy new year,
> > Boris
> > 
> > diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> > index 6babbe333741..aad402485763 100644
> > --- a/fs/btrfs/space-info.c
> > +++ b/fs/btrfs/space-info.c
> > @@ -2095,12 +2095,13 @@ static bool is_reclaim_urgent(struct btrfs_space_info *space_info)
> >  	return unalloc < data_chunk_size;
> >  }
> > 
> > -static void do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
> > +static bool do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
> >  {
> >  	struct btrfs_block_group *bg;
> >  	int thresh_pct;
> >  	bool try_again = true;
> >  	bool urgent;
> > +	bool ret = false;
> > 
> >  	spin_lock(&space_info->lock);
> >  	urgent = is_reclaim_urgent(space_info);
> > @@ -2122,8 +2123,10 @@ static void do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
> >  		}
> >  		bg->reclaim_mark++;
> >  		spin_unlock(&bg->lock);
> > -		if (reclaim)
> > +		if (reclaim) {
> >  			btrfs_mark_bg_to_reclaim(bg);
> > +			ret = true;
> > +		}
> >  		btrfs_put_block_group(bg);
> >  	}
> > 
> > @@ -2140,6 +2143,7 @@ static void do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
> >  	}
> > 
> >  	up_read(&space_info->groups_sem);
> > +	return ret;
> >  }
> 
> It's a good idea to return bool here! I hadn't thought of it.
> 
> >  void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes)
> > @@ -2149,7 +2153,8 @@ void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s6
> >  	lockdep_assert_held(&space_info->lock);
> >  	space_info->reclaimable_bytes += bytes;
> > 
> > -	if (space_info->reclaimable_bytes >= chunk_sz)
> > +	if (space_info->reclaimable_bytes > 0 &&
> > +	    space_info->reclaimable_bytes >= chunk_sz)
> >  		btrfs_set_periodic_reclaim_ready(space_info, true);
> >  }
> > 
> > @@ -2176,7 +2181,6 @@ static bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info)
> > 
> >  	spin_lock(&space_info->lock);
> >  	ret = space_info->periodic_reclaim_ready;
> > -	btrfs_set_periodic_reclaim_ready(space_info, false);
> >  	spin_unlock(&space_info->lock);
> > 
> >  	return ret;
> > @@ -2190,8 +2194,10 @@ void btrfs_reclaim_sweep(const struct btrfs_fs_info *fs_info)
> >  	list_for_each_entry(space_info, &fs_info->space_info, list) {
> >  		if (!btrfs_should_periodic_reclaim(space_info))
> >  			continue;
> > -		for (raid = 0; raid < BTRFS_NR_RAID_TYPES; raid++)
> > -			do_reclaim_sweep(space_info, raid);
> > +		for (raid = 0; raid < BTRFS_NR_RAID_TYPES; raid++) {
> > +			if (do_reclaim_sweep(space_info, raid))
> > +				btrfs_set_periodic_reclaim_ready(space_info, false);
> > +		}
> 
> Even if we have blockgroups to reclaim in this turn, we still mark some
> bg->reclaim_mark and expect the next periodic reclaim, which will not run, might
> reclaim them.

Why won't it run? We only pause if we truly called
btrfs_mark_bg_to_reclaim() (the word mark is unfortunately overloaded
here... that function really queues the bg for a reclaim attempt)
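The distinction can be made concrete with a small user-space model (an illustrative sketch only, not the kernel code; the `model_*` names are invented here):

```c
#include <stdbool.h>

/* Minimal stand-in for the space_info state involved here. */
struct model_space_info {
	bool periodic_reclaim_ready;
};

/*
 * Stand-in for do_reclaim_sweep(): returns true only if at least one
 * block group was actually queued via btrfs_mark_bg_to_reclaim().
 */
static bool model_do_reclaim_sweep(bool queued_any_bg)
{
	return queued_any_bg;
}

/*
 * Stand-in for btrfs_reclaim_sweep(): the ready flag is cleared only
 * when a sweep queues something, so sweeps that queue nothing keep
 * running at the periodic cadence.  Returns true if the sweep ran.
 */
bool model_reclaim_sweep(struct model_space_info *si, bool queued_any_bg)
{
	if (!si->periodic_reclaim_ready)
		return false;
	if (model_do_reclaim_sweep(queued_any_bg))
		si->periodic_reclaim_ready = false;
	return true;
}
```

So a sweep that merely walks the block groups and bumps reclaim_mark does not pause; only actually queueing a reclaim attempt does.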

> 
> >  	}
> >  }
> 
> So I'm afraid this patch will introduce obvious regression on periodic reclaim.
> Add the "changed threshold" condition can alleviate this when dynamic reclaim is
> also enabled but will not work for fixed threshold. So I failed to come up with
> a simple fix patch with no obvious regression :(
> 

Can you share a description of any workloads you've been testing
against? I think that would also help frame the discussion to avoid me
talking too much :D

The regression I can foresee here is that the successful passes will
pause for too long (until the next chunk_sz of freeing)

We can also relax that condition to something more like an extent size
(1M or BTRFS_MAX_EXTENT_SIZE perhaps) to make it a little gentler of a pause.

We can also unpause on "urgent". I think that would likely be
sufficient. I'll do some actual experiments.. :)
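A sketch of those two relaxations (the 1M step and the "urgent" override are assumptions from this discussion, not code from the series):

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_SZ_1M (1024ULL * 1024ULL)

/* Gentler wake-up step: an extent-sized amount instead of a full chunk. */
uint64_t model_ready_step(uint64_t chunk_sz)
{
	return chunk_sz < MODEL_SZ_1M ? chunk_sz : MODEL_SZ_1M;
}

/*
 * Unpause either when enough bytes have been freed since the pause, or
 * unconditionally when the space_info is already in the "urgent" state.
 */
bool model_should_unpause(uint64_t freed_since_pause, uint64_t chunk_sz,
			  bool urgent)
{
	if (urgent)
		return true;
	return freed_since_pause >= model_ready_step(chunk_sz);
}
```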

Thanks,
Boris

> >> CC: Boris Burkov <boris@bur.io>
> >> Fixes: 813d4c6422516 ("btrfs: prevent pathological periodic reclaim loops")
> >> Signed-off-by: Sun YangKai <sunk67188@gmail.com>
> >> ---
> >>  fs/btrfs/block-group.c | 15 +++++++-------
> >>  fs/btrfs/space-info.c  | 44 +++++++++++++++++++-----------------------
> >>  fs/btrfs/space-info.h  | 28 ++++++++++++++++++---------
> >>  3 files changed, 46 insertions(+), 41 deletions(-)
> >>
> >> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> >> index e417aba4c4c7..94a4068cd42a 100644
> >> --- a/fs/btrfs/block-group.c
> >> +++ b/fs/btrfs/block-group.c
> >> @@ -1871,6 +1871,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> >>  	while (!list_empty(&fs_info->reclaim_bgs)) {
> >>  		u64 used;
> >>  		u64 reserved;
> >> +		u64 old_total;
> >>  		int ret = 0;
> >>  
> >>  		bg = list_first_entry(&fs_info->reclaim_bgs,
> >> @@ -1936,6 +1937,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> >>  		}
> >>  
> >>  		spin_unlock(&bg->lock);
> >> +		old_total = space_info->total_bytes;
> >>  		spin_unlock(&space_info->lock);
> >>  
> >>  		/*
> >> @@ -1988,14 +1990,14 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> >>  			reserved = 0;
> >>  			spin_lock(&space_info->lock);
> >>  			space_info->reclaim_errors++;
> >> -			if (READ_ONCE(space_info->periodic_reclaim))
> >> -				space_info->periodic_reclaim_ready = false;
> >>  			spin_unlock(&space_info->lock);
> >>  		}
> >>  		spin_lock(&space_info->lock);
> >>  		space_info->reclaim_count++;
> >>  		space_info->reclaim_bytes += used;
> >>  		space_info->reclaim_bytes += reserved;
> >> +		if (space_info->total_bytes < old_total)
> >> +			btrfs_resume_periodic_reclaim(space_info);
> > 
> > Why is this here? We've just completed a reclaim, which I would expect
> > to be neutral to the space_info->total_bytes (just moving them around).
> > So if (any) unrelated freeing happens to happen while we are reclaiming,
> > we resume? Doesn't seem wrong, but also seems a little specific and
> > random. I am probably missing some aspect of your new design.
> > 
> > Put a different way, what is special about frees that happen *while* we
> > are reclaiming?
> 
> As explained, I want to use this to detect if the reclaim freed some unallocated
> space.
> 

Makes sense now.
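The detection scheme described above boils down to comparing total_bytes across the relocation (a sketch only; as noted, concurrent chunk allocation or freeing can race with this):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Snapshot space_info->total_bytes before btrfs_relocate_chunk() and
 * compare it afterwards: if the total shrank, the removed chunk was not
 * replaced by a same-sized new one, i.e. the relocation returned space
 * to the unallocated pool.
 */
bool model_reclaim_made_progress(uint64_t total_before, uint64_t total_after)
{
	return total_after < total_before;
}
```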

> >>  		spin_unlock(&space_info->lock);
> >>  
> >>  next:
> >> @@ -3730,8 +3732,6 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> >>  		space_info->bytes_reserved -= num_bytes;
> >>  		space_info->bytes_used += num_bytes;
> >>  		space_info->disk_used += num_bytes * factor;
> >> -		if (READ_ONCE(space_info->periodic_reclaim))
> >> -			btrfs_space_info_update_reclaimable(space_info, -num_bytes);
> >>  		spin_unlock(&cache->lock);
> >>  		spin_unlock(&space_info->lock);
> >>  	} else {
> >> @@ -3741,12 +3741,11 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> >>  		btrfs_space_info_update_bytes_pinned(space_info, num_bytes);
> >>  		space_info->bytes_used -= num_bytes;
> >>  		space_info->disk_used -= num_bytes * factor;
> >> -		if (READ_ONCE(space_info->periodic_reclaim))
> >> -			btrfs_space_info_update_reclaimable(space_info, num_bytes);
> >> -		else
> >> -			reclaim = should_reclaim_block_group(cache, num_bytes);
> >> +		reclaim = should_reclaim_block_group(cache, num_bytes);
> > 
> > I think this is a bug with periodic_reclaim == 1
> > 
> > In that case, if should_reclaim_block_group() returns true (could be a
> > fixed or dynamic threshold), we will put that block group directly on
> > the reclaim list, which is a complete contradiction of the point of
> > periodic_reclaim.
> 
> I guess you mean just resume periodic reclaim is enough and put it on reclaim
> list is not necessary. I agree.
> 

+1

> >>  
> >>  		spin_unlock(&cache->lock);
> >> +		if (reclaim)
> >> +			btrfs_resume_periodic_reclaim(space_info);
> > 
> > This also makes me wonder about the idea behind your change. If you want
> > periodic reclaim to "pause" until a block group meets the condition and
> > then we "resume", that's not exactly in the spirit of checking at a
> > periodic cadence, rather than as block groups update.
> 
> Because IMO we need periodic reclaim to reclaim this blockgroup. So if it's
> paused, resume here so the next periodic reclaim will handle this blockgroup.
> 

As argued above, we want to resume when there is space in the "target"
block groups, not an exciting "source". Otherwise the failed stuck case
can keep hammering hopelessly.

> >>  		spin_unlock(&space_info->lock);
> >>  
> >>  		btrfs_set_extent_bit(&trans->transaction->pinned_extents, bytenr,
> >> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> >> index 7b7b7255f7d8..de8bde1081be 100644
> >> --- a/fs/btrfs/space-info.c
> >> +++ b/fs/btrfs/space-info.c
> >> @@ -2119,48 +2119,44 @@ static void do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
> >>  	 * really need a block group, do take a fresh one.
> >>  	 */
> >>  	if (try_again && urgent) {
> >> -		try_again = false;
> >> +		urgent = false;
> >>  		goto again;
> >>  	}
> >>  
> >>  	up_read(&space_info->groups_sem);
> >> -}
> >> -
> >> -void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes)
> >> -{
> >> -	u64 chunk_sz = calc_effective_data_chunk_size(space_info->fs_info);
> >> -
> >> -	lockdep_assert_held(&space_info->lock);
> >> -	space_info->reclaimable_bytes += bytes;
> >>  
> >> -	if (space_info->reclaimable_bytes >= chunk_sz)
> >> -		btrfs_set_periodic_reclaim_ready(space_info, true);
> >> -}
> >> -
> >> -void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready)
> >> -{
> >> -	lockdep_assert_held(&space_info->lock);
> >> -	if (!READ_ONCE(space_info->periodic_reclaim))
> >> -		return;
> >> -	if (ready != space_info->periodic_reclaim_ready) {
> >> -		space_info->periodic_reclaim_ready = ready;
> >> -		if (!ready)
> >> -			space_info->reclaimable_bytes = 0;
> >> +	/*
> >> +	 * Temporary pause periodic reclaim until reclaim make some progress.
> >> +	 * This can prevent periodic reclaim keep happening but make no progress.
> >> +	 */
> >> +	if (!try_again) {
> >> +		spin_lock(&space_info->lock);
> >> +		btrfs_pause_periodic_reclaim(space_info);
> >> +		spin_unlock(&space_info->lock);
> >>  	}
> >>  }
> >>  
> >>  static bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info)
> >>  {
> >>  	bool ret;
> >> +	u64 chunk_sz;
> >> +	u64 unused;
> >>  
> >>  	if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
> >>  		return false;
> >>  	if (!READ_ONCE(space_info->periodic_reclaim))
> >>  		return false;
> >> +	if (!READ_ONCE(space_info->periodic_reclaim_paused))
> >> +		return true;
> > 
> > This condition doesn't feel like a "pause". If the "pause" is set to
> > false, then we proceed no matter what, otherwise we check conditions? It
> > should be something like:
> 
> > if (READ_ONCE(space_info->periodic_reclaim_force))
> >         return true;
> > 
> > as it is acting like a "force" not a "not paused".
> > 
> > Hope that makes sense, it's obviously a bit tied up in language semantics..
> 
> Because assumption 2. if it's not paused, periodic reclaim should happen
> periodically. I use "pause" here because I expect periodic reclaim happen
> periodically, and setting "pause" to true express that the periodic reclaim
> should not keep ticking.
> 
> >> +
> >> +	chunk_sz = calc_effective_data_chunk_size(space_info->fs_info);
> >>  
> >>  	spin_lock(&space_info->lock);
> >> -	ret = space_info->periodic_reclaim_ready;
> >> -	btrfs_set_periodic_reclaim_ready(space_info, false);
> >> +	unused = space_info->total_bytes - space_info->bytes_used;
> > 
> > Probably makes sense to use a zoned aware wrapper for this, just in
> > case we make this friendly with zones eventually.
> > 
> >> +	ret = (unused >= space_info->last_reclaim_unused + chunk_sz ||
> >> +	       btrfs_calc_reclaim_threshold(space_info) != space_info->last_reclaim_threshold);
> > 
> > This second condition is quite interesting to me, I hadn't thought of
> > it. I think it makes some sense to care again if the threshold changed,
> > but it is a new behavior, rather than a fix, per-se.
> > 
> > What made you want to add this?
> 
> To make dynamic periodic reclaim work but it will also help for fixed threshold
> when it's changed :)
> 
> >> +	if (ret)
> >> +		btrfs_resume_periodic_reclaim(space_info);
> >>  	spin_unlock(&space_info->lock);
> >>  
> >>  	return ret;
> >> diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
> >> index 0703f24b23f7..a49a4c7b0a68 100644
> >> --- a/fs/btrfs/space-info.h
> >> +++ b/fs/btrfs/space-info.h
> >> @@ -214,14 +214,11 @@ struct btrfs_space_info {
> >>  
> >>  	/*
> >>  	 * Periodic reclaim should be a no-op if a space_info hasn't
> >> -	 * freed any space since the last time we tried.
> >> +	 * freed any space since the last time we made no progress.
> >>  	 */
> >> -	bool periodic_reclaim_ready;
> >> -
> >> -	/*
> >> -	 * Net bytes freed or allocated since the last reclaim pass.
> >> -	 */
> >> -	s64 reclaimable_bytes;
> >> +	bool periodic_reclaim_paused;
> >> +	int last_reclaim_threshold;
> >> +	u64 last_reclaim_unused;
> >>  };
> >>  
> >>  static inline bool btrfs_mixed_space_info(const struct btrfs_space_info *space_info)
> >> @@ -301,9 +298,22 @@ void btrfs_dump_space_info_for_trans_abort(struct btrfs_fs_info *fs_info);
> >>  void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info);
> >>  u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
> >>  
> >> -void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes);
> >> -void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready);
> >>  int btrfs_calc_reclaim_threshold(const struct btrfs_space_info *space_info);
> >> +static inline void btrfs_resume_periodic_reclaim(struct btrfs_space_info *space_info)
> >> +{
> >> +	lockdep_assert_held(&space_info->lock);
> >> +	if (space_info->periodic_reclaim_paused)
> >> +		space_info->periodic_reclaim_paused = false;
> >> +}
> >> +static inline void btrfs_pause_periodic_reclaim(struct btrfs_space_info *space_info)
> >> +{
> >> +	lockdep_assert_held(&space_info->lock);
> >> +	if (!space_info->periodic_reclaim_paused) {
> >> +		space_info->periodic_reclaim_paused = true;
> >> +		space_info->last_reclaim_threshold = btrfs_calc_reclaim_threshold(space_info);
> >> +		space_info->last_reclaim_unused = space_info->total_bytes - space_info->bytes_used;
> >> +	}
> >> +}
> >>  void btrfs_reclaim_sweep(const struct btrfs_fs_info *fs_info);
> >>  void btrfs_return_free_space(struct btrfs_space_info *space_info, u64 len);
> >>  
> >> -- 
> >> 2.51.2
> >>
> 
> Thanks,
> Sun Yangkai
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/7] btrfs: fix periodic reclaim condition
  2026-01-05 18:21       ` Boris Burkov
@ 2026-01-07 14:09         ` Sun Yangkai
  2026-01-07 17:57           ` Boris Burkov
  0 siblings, 1 reply; 16+ messages in thread
From: Sun Yangkai @ 2026-01-07 14:09 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs



On 2026/1/6 02:21, Boris Burkov wrote:

>> And we should continue periodic reclaim when we may be able to free more
>> blockgroups.
>>
> 
> This assumption I would push back on. This was my initial assumption as
> well, but it is worth challenging. At Meta, we used to use the automatic
> threshold based reclaim that reclaims every bg that goes over threshold
> (25% for us) then back under. This actually creates a huge amount of
> reclaim and was the reason for even coming up with dynamic thresholds.
> In practice, switching to the buggy dynamic+periodic drastically reduced
> this extra reclaim.

It seems hard to make periodic reclaim work well in practice with a fixed
threshold; maybe we can merge the two into a single "auto reclaim" mode in the
future.

> Suppose we allocate some block extents / block groups then free half the
> space and fragment them. You might think it's best to rapidly compact
> everything. But as long as we have unallocated and the holes are big
> enough for allocations to come through and re-fill them, it's mostly OK.

I agree. When we have enough unallocated space, fragmentation of the free space
is not that bad, and it may disappear on its own as future allocations fill the
holes.

> We much more want to do a "minimal" amount of reclaim until we truly
> need it.

And I guess we also don't want a "cliff", or we'll only try once with a high
threshold when "urgent" instead of using a dynamic threshold :)

> With all that said, your points above about the threshold going up but
> reclaim staying asleep make a lot of sense to me. That is a signal that
> the "need" is increasing.

>> For pausing, it's hard to tell if we're freeing more unallocated directly
>> without changing current code. But in btrfs_reclaim_bgs_work() we have
>>
>> btrfs_relocate_chunk()
>>   ->btrfs_remove_chunk()
>>     ->btrfs_remove_block_group()
>>
>> So the old blockgroup is removed and we can see that in space_info->total_bytes.
>>
>> Let's start with all the blockgroups having the same size and reclaiming a
>> single blockgroup. If we succeeded moving extents into existing blockgroups,
>> we'll not allocate a new one and space_info->total_bytes is getting smaller
>> because we've freed the old one. If we failed moving things into existing bgs
>> and allocated a new one, then space_info->total_bytes will not change.
> 
> Some of my comments on this point were silly, I was thinking about
> used_bytes which would not change, not total_bytes. Thanks for the
> detailed description to knock me out of that mistake.
> 
> Also, I'm not sure the logic about total_bytes is 100% ironclad, as
> I haven't carefully checked what block group allocation/freeing is
> possible concurrently over the span where you track the sizes.
> 
>>
>> When reclaiming n blockgroups, as long as we're using less space, such as n-1
>> new blockgroups are allocated or n new smaller blockgroups, we're making some
>> progress and able to detect that from space_info->total_bytes.
>>
>> Reclaim may racing with some other work that free/allocate blockgroups, which
>> will lead to either an unnecessary extra periodic reclaim or an unexpected
>> pause. An extra periodic reclaim is not a big thing. But the unexpected pause
>> matters. This is the only regression currently and I think that will not happen
>> frequently.
> 
> OK, we're on the same page with the possibility of some racing :)
> Your explanation makes sense and I don't think it's a huge deal or
> anything.
> 
>>
>> For continuing, rather than tracing used_bytes, I think we should care more
>> about unused_bytes. This is inspired by calc_dynamic_reclaim_threshold(), which
>> will return 0 if there's no enough unused space. And also care about
>> reclaim_threshold to make dynamic periodic reclaim work.
>>
>>
> 
> First, I think let's agree what we want and don't want to pause:
> I don't think we care about running the reclaim sweep and checking
> thresholds, updating marks, etc. That is relatively cheap. So if we
> change the pausing logic to still allow that work more often, it's OK.
> We really want to prevent triggering reclaims in tight fail loops.

> With that said, let's consider all the cases again.
> 
> 1. We scan all the bgs and don't attempt any reclaims.
> 2. We attempt a reclaim of BG X at threshold T and it completely fails to
> allocate a new home for extent E of X.
> 3. We attempt a reclaim of BG X at threshold T and it "succeeds" by
> allocating a new BG Y and moving all the extents from X to Y.
> 4. We attempt a reclaim of BG X at threshold T and it makes proper
> progress and puts some the extents of X into the existing bgs.
> 
> After case 1, I think we agree pausing isn't needed, and is in fact bad.

Yes!

> After cases 2 and 3, we want to avoid triggering more reclaim until we
> have evidence that it is likely a reclaim could possibly succeed. If
> unused gets worse since then, and the threshold goes up, it is
> *possible* a reclaim will succeed (we might pick a more full block group
> that happens to be full of smaller easy to allocate extents) but it is
> much more likely it won't. As far as I can tell, the best signal for
> unpausing reclaim for the "we have seen a reclaim fail" case is still
> "relative freeing" of extents.

> After case 4, we don't *need* to pause, as it is going to increase
> unallocated and reduce the "pressure" on the reclaim system, but it
> doesn't fundamentally hurt either. We made some progress and we want to
> be conservative and not be too greedy. Let allocation breathe and do
> some work to fill the gaps up. However, this case is quite different
> from 2 and 3 in that a subsequent relocation seems likely enough to also
> succeed.

Yes. I started as a user of this function, monitoring the threshold, the block
groups' used rates, and the filesystem's unallocated space. So I got confused
when there were still some block groups that could be reclaimed but periodic
reclaim stopped working.

You, on the other hand, as the author of the function, focus on solving the
underlying problem (running out of unallocated space), so just reclaiming some
block groups is enough and we don't need to try that hard.

> My original (designed) logic was "pause on 2,3,4; unpause on the
> conditions that unstick 2,3" which I think you have correctly argued is
> suboptimal. 4 can get stuck never reclaiming even as things get worse,
> even if we might succeed (we've never failed..)
> 
> So I think the best (short term) solution is:
> "detect failure modes 2 and 3 and unpause on release"
> 
> I believe your patches already enact "detect failure modes 2 and 3" and
> we just disagree on the unpausing mechanism.

For unpausing, I think we both agree on

1. enough space freed
2. threshold changed to a larger value because of the increased "need"
3. do not pause or unpause on case 4

I agree other unpausing conditions are not necessary.
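Those two agreed triggers could look roughly like this (field and function names here are illustrative, not necessarily the series' final API):

```c
#include <stdbool.h>
#include <stdint.h>

/* State captured at the moment periodic reclaim was paused. */
struct model_pause_state {
	uint64_t last_reclaim_unused;	/* total_bytes - bytes_used at pause */
	int last_reclaim_threshold;	/* reclaim threshold at pause */
};

/*
 * Condition 1: at least a chunk's worth of space became unused since
 * the pause.  Condition 2: the (dynamic) threshold rose, i.e. the
 * "need" for reclaim increased.
 */
bool model_unpause_agreed(const struct model_pause_state *s, uint64_t unused,
			  uint64_t chunk_sz, int threshold)
{
	if (unused >= s->last_reclaim_unused + chunk_sz)
		return true;
	return threshold > s->last_reclaim_threshold;
}
```

Note this uses `>` rather than `!=` for the threshold, matching condition 2 above (only a rising threshold signals increased need).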


>>> I think the simplest pure "fix" is to change the current "ready = false"
>>> setting from unconditional on every run of periodic reclaim to only when
>>> a sweep actually queues a block group for reclaim. I have compiled but
>>> not yet tested the code I included in this mail :) We can do that more
>>> carefully once we agree how we want to move forward.
>>>
>>> If you still want to re-design things on top of a fix like this one, I
>>> am totally open to that.
>>>
>>> Thanks again for all your work on this, and happy new year,
>>> Boris


>>> @@ -2176,7 +2181,6 @@ static bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info)
>>>
>>>  	spin_lock(&space_info->lock);
>>>  	ret = space_info->periodic_reclaim_ready;
>>> -	btrfs_set_periodic_reclaim_ready(space_info, false);
>>>  	spin_unlock(&space_info->lock);
>>>
>>>  	return ret;
>>> @@ -2190,8 +2194,10 @@ void btrfs_reclaim_sweep(const struct btrfs_fs_info *fs_info)
>>>  	list_for_each_entry(space_info, &fs_info->space_info, list) {
>>>  		if (!btrfs_should_periodic_reclaim(space_info))
>>>  			continue;
>>> -		for (raid = 0; raid < BTRFS_NR_RAID_TYPES; raid++)
>>> -			do_reclaim_sweep(space_info, raid);
>>> +		for (raid = 0; raid < BTRFS_NR_RAID_TYPES; raid++) {
>>> +			if (do_reclaim_sweep(space_info, raid))
>>> +				btrfs_set_periodic_reclaim_ready(space_info, false);
>>> +		}
>>
>> Even if we have blockgroups to reclaim in this turn, we still mark some
>> bg->reclaim_mark and expect the next periodic reclaim, which will not run, might
>> reclaim them.
> 
> Why won't it run? We only pause if we truly called
> btrfs_mark_bg_to_reclaim() (the word mark is unfortunately overloaded
> here... that function really queues the bg for a reclaim attempt)

After userspace stops writing, in my experience we still need two periodic
reclaim passes to finish the work, and each of them reclaims some of the bgs.
But with this patch, after the first pass reclaims some bgs, the second one
will not run. We might reclaim less, and the behavior is a little different;
that is what I called a regression. But if we only care about "just don't run
out of unallocated space", this patch seems fine.

>>>  	}
>>>  }
>>
>> So I'm afraid this patch will introduce obvious regression on periodic reclaim.
I called it a regression here :)
>> Add the "changed threshold" condition can alleviate this when dynamic reclaim is
>> also enabled but will not work for fixed threshold. So I failed to come up with
>> a simple fix patch with no obvious regression :(
>>
> 
> Can you share a description of any workloads you've been testing
> against? I think that would also help frame the discussion to avoid me
> talking too much :D

I test this feature on my BT machine: download, fill the fs until nearly full,
delete some files, and repeat...

> The regression I can foresee here is that the successful passes will
> pause for too long (until the next chunk_sz of freeing)

> We can also relax that condition to something more like an extent size
> (1M or BTRFS_MAX_EXTENT_SIZE perhaps) to make it a little gentler of a pause.
> 
> We can also unpause on "urgent". I think that would likely be
> sufficient. I'll do some actual experiments.. :)

I love these two ideas and I'm looking forward to your experiment results :)

Thanks,
Sun YangKai

> Thanks,
> Boris
> 
>>>> CC: Boris Burkov <boris@bur.io>
>>>> Fixes: 813d4c6422516 ("btrfs: prevent pathological periodic reclaim loops")
>>>> Signed-off-by: Sun YangKai <sunk67188@gmail.com>
>>>> ---
>>>>  fs/btrfs/block-group.c | 15 +++++++-------
>>>>  fs/btrfs/space-info.c  | 44 +++++++++++++++++++-----------------------
>>>>  fs/btrfs/space-info.h  | 28 ++++++++++++++++++---------
>>>>  3 files changed, 46 insertions(+), 41 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>>>> index e417aba4c4c7..94a4068cd42a 100644
>>>> --- a/fs/btrfs/block-group.c
>>>> +++ b/fs/btrfs/block-group.c
>>>> @@ -1871,6 +1871,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
>>>>  	while (!list_empty(&fs_info->reclaim_bgs)) {
>>>>  		u64 used;
>>>>  		u64 reserved;
>>>> +		u64 old_total;
>>>>  		int ret = 0;
>>>>  
>>>>  		bg = list_first_entry(&fs_info->reclaim_bgs,
>>>> @@ -1936,6 +1937,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
>>>>  		}
>>>>  
>>>>  		spin_unlock(&bg->lock);
>>>> +		old_total = space_info->total_bytes;
>>>>  		spin_unlock(&space_info->lock);
>>>>  
>>>>  		/*
>>>> @@ -1988,14 +1990,14 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
>>>>  			reserved = 0;
>>>>  			spin_lock(&space_info->lock);
>>>>  			space_info->reclaim_errors++;
>>>> -			if (READ_ONCE(space_info->periodic_reclaim))
>>>> -				space_info->periodic_reclaim_ready = false;
>>>>  			spin_unlock(&space_info->lock);
>>>>  		}
>>>>  		spin_lock(&space_info->lock);
>>>>  		space_info->reclaim_count++;
>>>>  		space_info->reclaim_bytes += used;
>>>>  		space_info->reclaim_bytes += reserved;
>>>> +		if (space_info->total_bytes < old_total)
>>>> +			btrfs_resume_periodic_reclaim(space_info);
>>>
>>> Why is this here? We've just completed a reclaim, which I would expect
>>> to be neutral to the space_info->total_bytes (just moving them around).
>>> So if (any) unrelated freeing happens to happen while we are reclaiming,
>>> we resume? Doesn't seem wrong, but also seems a little specific and
>>> random. I am probably missing some aspect of your new design.
>>>
>>> Put a different way, what is special about frees that happen *while* we
>>> are reclaiming?
>>
>> As explained, I want to use this to detect if the reclaim freed some unallocated
>> space.
>>
> 
> Makes sense now.
> 
>>>>  		spin_unlock(&space_info->lock);
>>>>  
>>>>  next:
>>>> @@ -3730,8 +3732,6 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
>>>>  		space_info->bytes_reserved -= num_bytes;
>>>>  		space_info->bytes_used += num_bytes;
>>>>  		space_info->disk_used += num_bytes * factor;
>>>> -		if (READ_ONCE(space_info->periodic_reclaim))
>>>> -			btrfs_space_info_update_reclaimable(space_info, -num_bytes);
>>>>  		spin_unlock(&cache->lock);
>>>>  		spin_unlock(&space_info->lock);
>>>>  	} else {
>>>> @@ -3741,12 +3741,11 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
>>>>  		btrfs_space_info_update_bytes_pinned(space_info, num_bytes);
>>>>  		space_info->bytes_used -= num_bytes;
>>>>  		space_info->disk_used -= num_bytes * factor;
>>>> -		if (READ_ONCE(space_info->periodic_reclaim))
>>>> -			btrfs_space_info_update_reclaimable(space_info, num_bytes);
>>>> -		else
>>>> -			reclaim = should_reclaim_block_group(cache, num_bytes);
>>>> +		reclaim = should_reclaim_block_group(cache, num_bytes);
>>>
>>> I think this is a bug with periodic_reclaim == 1
>>>
>>> In that case, if should_reclaim_block_group() returns true (could be a
>>> fixed or dynamic threshold), we will put that block group directly on
>>> the reclaim list, which is a complete contradiction of the point of
>>> periodic_reclaim.
>>
>> I guess you mean just resume periodic reclaim is enough and put it on reclaim
>> list is not necessary. I agree.
>>
> 
> +1
> 
>>>>  
>>>>  		spin_unlock(&cache->lock);
>>>> +		if (reclaim)
>>>> +			btrfs_resume_periodic_reclaim(space_info);
>>>
>>> This also makes me wonder about the idea behind your change. If you want
>>> periodic reclaim to "pause" until a block group meets the condition and
>>> then we "resume", that's not exactly in the spirit of checking at a
>>> periodic cadence, rather than as block groups update.
>>
>> Because IMO we need periodic reclaim to reclaim this blockgroup. So if it's
>> paused, resume here so the next periodic reclaim will handle this blockgroup.
>>
> 
> As argued above, we want to resume when there is space in the "target"
> block groups, not an exciting "source". Otherwise the failed stuck case
> can keep hammering hopelessly.

Makes sense.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/7] btrfs: fix periodic reclaim condition
  2026-01-07 14:09         ` Sun Yangkai
@ 2026-01-07 17:57           ` Boris Burkov
  2026-01-08 15:11             ` Sun Yangkai
  0 siblings, 1 reply; 16+ messages in thread
From: Boris Burkov @ 2026-01-07 17:57 UTC (permalink / raw)
  To: Sun Yangkai; +Cc: linux-btrfs

On Wed, Jan 07, 2026 at 10:09:24PM +0800, Sun Yangkai wrote:
> 
> 
> On 2026/1/6 02:21, Boris Burkov wrote:
> 
> >> And we should continue periodic reclaim when we may be able to free more
> >> blockgroups.
> >>
> > 
> > This assumption I would push back on. This was my initial assumption as
> > well, but it is worth challenging. At Meta, we used to use the automatic
> > threshold based reclaim that reclaims every bg that goes over threshold
> > (25% for us) then back under. This actually creates a huge amount of
> > reclaim and was the reason for even coming up with dynamic thresholds.
> > In practice, switching to the buggy dynamic+periodic drastically reduced
> > this extra reclaim.
> 
> It seems hard to make periodic reclaim work well in practice with a fixed
> threshold; maybe we can merge the two into a single "auto reclaim" mode in
> the future.
> 
> > Suppose we allocate some block extents / block groups then free half the
> > space and fragment them. You might think it's best to rapidly compact
> > everything. But as long as we have unallocated and the holes are big
> > enough for allocations to come through and re-fill them, it's mostly OK.
> 
> I agree. When we have enough unallocated space, fragmentation of the free
> space is not that bad, and it may disappear on its own as future allocations
> fill the holes.
> 
> > We much more want to do a "minimal" amount of reclaim until we truly
> > need it.
> 
> And I guess we also don't want a "cliff", or we'll only try once with a high
> threshold when "urgent" instead of using a dynamic threshold :)
> 
> > With all that said, your points above about the threshold going up but
> > reclaim staying asleep make a lot of sense to me. That is a signal that
> > the "need" is increasing.
> 
> >> For pausing, it's hard to tell if we're freeing more unallocated directly
> >> without changing current code. But in btrfs_reclaim_bgs_work() we have
> >>
> >> btrfs_relocate_chunk()
> >>   ->btrfs_remove_chunk()
> >>     ->btrfs_remove_block_group()
> >>
> >> So the old blockgroup is removed and we can see that in space_info->total_bytes.
> >>
> >> Let's start with all the blockgroups having the same size and reclaiming a
> >> single blockgroup. If we succeeded moving extents into existing blockgroups,
> >> we'll not allocate a new one and space_info->total_bytes is getting smaller
> >> because we've freed the old one. If we failed moving things into existing bgs
> >> and allocated a new one, then space_info->total_bytes will not change.
> > 
> > Some of my comments on this point were silly, I was thinking about
> > used_bytes which would not change, not total_bytes. Thanks for the
> > detailed description to knock me out of that mistake.
> > 
> > Also, I'm not sure the logic about total_bytes is 100% ironclad, as
> > I haven't carefully checked what block group allocation/freeing is
> > possible concurrently over the span where you track the sizes.
> > 
> >>
> >> When reclaiming n blockgroups, as long as we're using less space, such as n-1
> >> new blockgroups are allocated or n new smaller blockgroups, we're making some
> >> progress and able to detect that from space_info->total_bytes.
> >>
> >> Reclaim may racing with some other work that free/allocate blockgroups, which
> >> will lead to either an unnecessary extra periodic reclaim or an unexpected
> >> pause. An extra periodic reclaim is not a big thing. But the unexpected pause
> >> matters. This is the only regression currently and I think that will not happen
> >> frequently.
> > 
> > OK, we're on the same page with the possibility of some racing :)
> > Your explanation makes sense and I don't think it's a huge deal or
> > anything.
> > 
> >>
> >> For continuing, rather than tracing used_bytes, I think we should care more
> >> about unused_bytes. This is inspired by calc_dynamic_reclaim_threshold(), which
> >> will return 0 if there's no enough unused space. And also care about
> >> reclaim_threshold to make dynamic periodic reclaim work.
> >>
> >>
> > 
> > First, I think let's agree what we want and don't want to pause:
> > I don't think we care about running the reclaim sweep and checking
> > thresholds, updating marks, etc. That is relatively cheap. So if we
> > change the pausing logic to still allow that work more often, it's OK.
> > We really want to prevent triggering reclaims in tight fail loops.
> 
> > With that said, let's consider all the cases again.
> > 
> > 1. We scan all the bgs and don't attempt any reclaims.
> > 2. We attempt a reclaim of BG X at threshold T and it completely fails to
> > allocate a new home for extent E of X.
> > 3. We attempt a reclaim of BG X at threshold T and it "succeeds" by
> > allocating a new BG Y and moving all the extents from X to Y.
> > 4. We attempt a reclaim of BG X at threshold T and it makes proper
> > progress, putting some of the extents of X into existing bgs.
> > 
> > After case 1, I think we agree pausing isn't needed, and is in fact bad.
> 
> Yes!
> 
> > After cases 2 and 3, we want to avoid triggering more reclaim until we
> > have evidence that a reclaim could plausibly succeed. If unused space
> > gets worse after that, and the threshold goes up, it is
> > *possible* a reclaim will succeed (we might pick a more full block group
> > that happens to be full of smaller, easy-to-allocate extents) but it is
> > much more likely it won't. As far as I can tell, the best signal for
> > unpausing reclaim in the "we have seen a reclaim fail" case is still
> > "relative freeing" of extents.
> 
> > After case 4, we don't *need* to pause, as it is going to increase
> > unallocated space and reduce the "pressure" on the reclaim system, but it
> > doesn't fundamentally hurt either. We made some progress and we want to
> > be conservative and not too greedy. Let allocation breathe and do
> > some work to fill the gaps. However, this case is quite different
> > from 2 and 3 in that a subsequent relocation seems likely enough to also
> > succeed.
> 
> Yes. I started as a user of this function, monitoring the threshold,
> the block groups' usage rates, and the filesystem's unallocated space. So I got
> confused when there were still some block groups that could be reclaimed but
> periodic reclaim stopped working.
> 
> You, on the other hand, are the author of the function and focus on the
> underlying problem (running out of unallocated space), so reclaiming just some
> of the block groups is OK and we don't need to try that hard.
> 
> > My original (designed) logic was "pause on 2,3,4; unpause on the
> > conditions that unstick 2,3" which I think you have correctly argued is
> > suboptimal. Case 4 can get stuck never reclaiming even as things get worse,
> > even though we might succeed (we've never failed...).
> > 
> > So I think the best (short term) solution is:
> > "detect failure modes 2 and 3 and unpause on release"
> > 
> > I believe your patches already enact "detect failure modes 2 and 3" and
> > we just disagree on the unpausing mechanism.
> 
> For unpausing, I think we both agree on
> 
> 1. enough space freed
> 2. threshold changed to a larger value because of an increase in "need"
> 3. do not pause or unpause on case 4
> 
> I agree other unpausing conditions are not necessary.
> 
> 

Given this and your other comments below about my proposed patch,
would you like to send a v3 which mixes our two approaches and does
something like:

- no new "pause" concept, keep the reclaim_ready (like my patch)
- only set ready = false in cases 2 and 3 (like your patch)

In that case we will *not* stop on pure success, and we will stop on
both failures modes.

Going forward on top of that with deeper testing, there can be more changes
like "special pause on 4 which gets unpaused on threshold change" or
something. Or "always unpause on threshold change" and of course more radical
ideas.

I'd just like to keep moving towards at least fixing the obvious
horrible bug.

Thanks,
Boris

> >>> I think the simplest pure "fix" is to change the current "ready = false"
> >>> setting from unconditional on every run of periodic reclaim to only when
> >>> a sweep actually queues a block group for reclaim. I have compiled but
> >>> not yet tested the code I included in this mail :) We can do that more
> >>> carefully once we agree how we want to move forward.
> >>>
> >>> If you still want to re-design things on top of a fix like this one, I
> >>> am totally open to that.
> >>>
> >>> Thanks again for all your work on this, and happy new year,
> >>> Boris
> 
> 
> >>> @@ -2176,7 +2181,6 @@ static bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info)
> >>>
> >>>  	spin_lock(&space_info->lock);
> >>>  	ret = space_info->periodic_reclaim_ready;
> >>> -	btrfs_set_periodic_reclaim_ready(space_info, false);
> >>>  	spin_unlock(&space_info->lock);
> >>>
> >>>  	return ret;
> >>> @@ -2190,8 +2194,10 @@ void btrfs_reclaim_sweep(const struct btrfs_fs_info *fs_info)
> >>>  	list_for_each_entry(space_info, &fs_info->space_info, list) {
> >>>  		if (!btrfs_should_periodic_reclaim(space_info))
> >>>  			continue;
> >>> -		for (raid = 0; raid < BTRFS_NR_RAID_TYPES; raid++)
> >>> -			do_reclaim_sweep(space_info, raid);
> >>> +		for (raid = 0; raid < BTRFS_NR_RAID_TYPES; raid++) {
> >>> +			if (do_reclaim_sweep(space_info, raid))
> >>> +				btrfs_set_periodic_reclaim_ready(space_info, false);
> >>> +		}
> >>
> >> Even if we have block groups to reclaim in this turn, we still set some
> >> bg->reclaim_mark and expect that the next periodic reclaim, which will not
> >> run, might reclaim them.
> > 
> > Why won't it run? We only pause if we truly called
> > btrfs_mark_bg_to_reclaim() (the word mark is unfortunately overloaded
> > here... that function really queues the bg for a reclaim attempt)
> 
> After userspace stops writing, we still need two periodic reclaims to finish our
> work, and in my experience each of them will reclaim some of the bgs. But with
> this patch, after the first one reclaims some bgs, the second one will not run.
> We might reclaim less, and the behavior is a little different. I called this a
> regression. But if we focus on "just don't run out of unallocated space", this
> patch seems fine.
> 
> >>>  	}
> >>>  }
> >>
> >> So I'm afraid this patch will introduce an obvious regression on periodic reclaim.
> I called it regression here :)
> >> Adding the "changed threshold" condition can alleviate this when dynamic
> >> reclaim is also enabled, but it will not work for a fixed threshold. So I
> >> failed to come up with a simple fix patch with no obvious regression :(
> >>
> > 
> > Can you share a description of any workloads you've been testing
> > against? I think that would also help frame the discussion to avoid me
> > talking too much :D
> 
> I test this feature on my BT machine: download, fill the fs to nearly full,
> delete some files, and repeat...
> 
> > The regression I can foresee here is that the successful passes will
> > pause for too long (until the next chunk_sz of freeing)
> 
> > We can also relax that condition to something more like an extent size
> > (1M or BTRFS_MAX_EXTENT_SIZE perhaps) to make it a little gentler of a pause.
> > 
> > We can also unpause on "urgent". I think that would likely be
> > sufficient. I'll do some actual experiments.. :)
> 
> I love these two ideas and am looking forward to your experiment results :)
> 
> Thanks,
> Sun YangKai
> 
> > Thanks,
> > Boris
> > 
> >>>> CC: Boris Burkov <boris@bur.io>
> >>>> Fixes: 813d4c6422516 ("btrfs: prevent pathological periodic reclaim loops")
> >>>> Signed-off-by: Sun YangKai <sunk67188@gmail.com>
> >>>> ---
> >>>>  fs/btrfs/block-group.c | 15 +++++++-------
> >>>>  fs/btrfs/space-info.c  | 44 +++++++++++++++++++-----------------------
> >>>>  fs/btrfs/space-info.h  | 28 ++++++++++++++++++---------
> >>>>  3 files changed, 46 insertions(+), 41 deletions(-)
> >>>>
> >>>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> >>>> index e417aba4c4c7..94a4068cd42a 100644
> >>>> --- a/fs/btrfs/block-group.c
> >>>> +++ b/fs/btrfs/block-group.c
> >>>> @@ -1871,6 +1871,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> >>>>  	while (!list_empty(&fs_info->reclaim_bgs)) {
> >>>>  		u64 used;
> >>>>  		u64 reserved;
> >>>> +		u64 old_total;
> >>>>  		int ret = 0;
> >>>>  
> >>>>  		bg = list_first_entry(&fs_info->reclaim_bgs,
> >>>> @@ -1936,6 +1937,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> >>>>  		}
> >>>>  
> >>>>  		spin_unlock(&bg->lock);
> >>>> +		old_total = space_info->total_bytes;
> >>>>  		spin_unlock(&space_info->lock);
> >>>>  
> >>>>  		/*
> >>>> @@ -1988,14 +1990,14 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> >>>>  			reserved = 0;
> >>>>  			spin_lock(&space_info->lock);
> >>>>  			space_info->reclaim_errors++;
> >>>> -			if (READ_ONCE(space_info->periodic_reclaim))
> >>>> -				space_info->periodic_reclaim_ready = false;
> >>>>  			spin_unlock(&space_info->lock);
> >>>>  		}
> >>>>  		spin_lock(&space_info->lock);
> >>>>  		space_info->reclaim_count++;
> >>>>  		space_info->reclaim_bytes += used;
> >>>>  		space_info->reclaim_bytes += reserved;
> >>>> +		if (space_info->total_bytes < old_total)
> >>>> +			btrfs_resume_periodic_reclaim(space_info);
> >>>
> >>> Why is this here? We've just completed a reclaim, which I would expect
> >>> to be neutral to the space_info->total_bytes (just moving them around).
> >>> So if (any) unrelated freeing happens to happen while we are reclaiming,
> >>> we resume? Doesn't seem wrong, but also seems a little specific and
> >>> random. I am probably missing some aspect of your new design.
> >>>
> >>> Put a different way, what is special about frees that happen *while* we
> >>> are reclaiming?
> >>
> >> As explained, I want to use this to detect if the reclaim freed some unallocated
> >> space.
> >>
> > 
> > Makes sense now.
> > 
> >>>>  		spin_unlock(&space_info->lock);
> >>>>  
> >>>>  next:
> >>>> @@ -3730,8 +3732,6 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> >>>>  		space_info->bytes_reserved -= num_bytes;
> >>>>  		space_info->bytes_used += num_bytes;
> >>>>  		space_info->disk_used += num_bytes * factor;
> >>>> -		if (READ_ONCE(space_info->periodic_reclaim))
> >>>> -			btrfs_space_info_update_reclaimable(space_info, -num_bytes);
> >>>>  		spin_unlock(&cache->lock);
> >>>>  		spin_unlock(&space_info->lock);
> >>>>  	} else {
> >>>> @@ -3741,12 +3741,11 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> >>>>  		btrfs_space_info_update_bytes_pinned(space_info, num_bytes);
> >>>>  		space_info->bytes_used -= num_bytes;
> >>>>  		space_info->disk_used -= num_bytes * factor;
> >>>> -		if (READ_ONCE(space_info->periodic_reclaim))
> >>>> -			btrfs_space_info_update_reclaimable(space_info, num_bytes);
> >>>> -		else
> >>>> -			reclaim = should_reclaim_block_group(cache, num_bytes);
> >>>> +		reclaim = should_reclaim_block_group(cache, num_bytes);
> >>>
> >>> I think this is a bug with periodic_reclaim == 1
> >>>
> >>> In that case, if should_reclaim_block_group() returns true (could be a
> >>> fixed or dynamic threshold), we will put that block group directly on
> >>> the reclaim list, which is a complete contradiction of the point of
> >>> periodic_reclaim.
> >>
> >> I guess you mean just resuming periodic reclaim is enough and putting it on
> >> the reclaim list is not necessary. I agree.
> >>
> > 
> > +1
> > 
> >>>>  
> >>>>  		spin_unlock(&cache->lock);
> >>>> +		if (reclaim)
> >>>> +			btrfs_resume_periodic_reclaim(space_info);
> >>>
> >>> This also makes me wonder about the idea behind your change. If you want
> >>> periodic reclaim to "pause" until a block group meets the condition and
> >>> then we "resume", that's not exactly in the spirit of checking at a
> >>> periodic cadence, rather than as block groups update.
> >>
> >> Because IMO we need periodic reclaim to reclaim this block group. So if it's
> >> paused, resume here so that the next periodic reclaim will handle it.
> >>
> > 
> > As argued above, we want to resume when there is space in the "target"
> > block groups, not an exciting "source". Otherwise the failed stuck case
> > can keep hammering hopelessly.
> 
> Makes sense.
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/7] btrfs: fix periodic reclaim condition
  2026-01-07 17:57           ` Boris Burkov
@ 2026-01-08 15:11             ` Sun Yangkai
  0 siblings, 0 replies; 16+ messages in thread
From: Sun Yangkai @ 2026-01-08 15:11 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs



在 2026/1/8 01:57, Boris Burkov 写道:
> Given this and your other comments below about my proposed patch,
> would you like to send a v3 which mixes our two approaches and does
> something like:
> 
> - no new "pause" concept, keep the reclaim_ready (like my patch)
> - only set ready = false in cases 2 and 3 (like your patch)
> 
> In that case we will *not* stop on pure success, and we will stop on
> both failures modes.

I'm drafting the new patch and got something that seems a little stupid:

@@ -1989,13 +1991,15 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
                        spin_lock(&space_info->lock);
                        space_info->reclaim_errors++;
                        if (READ_ONCE(space_info->periodic_reclaim))
-                               space_info->periodic_reclaim_ready = false;
+                               btrfs_set_periodic_reclaim_ready(space_info, false);
                        spin_unlock(&space_info->lock);
                }
                spin_lock(&space_info->lock);
                space_info->reclaim_count++;
                space_info->reclaim_bytes += used;
                space_info->reclaim_bytes += reserved;
+               if (space_info->total_bytes < old_total)
+                       btrfs_set_periodic_reclaim_ready(space_info, true);
                spin_unlock(&space_info->lock);

Maybe we should change it to something like this in the future:

		if (ret) {
			btrfs_dec_block_group_ro(bg);
			btrfs_err(fs_info, "error relocating chunk %llu",
				  bg->start);
			spin_lock(&space_info->lock);
			space_info->reclaim_count++;
			space_info->reclaim_errors++;
			if (READ_ONCE(space_info->periodic_reclaim))
				btrfs_set_periodic_reclaim_ready(space_info, false);
			spin_unlock(&space_info->lock);
		} else {
			spin_lock(&space_info->lock);
			space_info->reclaim_count++;
			space_info->reclaim_bytes += used;
			space_info->reclaim_bytes += reserved;
			if (space_info->total_bytes < old_total)
				btrfs_set_periodic_reclaim_ready(space_info, true);
			spin_unlock(&space_info->lock);
		}

But it's beyond the scope of this patch :)

> Going forward on top of that with deeper testing, there can be more changes
> like "special pause on 4 which gets unpaused on threshold change" or
> something. Or "always unpause on threshold change" and of course more radical
> ideas.
> 
> I'd just like to keep moving towards at least fixing the obvious
> horrible bug.
> 
> Thanks,
> Boris


end of thread, other threads:[~2026-01-08 15:11 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-03 12:19 [PATCH v2 0/7] btrfs: fix periodic reclaim condition with some cleanup Sun YangKai
2026-01-03 12:19 ` [PATCH v2 1/7] btrfs: fix periodic reclaim condition Sun YangKai
2026-01-04 19:40   ` Boris Burkov
2026-01-05 13:00     ` Sun Yangkai
2026-01-05 18:21       ` Boris Burkov
2026-01-07 14:09         ` Sun Yangkai
2026-01-07 17:57           ` Boris Burkov
2026-01-08 15:11             ` Sun Yangkai
2026-01-03 12:19 ` [PATCH v2 2/7] btrfs: use u8 for reclaim threshold type Sun YangKai
2026-01-03 12:19 ` [PATCH v2 3/7] btrfs: clarify reclaim sweep control flow Sun YangKai
2026-01-03 12:19 ` [PATCH v2 4/7] btrfs: change block group reclaim_mark to bool Sun YangKai
2026-01-03 12:19 ` [PATCH v2 5/7] btrfs: reorder btrfs_block_group members to reduce struct size Sun YangKai
2026-01-05 15:07   ` Filipe Manana
2026-01-05 15:26     ` Sun Yangkai
2026-01-03 12:19 ` [PATCH v2 6/7] btrfs: use proper types for btrfs_block_group fields Sun YangKai
2026-01-03 12:19 ` [PATCH v2 7/7] btrfs: consolidate reclaim readiness checks in btrfs_should_reclaim() Sun YangKai
