* [PATCH v2 0/6] btrfs: dynamic and periodic block_group reclaim
@ 2024-06-17 23:11 Boris Burkov
2024-06-17 23:11 ` [PATCH v2 1/6] btrfs: report reclaim stats in sysfs Boris Burkov
` (6 more replies)
0 siblings, 7 replies; 13+ messages in thread
From: Boris Burkov @ 2024-06-17 23:11 UTC (permalink / raw)
To: linux-btrfs, kernel-team
Btrfs's block_group allocator suffers from a well known problem: it can
eagerly allocate too much space to either data or metadata (most often
data, absent bugs) and then later be unable to allocate more space for
the other when needed. When data starves metadata, this is especially
painful, as it can result in read only filesystems that need careful
manual balancing to fix.
This can be worked around by:
- enabling automatic reclaim
- periodically running balance
The latter is widely deployed via btrfsmaintenance
(https://github.com/kdave/btrfsmaintenance) and the former is used at
scale at Meta with good results. However, neither of those solutions is
perfect, as they both currently use a fixed threshold. A fixed threshold
is vulnerable to workloads that trigger high amounts of reclaim. This
has led to btrfsmaintenance setting very conservative thresholds of 5
and 10 percent of data block groups.
(https://github.com/kdave/btrfsmaintenance/commit/edbbfffe592f47c2849a8825f523e2ccc38b15f5)
At Meta, we deal with an elevated level of reclaim that we would like to
reduce.
This patch set expands on automatic reclaim, adding the ability to set a
dynamic reclaim threshold that appropriately scales with the global file
system allocation conditions as well as periodic reclaim which runs that
reclaim sweep in the cleaner thread. Together, I believe they constitute
a robust and general automatic reclaim system that should avoid
unfortunate read only filesystems in all but extreme conditions, where
space is running quite low anyway and failure is more reasonable.
At a very high level, the dynamic threshold's strategy is to set a fixed
target of unallocated block groups (10 block groups) and linearly scale
its aggression the further we are from that target. That way we do no
automatic reclaim until we actually press against the unallocated
target, allowing the allocator to gradually fill fragmented space with
new extents, but do claw back space after workloads that use and free a
bunch of space, perhaps with fragmentation.
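As a rough userspace illustration of that scaling (this is not the kernel code; the 1GiB chunk size and the helper name are assumptions for the sketch):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the cover letter's strategy: no reclaim while unallocated
 * space is at or above a fixed target of 10 block groups, then a
 * linearly increasing threshold as unallocated space runs out.
 * CHUNK_SZ and dynamic_threshold_pct() are illustrative, not btrfs symbols.
 */
#define CHUNK_SZ   (1024ULL * 1024 * 1024)	/* assumed 1GiB data chunks */
#define TARGET_BGS 10ULL

static inline int dynamic_threshold_pct(uint64_t unalloc_bytes)
{
	uint64_t target = TARGET_BGS * CHUNK_SZ;
	uint64_t want = target > unalloc_bytes ? target - unalloc_bytes : 0;

	/* 0% at/above the target; grows toward 100% as unalloc approaches 0. */
	return (int)(want * 100 / target);
}
```

So a filesystem sitting at or above ten free chunks is never reclaimed, while one halfway to the target reclaims anything under 50% used.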
I ran it on three workloads (described in detail in the dynamic reclaim
patch):
1. bounce allocations around X% full.
2. fill up all the way and introduce full fragmentation.
3. write in a fragmented way until the filesystem is just about full.
The script can be found here:
https://github.com/boryas/scripts/tree/main/fio/reclaim
The important results can be seen here (full results are explorable at
https://bur.io/dyn-rec/):
bounce at 30%, higher relocations with a fixed threshold:
https://bur.io/dyn-rec/bounce/reclaims.png
https://bur.io/dyn-rec/bounce/reclaim_bytes.png
https://bur.io/dyn-rec/bounce/unalloc_bytes.png
hard 30% fragmentation, dynamic actually reclaims, relocs not crazy:
https://bur.io/dyn-rec/strict_frag/reclaims.png
https://bur.io/dyn-rec/strict_frag/reclaim_bytes.png
https://bur.io/dyn-rec/strict_frag/unalloc_bytes.png
fill it all the way up in a fragmented way, then keep making
allocations:
https://bur.io/dyn-rec/last_gig/reclaims.png
https://bur.io/dyn-rec/last_gig/reclaim_bytes.png
https://bur.io/dyn-rec/last_gig/unalloc_bytes.png
--
Changelog:
v2:
- add reclaim errors counter
- refactor reclaim counter to remove extra else
- account for zone unusable in threshold calculation
Boris Burkov (6):
btrfs: report reclaim stats in sysfs
btrfs: store fs_info on space_info
btrfs: dynamic block_group reclaim threshold
btrfs: periodic block_group reclaim
btrfs: prevent pathological periodic reclaim loops
btrfs: urgent periodic reclaim pass
fs/btrfs/block-group.c | 42 ++++++--
fs/btrfs/block-group.h | 1 +
fs/btrfs/space-info.c | 240 +++++++++++++++++++++++++++++++++++++++--
fs/btrfs/space-info.h | 48 +++++++++
fs/btrfs/sysfs.c | 83 +++++++++++++-
5 files changed, 391 insertions(+), 23 deletions(-)
--
2.45.2
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v2 1/6] btrfs: report reclaim stats in sysfs
2024-06-17 23:11 [PATCH v2 0/6] btrfs: dynamic and periodic block_group reclaim Boris Burkov
@ 2024-06-17 23:11 ` Boris Burkov
2024-06-17 23:11 ` [PATCH v2 2/6] btrfs: store fs_info on space_info Boris Burkov
` (5 subsequent siblings)
6 siblings, 0 replies; 13+ messages in thread
From: Boris Burkov @ 2024-06-17 23:11 UTC (permalink / raw)
To: linux-btrfs, kernel-team
When evaluating various reclaim strategies/thresholds against each
other, it is useful to collect data about the amount of reclaim
happening. Expose a count, error count, and byte count via sysfs
per space_info.
Note that this is only for automatic reclaim, not manually invoked
balances or other codepaths that use "relocate_block_group".
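The bookkeeping the patch does around the relocation call can be modeled in userspace roughly like this (the struct and function names here are illustrative reductions, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the patch's accounting: every relocation attempt bumps the
 * count; a success adds the block group's used bytes to reclaim_bytes,
 * a failure bumps reclaim_errors and contributes zero bytes. */
struct reclaim_stats {
	uint64_t reclaim_count;
	uint64_t reclaim_bytes;
	uint64_t reclaim_errors;
};

static void account_reclaim(struct reclaim_stats *s, uint64_t bg_used, int ret)
{
	s->reclaim_count++;		/* counted whether or not it worked */
	if (ret)
		s->reclaim_errors++;
	else
		s->reclaim_bytes += bg_used;
}
```

In the real patch these fields live on the space_info and are updated under space_info->lock.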
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/block-group.c | 10 ++++++++++
fs/btrfs/space-info.h | 18 ++++++++++++++++++
fs/btrfs/sysfs.c | 6 ++++++
3 files changed, 34 insertions(+)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 9f1d328b603e..824fd229d129 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1829,6 +1829,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
list_sort(NULL, &fs_info->reclaim_bgs, reclaim_bgs_cmp);
while (!list_empty(&fs_info->reclaim_bgs)) {
u64 zone_unusable;
+ u64 reclaimed;
int ret = 0;
bg = list_first_entry(&fs_info->reclaim_bgs,
@@ -1921,12 +1922,21 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
div64_u64(bg->used * 100, bg->length),
div64_u64(zone_unusable * 100, bg->length));
trace_btrfs_reclaim_block_group(bg);
+ reclaimed = bg->used;
ret = btrfs_relocate_chunk(fs_info, bg->start);
if (ret) {
btrfs_dec_block_group_ro(bg);
btrfs_err(fs_info, "error relocating chunk %llu",
bg->start);
+ reclaimed = 0;
+ spin_lock(&space_info->lock);
+ space_info->reclaim_errors++;
+ spin_unlock(&space_info->lock);
}
+ spin_lock(&space_info->lock);
+ space_info->reclaim_count++;
+ space_info->reclaim_bytes += reclaimed;
+ spin_unlock(&space_info->lock);
next:
if (ret) {
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index a733458fd13b..98ea35ae60fe 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -165,6 +165,24 @@ struct btrfs_space_info {
struct kobject kobj;
struct kobject *block_group_kobjs[BTRFS_NR_RAID_TYPES];
+
+ /*
+ * Monotonically increasing counter of block group reclaim attempts
+ * Exposed in /sys/fs/<uuid>/allocation/<type>/reclaim_count
+ */
+ u64 reclaim_count;
+
+ /*
+ * Monotonically increasing counter of reclaimed bytes
+ * Exposed in /sys/fs/<uuid>/allocation/<type>/reclaim_bytes
+ */
+ u64 reclaim_bytes;
+
+ /*
+ * Monotonically increasing counter of reclaim errors
+ * Exposed in /sys/fs/<uuid>/allocation/<type>/reclaim_errors
+ */
+ u64 reclaim_errors;
};
struct reserve_ticket {
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index af545b6b1190..919c7ba45121 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -894,6 +894,9 @@ SPACE_INFO_ATTR(bytes_readonly);
SPACE_INFO_ATTR(bytes_zone_unusable);
SPACE_INFO_ATTR(disk_used);
SPACE_INFO_ATTR(disk_total);
+SPACE_INFO_ATTR(reclaim_count);
+SPACE_INFO_ATTR(reclaim_bytes);
+SPACE_INFO_ATTR(reclaim_errors);
BTRFS_ATTR_RW(space_info, chunk_size, btrfs_chunk_size_show, btrfs_chunk_size_store);
BTRFS_ATTR(space_info, size_classes, btrfs_size_classes_show);
@@ -949,6 +952,9 @@ static struct attribute *space_info_attrs[] = {
BTRFS_ATTR_PTR(space_info, bg_reclaim_threshold),
BTRFS_ATTR_PTR(space_info, chunk_size),
BTRFS_ATTR_PTR(space_info, size_classes),
+ BTRFS_ATTR_PTR(space_info, reclaim_count),
+ BTRFS_ATTR_PTR(space_info, reclaim_bytes),
+ BTRFS_ATTR_PTR(space_info, reclaim_errors),
#ifdef CONFIG_BTRFS_DEBUG
BTRFS_ATTR_PTR(space_info, force_chunk_alloc),
#endif
--
2.45.2
* [PATCH v2 2/6] btrfs: store fs_info on space_info
2024-06-17 23:11 [PATCH v2 0/6] btrfs: dynamic and periodic block_group reclaim Boris Burkov
2024-06-17 23:11 ` [PATCH v2 1/6] btrfs: report reclaim stats in sysfs Boris Burkov
@ 2024-06-17 23:11 ` Boris Burkov
2024-06-17 23:11 ` [PATCH v2 3/6] btrfs: dynamic block_group reclaim threshold Boris Burkov
` (4 subsequent siblings)
6 siblings, 0 replies; 13+ messages in thread
From: Boris Burkov @ 2024-06-17 23:11 UTC (permalink / raw)
To: linux-btrfs, kernel-team
This is handy when computing space_info dynamic reclaim thresholds where
we do not have access to a block group. We could add it to the various
functions as a parameter, but it seems reasonable for space_info to have
an fs_info pointer.
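A minimal userspace sketch of the change (both structs reduced to stubs; names illustrative):

```c
#include <assert.h>
#include <stdlib.h>

/* Reduced model: the space_info gains a back-pointer to its fs_info,
 * set once at creation so later threshold code need not take the
 * fs_info as a parameter. */
struct fs_info { int dummy; };

struct space_info {
	struct fs_info *fs_info;
	/* allocation counters elided */
};

static struct space_info *create_space_info(struct fs_info *info)
{
	struct space_info *s = calloc(1, sizeof(*s));

	if (s)
		s->fs_info = info;	/* set once, read-only afterwards */
	return s;
}
```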
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/space-info.c | 1 +
fs/btrfs/space-info.h | 1 +
2 files changed, 2 insertions(+)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 0283ee9bf813..7384286c5058 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -232,6 +232,7 @@ static int create_space_info(struct btrfs_fs_info *info, u64 flags)
if (!space_info)
return -ENOMEM;
+ space_info->fs_info = info;
for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
INIT_LIST_HEAD(&space_info->block_groups[i]);
init_rwsem(&space_info->groups_sem);
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index 98ea35ae60fe..25edfd453b27 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -94,6 +94,7 @@ enum btrfs_flush_state {
};
struct btrfs_space_info {
+ struct btrfs_fs_info *fs_info;
spinlock_t lock;
u64 total_bytes; /* total bytes in the space,
--
2.45.2
* [PATCH v2 3/6] btrfs: dynamic block_group reclaim threshold
2024-06-17 23:11 [PATCH v2 0/6] btrfs: dynamic and periodic block_group reclaim Boris Burkov
2024-06-17 23:11 ` [PATCH v2 1/6] btrfs: report reclaim stats in sysfs Boris Burkov
2024-06-17 23:11 ` [PATCH v2 2/6] btrfs: store fs_info on space_info Boris Burkov
@ 2024-06-17 23:11 ` Boris Burkov
2024-06-25 13:40 ` Naohiro Aota
2024-06-17 23:11 ` [PATCH v2 4/6] btrfs: periodic block_group reclaim Boris Burkov
` (3 subsequent siblings)
6 siblings, 1 reply; 13+ messages in thread
From: Boris Burkov @ 2024-06-17 23:11 UTC (permalink / raw)
To: linux-btrfs, kernel-team
We can currently recover allocated block_groups by:
- explicitly starting balance operations
- "auto reclaim" via bg_reclaim_threshold
The latter works by checking against a fixed threshold on frees. If we
pass from above the threshold to below it, relocation triggers and the
block group will get reclaimed by the cleaner thread (assuming it is
still eligible).
Picking a threshold is challenging. Set it too high, and you end up
trying to reclaim very full block_groups, which is quite costly, while
missing block_groups that never get quite THAT full but are still
fragmented and stranding a lot of space. Set it too low, and you
similarly miss out on reclaim even when you badly need it to avoid
running out of unallocated space, because heavily fragmented block
groups can live above the threshold.
No matter the threshold, it suffers from workloads that happen to bounce
around it, which can introduce arbitrary amounts of reclaim waste.
To improve this situation, introduce a dynamic threshold. The basic idea
behind this threshold is that it should be very lax when there is plenty
of unallocated space, and increasingly aggressive as we approach zero
unallocated space. To that end, it sets a target for unallocated space
(10 chunks) and then linearly increases the threshold the further we
fall short of that target. The formula is:
threshold (%) = 100 * (target - unalloc) / target
I tested this by running it on three interesting workloads:
1. bounce allocations around X% full.
2. fill up all the way and introduce full fragmentation.
3. write in a fragmented way until the filesystem is just about full.
1. and 2. attack the weaknesses of a fixed threshold; a fixed threshold
either works perfectly or falls apart completely, depending on where it
sits relative to the workload. The dynamic threshold handles both cases
well.
3. attacks the dynamic threshold by checking whether it is too zealous
to reclaim in conditions with low unallocated and low unused space. It
tends to claw back 1GiB of unallocated space fairly aggressively, but
not much more. Early versions of the dynamic threshold struggled on this
test.
Additional work could be done to intelligently ratchet up the urgency of
reclaim in very low unallocated conditions. Existing mechanisms are
already useless in that case anyway.
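One detail the formula glosses over is that computing a percentage as 100 * x can overflow a u64; a userspace sketch of an overflow-safe helper in the spirit of the patch's calc_pct_ratio() (the name pct_ratio here is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Overflow-safe integer percentage 100*x/y: if the multiplication
 * would overflow, shift both operands down (trading some precision
 * for safety) and retry, mirroring the patch's calc_pct_ratio(). */
static uint64_t pct_ratio(uint64_t x, uint64_t y)
{
	if (y == 0)
		return 0;
	while (x > UINT64_MAX / 100) {
		x >>= 10;
		y >>= 10;
		if (y == 0)
			y = 1;	/* avoid dividing by zero after the shift */
	}
	return 100 * x / y;
}
```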
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/block-group.c | 18 ++++---
fs/btrfs/space-info.c | 115 +++++++++++++++++++++++++++++++++++++----
fs/btrfs/space-info.h | 8 +++
fs/btrfs/sysfs.c | 43 ++++++++++++++-
4 files changed, 164 insertions(+), 20 deletions(-)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 824fd229d129..c3313697475f 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1764,24 +1764,21 @@ static inline bool btrfs_should_reclaim(struct btrfs_fs_info *fs_info)
static bool should_reclaim_block_group(struct btrfs_block_group *bg, u64 bytes_freed)
{
- const struct btrfs_space_info *space_info = bg->space_info;
- const int reclaim_thresh = READ_ONCE(space_info->bg_reclaim_threshold);
+ const int thresh_pct = btrfs_calc_reclaim_threshold(bg->space_info);
+ u64 thresh_bytes = mult_perc(bg->length, thresh_pct);
const u64 new_val = bg->used;
const u64 old_val = new_val + bytes_freed;
- u64 thresh;
- if (reclaim_thresh == 0)
+ if (thresh_bytes == 0)
return false;
- thresh = mult_perc(bg->length, reclaim_thresh);
-
/*
* If we were below the threshold before don't reclaim, we are likely a
* brand new block group and we don't want to relocate new block groups.
*/
- if (old_val < thresh)
+ if (old_val < thresh_bytes)
return false;
- if (new_val >= thresh)
+ if (new_val >= thresh_bytes)
return false;
return true;
}
@@ -1843,6 +1840,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
/* Don't race with allocators so take the groups_sem */
down_write(&space_info->groups_sem);
+ spin_lock(&space_info->lock);
spin_lock(&bg->lock);
if (bg->reserved || bg->pinned || bg->ro) {
/*
@@ -1852,6 +1850,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
* this block group.
*/
spin_unlock(&bg->lock);
+ spin_unlock(&space_info->lock);
up_write(&space_info->groups_sem);
goto next;
}
@@ -1870,6 +1869,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
if (!btrfs_test_opt(fs_info, DISCARD_ASYNC))
btrfs_mark_bg_unused(bg);
spin_unlock(&bg->lock);
+ spin_unlock(&space_info->lock);
up_write(&space_info->groups_sem);
goto next;
@@ -1886,10 +1886,12 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
*/
if (!should_reclaim_block_group(bg, bg->length)) {
spin_unlock(&bg->lock);
+ spin_unlock(&space_info->lock);
up_write(&space_info->groups_sem);
goto next;
}
spin_unlock(&bg->lock);
+ spin_unlock(&space_info->lock);
/*
* Get out fast, in case we're read-only or unmounting the
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 7384286c5058..0d13282dac05 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
+#include <linux/minmax.h>
#include "misc.h"
#include "ctree.h"
#include "space-info.h"
@@ -190,6 +191,8 @@ void btrfs_clear_space_info_full(struct btrfs_fs_info *info)
*/
#define BTRFS_DEFAULT_ZONED_RECLAIM_THRESH (75)
+#define BTRFS_UNALLOC_BLOCK_GROUP_TARGET (10ULL)
+
/*
* Calculate chunk size depending on volume type (regular or zoned).
*/
@@ -341,11 +344,27 @@ struct btrfs_space_info *btrfs_find_space_info(struct btrfs_fs_info *info,
return NULL;
}
+static u64 calc_effective_data_chunk_size(struct btrfs_fs_info *fs_info)
+{
+ struct btrfs_space_info *data_sinfo;
+ u64 data_chunk_size;
+ /*
+ * Calculate the data_chunk_size, space_info->chunk_size is the
+ * "optimal" chunk size based on the fs size. However when we actually
+ * allocate the chunk we will strip this down further, making it no more
+ * than 10% of the disk or 1G, whichever is smaller.
+ */
+ data_sinfo = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_DATA);
+ data_chunk_size = min(data_sinfo->chunk_size,
+ mult_perc(fs_info->fs_devices->total_rw_bytes, 10));
+ return min_t(u64, data_chunk_size, SZ_1G);
+
+}
+
static u64 calc_available_free_space(struct btrfs_fs_info *fs_info,
struct btrfs_space_info *space_info,
enum btrfs_reserve_flush_enum flush)
{
- struct btrfs_space_info *data_sinfo;
u64 profile;
u64 avail;
u64 data_chunk_size;
@@ -369,16 +388,7 @@ static u64 calc_available_free_space(struct btrfs_fs_info *fs_info,
if (avail == 0)
return 0;
- /*
- * Calculate the data_chunk_size, space_info->chunk_size is the
- * "optimal" chunk size based on the fs size. However when we actually
- * allocate the chunk we will strip this down further, making it no more
- * than 10% of the disk or 1G, whichever is smaller.
- */
- data_sinfo = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_DATA);
- data_chunk_size = min(data_sinfo->chunk_size,
- mult_perc(fs_info->fs_devices->total_rw_bytes, 10));
- data_chunk_size = min_t(u64, data_chunk_size, SZ_1G);
+ data_chunk_size = calc_effective_data_chunk_size(fs_info);
/*
* Since data allocations immediately use block groups as part of the
@@ -1860,3 +1870,86 @@ u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo)
return free_bytes;
}
+
+static u64 calc_pct_ratio(u64 x, u64 y)
+{
+ int err;
+
+ if (!y)
+ return 0;
+again:
+ err = check_mul_overflow(100, x, &x);
+ if (err)
+ goto lose_precision;
+ return div64_u64(x, y);
+lose_precision:
+ x >>= 10;
+ y >>= 10;
+ if (!y)
+ y = 1;
+ goto again;
+}
+
+/*
+ * A reasonable buffer for unallocated space is 10 data block_groups.
+ * If we claw this back repeatedly, we can still achieve efficient
+ * utilization when near full, and not do too much reclaim while
+ * always maintaining a solid buffer for workloads that quickly
+ * allocate and pressure the unallocated space.
+ */
+static u64 calc_unalloc_target(struct btrfs_fs_info *fs_info)
+{
+ return BTRFS_UNALLOC_BLOCK_GROUP_TARGET * calc_effective_data_chunk_size(fs_info);
+}
+
+/*
+ * The fundamental goal of automatic reclaim is to protect the filesystem's
+ * unallocated space and thus minimize the probability of the filesystem going
+ * read only when a metadata allocation failure causes a transaction abort.
+ *
+ * However, relocations happen into the space_info's unused space, therefore
+ * automatic reclaim must also back off as that space runs low. There is no
+ * value in doing trivial "relocations" of re-writing the same block group
+ * into a fresh one.
+ *
+ * Furthermore, we want to avoid doing too much reclaim even if there are good
+ * candidates. This is because the allocator is pretty good at filling up the
+ * holes with writes. So we want to do just enough reclaim to try and stay
+ * safe from running out of unallocated space but not be wasteful about it.
+ *
+ * Therefore, the dynamic reclaim threshold is calculated as follows:
+ * - calculate a target unallocated amount of 10 block group sized chunks
+ * - ratchet up the intensity of reclaim depending on how far we are from
+ * that target by using a formula of unalloc / target to set the threshold.
+ *
+ * Typically with 10 block groups as the target, the discrete values this comes
+ * out to are 0, 10, 20, ... , 80, 90, and 99.
+ */
+static int calc_dynamic_reclaim_threshold(struct btrfs_space_info *space_info)
+{
+ struct btrfs_fs_info *fs_info = space_info->fs_info;
+ u64 unalloc = atomic64_read(&fs_info->free_chunk_space);
+ u64 target = calc_unalloc_target(fs_info);
+ u64 alloc = space_info->total_bytes;
+ u64 used = btrfs_space_info_used(space_info, false);
+ u64 unused = alloc - used;
+ u64 want = target > unalloc ? target - unalloc : 0;
+ u64 data_chunk_size = calc_effective_data_chunk_size(fs_info);
+ /* Cast to int is OK because want <= target */
+ int ratio = calc_pct_ratio(want, target);
+
+ /* If we have no unused space, don't bother, it won't work anyway */
+ if (unused < data_chunk_size)
+ return 0;
+
+ return ratio;
+}
+
+int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info)
+{
+ lockdep_assert_held(&space_info->lock);
+
+ if (READ_ONCE(space_info->dynamic_reclaim))
+ return calc_dynamic_reclaim_threshold(space_info);
+ return READ_ONCE(space_info->bg_reclaim_threshold);
+}
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index 25edfd453b27..2cac771321c7 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -184,6 +184,12 @@ struct btrfs_space_info {
* Exposed in /sys/fs/<uuid>/allocation/<type>/reclaim_errors
*/
u64 reclaim_errors;
+
+ /*
+ * If true, use the dynamic relocation threshold, instead of the
+ * fixed bg_reclaim_threshold.
+ */
+ bool dynamic_reclaim;
};
struct reserve_ticket {
@@ -266,4 +272,6 @@ void btrfs_dump_space_info_for_trans_abort(struct btrfs_fs_info *fs_info);
void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info);
u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
+int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info);
+
#endif /* BTRFS_SPACE_INFO_H */
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 919c7ba45121..360d6093476f 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -905,8 +905,12 @@ static ssize_t btrfs_sinfo_bg_reclaim_threshold_show(struct kobject *kobj,
char *buf)
{
struct btrfs_space_info *space_info = to_space_info(kobj);
+ ssize_t ret;
- return sysfs_emit(buf, "%d\n", READ_ONCE(space_info->bg_reclaim_threshold));
+ spin_lock(&space_info->lock);
+ ret = sysfs_emit(buf, "%d\n", btrfs_calc_reclaim_threshold(space_info));
+ spin_unlock(&space_info->lock);
+ return ret;
}
static ssize_t btrfs_sinfo_bg_reclaim_threshold_store(struct kobject *kobj,
@@ -917,6 +921,9 @@ static ssize_t btrfs_sinfo_bg_reclaim_threshold_store(struct kobject *kobj,
int thresh;
int ret;
+ if (READ_ONCE(space_info->dynamic_reclaim))
+ return -EINVAL;
+
ret = kstrtoint(buf, 10, &thresh);
if (ret)
return ret;
@@ -933,6 +940,39 @@ BTRFS_ATTR_RW(space_info, bg_reclaim_threshold,
btrfs_sinfo_bg_reclaim_threshold_show,
btrfs_sinfo_bg_reclaim_threshold_store);
+static ssize_t btrfs_sinfo_dynamic_reclaim_show(struct kobject *kobj,
+ struct kobj_attribute *a,
+ char *buf)
+{
+ struct btrfs_space_info *space_info = to_space_info(kobj);
+
+ return sysfs_emit(buf, "%d\n", READ_ONCE(space_info->dynamic_reclaim));
+}
+
+static ssize_t btrfs_sinfo_dynamic_reclaim_store(struct kobject *kobj,
+ struct kobj_attribute *a,
+ const char *buf, size_t len)
+{
+ struct btrfs_space_info *space_info = to_space_info(kobj);
+ int dynamic_reclaim;
+ int ret;
+
+ ret = kstrtoint(buf, 10, &dynamic_reclaim);
+ if (ret)
+ return ret;
+
+ if (dynamic_reclaim < 0)
+ return -EINVAL;
+
+ WRITE_ONCE(space_info->dynamic_reclaim, dynamic_reclaim != 0);
+
+ return len;
+}
+
+BTRFS_ATTR_RW(space_info, dynamic_reclaim,
+ btrfs_sinfo_dynamic_reclaim_show,
+ btrfs_sinfo_dynamic_reclaim_store);
+
/*
* Allocation information about block group types.
*
@@ -950,6 +990,7 @@ static struct attribute *space_info_attrs[] = {
BTRFS_ATTR_PTR(space_info, disk_used),
BTRFS_ATTR_PTR(space_info, disk_total),
BTRFS_ATTR_PTR(space_info, bg_reclaim_threshold),
+ BTRFS_ATTR_PTR(space_info, dynamic_reclaim),
BTRFS_ATTR_PTR(space_info, chunk_size),
BTRFS_ATTR_PTR(space_info, size_classes),
BTRFS_ATTR_PTR(space_info, reclaim_count),
--
2.45.2
* [PATCH v2 4/6] btrfs: periodic block_group reclaim
2024-06-17 23:11 [PATCH v2 0/6] btrfs: dynamic and periodic block_group reclaim Boris Burkov
` (2 preceding siblings ...)
2024-06-17 23:11 ` [PATCH v2 3/6] btrfs: dynamic block_group reclaim threshold Boris Burkov
@ 2024-06-17 23:11 ` Boris Burkov
2024-06-17 23:11 ` [PATCH v2 5/6] btrfs: prevent pathological periodic reclaim loops Boris Burkov
` (2 subsequent siblings)
6 siblings, 0 replies; 13+ messages in thread
From: Boris Burkov @ 2024-06-17 23:11 UTC (permalink / raw)
To: linux-btrfs, kernel-team
We currently employ an edge-triggered block group reclaim strategy,
which marks block groups for reclaim as they free down past a threshold.
With a dynamic threshold, this is worse than doing it periodically in a
level-triggered fashion. That is because the reclaim itself happens
periodically, so the threshold at that point in time is what really
matters, not the threshold at freeing time. If we mark block groups in a
big pass, then sort by usage and do reclaim, we also benefit from a
negative feedback loop preventing unnecessary reclaims as we crunch
through the "best" candidates.
Since this is quite a different model, it requires some additional
support. The edge-triggered reclaim has a good heuristic for not
reclaiming fresh block groups, so we need to replace that with a typical
GC sweep mark, which skips block groups that have seen an allocation
since the last sweep.
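The sweep-with-mark idea can be sketched in userspace like so (names and layout are illustrative; the real code additionally takes locks and holds block group references):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Level-triggered sweep model: a group is a candidate only if it is
 * below the usage threshold AND its mark survived a full sweep, i.e.
 * no allocation touched it since last time. Allocations reset the
 * mark to 0, as btrfs_update_block_group() does in the patch. */
struct bg {
	uint64_t length;
	uint64_t used;
	uint64_t reclaim_mark;
};

static bool sweep_should_reclaim(struct bg *bg, int thresh_pct)
{
	uint64_t thresh = bg->length * (uint64_t)thresh_pct / 100;
	bool reclaim = bg->used < thresh && bg->reclaim_mark != 0;

	bg->reclaim_mark++;	/* survives to the next sweep unless reset */
	return reclaim;
}
```

A freshly allocated group is therefore always skipped on the first sweep after its last allocation, replacing the edge-triggered "new block group" heuristic.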
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/block-group.c | 2 ++
fs/btrfs/block-group.h | 1 +
fs/btrfs/space-info.c | 51 ++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/space-info.h | 7 ++++++
fs/btrfs/sysfs.c | 34 ++++++++++++++++++++++++++++
5 files changed, 95 insertions(+)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index c3313697475f..6bcf24f2ac79 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1974,6 +1974,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
void btrfs_reclaim_bgs(struct btrfs_fs_info *fs_info)
{
+ btrfs_reclaim_sweep(fs_info);
spin_lock(&fs_info->unused_bgs_lock);
if (!list_empty(&fs_info->reclaim_bgs))
queue_work(system_unbound_wq, &fs_info->reclaim_bgs_work);
@@ -3672,6 +3673,7 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
old_val += num_bytes;
cache->used = old_val;
cache->reserved -= num_bytes;
+ cache->reclaim_mark = 0;
space_info->bytes_reserved -= num_bytes;
space_info->bytes_used += num_bytes;
space_info->disk_used += num_bytes * factor;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 85e2d4cd12dc..8656b38f1fa5 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -263,6 +263,7 @@ struct btrfs_block_group {
struct work_struct zone_finish_work;
struct extent_buffer *last_eb;
enum btrfs_block_group_size_class size_class;
+ u64 reclaim_mark;
};
static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 0d13282dac05..ff92ad26ffa2 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1953,3 +1953,54 @@ int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info)
return calc_dynamic_reclaim_threshold(space_info);
return READ_ONCE(space_info->bg_reclaim_threshold);
}
+
+static int do_reclaim_sweep(struct btrfs_fs_info *fs_info,
+ struct btrfs_space_info *space_info, int raid)
+{
+ struct btrfs_block_group *bg;
+ int thresh_pct;
+
+ spin_lock(&space_info->lock);
+ thresh_pct = btrfs_calc_reclaim_threshold(space_info);
+ spin_unlock(&space_info->lock);
+
+ down_read(&space_info->groups_sem);
+ list_for_each_entry(bg, &space_info->block_groups[raid], list) {
+ u64 thresh;
+ bool reclaim = false;
+
+ btrfs_get_block_group(bg);
+ spin_lock(&bg->lock);
+ thresh = mult_perc(bg->length, thresh_pct);
+ if (bg->used < thresh && bg->reclaim_mark)
+ reclaim = true;
+ bg->reclaim_mark++;
+ spin_unlock(&bg->lock);
+ if (reclaim)
+ btrfs_mark_bg_to_reclaim(bg);
+ btrfs_put_block_group(bg);
+ }
+ up_read(&space_info->groups_sem);
+ return 0;
+}
+
+int btrfs_reclaim_sweep(struct btrfs_fs_info *fs_info)
+{
+ int ret = 0;
+ int raid;
+ struct btrfs_space_info *space_info;
+
+ list_for_each_entry(space_info, &fs_info->space_info, list) {
+ if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
+ continue;
+ if (!READ_ONCE(space_info->periodic_reclaim))
+ continue;
+ for (raid = 0; raid < BTRFS_NR_RAID_TYPES; raid++) {
+ ret = do_reclaim_sweep(fs_info, space_info, raid);
+ if (ret)
+ return ret;
+ }
+ }
+
+ return ret;
+}
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index 2cac771321c7..ae4a1f7d5856 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -190,6 +190,12 @@ struct btrfs_space_info {
* fixed bg_reclaim_threshold.
*/
bool dynamic_reclaim;
+
+ /*
+ * Periodically check all block groups against the reclaim
+ * threshold in the cleaner thread.
+ */
+ bool periodic_reclaim;
};
struct reserve_ticket {
@@ -273,5 +279,6 @@ void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info);
u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info);
+int btrfs_reclaim_sweep(struct btrfs_fs_info *fs_info);
#endif /* BTRFS_SPACE_INFO_H */
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 360d6093476f..c58cea0da597 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -973,6 +973,39 @@ BTRFS_ATTR_RW(space_info, dynamic_reclaim,
btrfs_sinfo_dynamic_reclaim_show,
btrfs_sinfo_dynamic_reclaim_store);
+static ssize_t btrfs_sinfo_periodic_reclaim_show(struct kobject *kobj,
+ struct kobj_attribute *a,
+ char *buf)
+{
+ struct btrfs_space_info *space_info = to_space_info(kobj);
+
+ return sysfs_emit(buf, "%d\n", READ_ONCE(space_info->periodic_reclaim));
+}
+
+static ssize_t btrfs_sinfo_periodic_reclaim_store(struct kobject *kobj,
+ struct kobj_attribute *a,
+ const char *buf, size_t len)
+{
+ struct btrfs_space_info *space_info = to_space_info(kobj);
+ int periodic_reclaim;
+ int ret;
+
+ ret = kstrtoint(buf, 10, &periodic_reclaim);
+ if (ret)
+ return ret;
+
+ if (periodic_reclaim < 0)
+ return -EINVAL;
+
+ WRITE_ONCE(space_info->periodic_reclaim, periodic_reclaim != 0);
+
+ return len;
+}
+
+BTRFS_ATTR_RW(space_info, periodic_reclaim,
+ btrfs_sinfo_periodic_reclaim_show,
+ btrfs_sinfo_periodic_reclaim_store);
+
/*
* Allocation information about block group types.
*
@@ -996,6 +1029,7 @@ static struct attribute *space_info_attrs[] = {
BTRFS_ATTR_PTR(space_info, reclaim_count),
BTRFS_ATTR_PTR(space_info, reclaim_bytes),
BTRFS_ATTR_PTR(space_info, reclaim_errors),
+ BTRFS_ATTR_PTR(space_info, periodic_reclaim),
#ifdef CONFIG_BTRFS_DEBUG
BTRFS_ATTR_PTR(space_info, force_chunk_alloc),
#endif
--
2.45.2
* [PATCH v2 5/6] btrfs: prevent pathological periodic reclaim loops
2024-06-17 23:11 [PATCH v2 0/6] btrfs: dynamic and periodic block_group reclaim Boris Burkov
` (3 preceding siblings ...)
2024-06-17 23:11 ` [PATCH v2 4/6] btrfs: periodic block_group reclaim Boris Burkov
@ 2024-06-17 23:11 ` Boris Burkov
2024-06-24 15:23 ` Josef Bacik
2025-12-26 4:18 ` Sun Yangkai
2024-06-17 23:11 ` [PATCH v2 6/6] btrfs: urgent periodic reclaim pass Boris Burkov
2024-06-24 15:25 ` [PATCH v2 0/6] btrfs: dynamic and periodic block_group reclaim Josef Bacik
6 siblings, 2 replies; 13+ messages in thread
From: Boris Burkov @ 2024-06-17 23:11 UTC (permalink / raw)
To: linux-btrfs, kernel-team
Periodic reclaim runs the risk of getting stuck in a state where it
keeps reclaiming the same block group over and over. This can happen if:
1. reclaiming that block_group fails, or
2. reclaiming that block_group fails to move any extents into existing
block_groups and just allocates a fresh chunk and moves everything.
Currently, 1. is a very tight loop inside the reclaim worker. That is
critical for edge triggered reclaim or else we risk forgetting about a
reclaimable group. On the other hand, with level triggered reclaim we
can break out of that loop and get it later.
With that fixed, 2. applies to both failures and "successes" that make
no progress. If we have done a periodic reclaim pass on a space_info and
nothing in that space_info has changed, there is not much point in
trying again, so don't, until enough space gets freed, which we capture
with a heuristic of needing to net free 1 chunk.
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/block-group.c | 12 ++++++---
fs/btrfs/space-info.c | 56 ++++++++++++++++++++++++++++++++++++------
fs/btrfs/space-info.h | 14 +++++++++++
3 files changed, 71 insertions(+), 11 deletions(-)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 6bcf24f2ac79..ba9afb94e7ce 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1933,6 +1933,8 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
reclaimed = 0;
spin_lock(&space_info->lock);
space_info->reclaim_errors++;
+ if (READ_ONCE(space_info->periodic_reclaim))
+ space_info->periodic_reclaim_ready = false;
spin_unlock(&space_info->lock);
}
spin_lock(&space_info->lock);
@@ -1941,7 +1943,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
spin_unlock(&space_info->lock);
next:
- if (ret) {
+ if (ret && !READ_ONCE(space_info->periodic_reclaim)) {
/* Refcount held by the reclaim_bgs list after splice. */
btrfs_get_block_group(bg);
list_add_tail(&bg->bg_list, &retry_list);
@@ -3677,6 +3679,8 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
space_info->bytes_reserved -= num_bytes;
space_info->bytes_used += num_bytes;
space_info->disk_used += num_bytes * factor;
+ if (READ_ONCE(space_info->periodic_reclaim))
+ btrfs_space_info_update_reclaimable(space_info, -num_bytes);
spin_unlock(&cache->lock);
spin_unlock(&space_info->lock);
} else {
@@ -3686,8 +3690,10 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
btrfs_space_info_update_bytes_pinned(info, space_info, num_bytes);
space_info->bytes_used -= num_bytes;
space_info->disk_used -= num_bytes * factor;
-
- reclaim = should_reclaim_block_group(cache, num_bytes);
+ if (READ_ONCE(space_info->periodic_reclaim))
+ btrfs_space_info_update_reclaimable(space_info, num_bytes);
+ else
+ reclaim = should_reclaim_block_group(cache, num_bytes);
spin_unlock(&cache->lock);
spin_unlock(&space_info->lock);
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index ff92ad26ffa2..e7a2aa751f8f 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
+#include "linux/spinlock.h"
#include <linux/minmax.h>
#include "misc.h"
#include "ctree.h"
@@ -1899,7 +1900,9 @@ static u64 calc_pct_ratio(u64 x, u64 y)
*/
static u64 calc_unalloc_target(struct btrfs_fs_info *fs_info)
{
- return BTRFS_UNALLOC_BLOCK_GROUP_TARGET * calc_effective_data_chunk_size(fs_info);
+ u64 chunk_sz = calc_effective_data_chunk_size(fs_info);
+
+ return BTRFS_UNALLOC_BLOCK_GROUP_TARGET * chunk_sz;
}
/*
@@ -1935,14 +1938,13 @@ static int calc_dynamic_reclaim_threshold(struct btrfs_space_info *space_info)
u64 unused = alloc - used;
u64 want = target > unalloc ? target - unalloc : 0;
u64 data_chunk_size = calc_effective_data_chunk_size(fs_info);
- /* Cast to int is OK because want <= target */
- int ratio = calc_pct_ratio(want, target);
- /* If we have no unused space, don't bother, it won't work anyway */
+ /* If we have no unused space, don't bother, it won't work anyway. */
if (unused < data_chunk_size)
return 0;
- return ratio;
+ /* Cast to int is OK because want <= target. */
+ return calc_pct_ratio(want, target);
}
int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info)
@@ -1984,6 +1986,46 @@ static int do_reclaim_sweep(struct btrfs_fs_info *fs_info,
return 0;
}
+void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes)
+{
+ u64 chunk_sz = calc_effective_data_chunk_size(space_info->fs_info);
+
+ assert_spin_locked(&space_info->lock);
+ space_info->reclaimable_bytes += bytes;
+
+ if (space_info->reclaimable_bytes >= chunk_sz)
+ btrfs_set_periodic_reclaim_ready(space_info, true);
+}
+
+void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready)
+{
+ assert_spin_locked(&space_info->lock);
+ if (!READ_ONCE(space_info->periodic_reclaim))
+ return;
+ if (ready != space_info->periodic_reclaim_ready) {
+ space_info->periodic_reclaim_ready = ready;
+ if (!ready)
+ space_info->reclaimable_bytes = 0;
+ }
+}
+
+bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info)
+{
+ bool ret;
+
+ if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
+ return false;
+ if (!READ_ONCE(space_info->periodic_reclaim))
+ return false;
+
+ spin_lock(&space_info->lock);
+ ret = space_info->periodic_reclaim_ready;
+ btrfs_set_periodic_reclaim_ready(space_info, false);
+ spin_unlock(&space_info->lock);
+
+ return ret;
+}
+
int btrfs_reclaim_sweep(struct btrfs_fs_info *fs_info)
{
int ret;
@@ -1991,9 +2033,7 @@ int btrfs_reclaim_sweep(struct btrfs_fs_info *fs_info)
struct btrfs_space_info *space_info;
list_for_each_entry(space_info, &fs_info->space_info, list) {
- if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
- continue;
- if (!READ_ONCE(space_info->periodic_reclaim))
+ if (!btrfs_should_periodic_reclaim(space_info))
continue;
for (raid = 0; raid < BTRFS_NR_RAID_TYPES; raid++) {
ret = do_reclaim_sweep(fs_info, space_info, raid);
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index ae4a1f7d5856..4db8a0267c16 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -196,6 +196,17 @@ struct btrfs_space_info {
* threshold in the cleaner thread.
*/
bool periodic_reclaim;
+
+ /*
+ * Periodic reclaim should be a no-op if a space_info hasn't
+ * freed any space since the last time we tried.
+ */
+ bool periodic_reclaim_ready;
+
+ /*
+ * Net bytes freed or allocated since the last reclaim pass.
+ */
+ s64 reclaimable_bytes;
};
struct reserve_ticket {
@@ -278,6 +289,9 @@ void btrfs_dump_space_info_for_trans_abort(struct btrfs_fs_info *fs_info);
void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info);
u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
+void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes);
+void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready);
+bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info);
int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info);
int btrfs_reclaim_sweep(struct btrfs_fs_info *fs_info);
--
2.45.2
* [PATCH v2 6/6] btrfs: urgent periodic reclaim pass
2024-06-17 23:11 [PATCH v2 0/6] btrfs: dynamic and periodic block_group reclaim Boris Burkov
` (4 preceding siblings ...)
2024-06-17 23:11 ` [PATCH v2 5/6] btrfs: prevent pathological periodic reclaim loops Boris Burkov
@ 2024-06-17 23:11 ` Boris Burkov
2024-06-24 15:25 ` [PATCH v2 0/6] btrfs: dynamic and periodic block_group reclaim Josef Bacik
6 siblings, 0 replies; 13+ messages in thread
From: Boris Burkov @ 2024-06-17 23:11 UTC (permalink / raw)
To: linux-btrfs, kernel-team
Periodic reclaim attempts to avoid reclaiming block_groups that are seeing
active use, using a sweep mark that gets cleared on allocation and set on a
sweep. In urgent conditions, where we have very little unallocated space
(less than the chunk size used by the threshold calculation for the
unallocated target), we want to be able to override this mechanism.
Introduce a second pass that only happens if we fail to find a reclaim
candidate and reclaim is urgent. In that case, do a second pass where
all block groups are eligible.
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/space-info.c | 35 ++++++++++++++++++++++++++++++++++-
1 file changed, 34 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index e7a2aa751f8f..95e65d5163ab 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1956,17 +1956,35 @@ int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info)
return READ_ONCE(space_info->bg_reclaim_threshold);
}
+/*
+ * Under "urgent" reclaim, we will reclaim even fresh block groups that have
+ * recently seen successful allocations, as we are desperate to reclaim
+ * whatever we can to avoid ENOSPC in a transaction leading to a readonly fs.
+ */
+static bool is_reclaim_urgent(struct btrfs_space_info *space_info)
+{
+ struct btrfs_fs_info *fs_info = space_info->fs_info;
+ u64 unalloc = atomic64_read(&fs_info->free_chunk_space);
+ u64 data_chunk_size = calc_effective_data_chunk_size(fs_info);
+
+ return unalloc < data_chunk_size;
+}
+
static int do_reclaim_sweep(struct btrfs_fs_info *fs_info,
struct btrfs_space_info *space_info, int raid)
{
struct btrfs_block_group *bg;
int thresh_pct;
+ bool try_again = true;
+ bool urgent;
spin_lock(&space_info->lock);
+ urgent = is_reclaim_urgent(space_info);
thresh_pct = btrfs_calc_reclaim_threshold(space_info);
spin_unlock(&space_info->lock);
down_read(&space_info->groups_sem);
+again:
list_for_each_entry(bg, &space_info->block_groups[raid], list) {
u64 thresh;
bool reclaim = false;
@@ -1974,14 +1992,29 @@ static int do_reclaim_sweep(struct btrfs_fs_info *fs_info,
btrfs_get_block_group(bg);
spin_lock(&bg->lock);
thresh = mult_perc(bg->length, thresh_pct);
- if (bg->used < thresh && bg->reclaim_mark)
+ if (bg->used < thresh && bg->reclaim_mark) {
+ try_again = false;
reclaim = true;
+ }
bg->reclaim_mark++;
spin_unlock(&bg->lock);
if (reclaim)
btrfs_mark_bg_to_reclaim(bg);
btrfs_put_block_group(bg);
}
+
+ /*
+ * In situations where we are very motivated to reclaim (low unalloc)
+ * use two passes to make the reclaim mark check best effort.
+ *
+ * If we have any staler groups, we don't touch the fresher ones, but if we
+ * really need a block group, do take a fresh one.
+ */
+ if (try_again && urgent) {
+ try_again = false;
+ goto again;
+ }
+
up_read(&space_info->groups_sem);
return 0;
}
--
2.45.2
* Re: [PATCH v2 5/6] btrfs: prevent pathological periodic reclaim loops
2024-06-17 23:11 ` [PATCH v2 5/6] btrfs: prevent pathological periodic reclaim loops Boris Burkov
@ 2024-06-24 15:23 ` Josef Bacik
2024-06-24 16:05 ` David Sterba
2025-12-26 4:18 ` Sun Yangkai
1 sibling, 1 reply; 13+ messages in thread
From: Josef Bacik @ 2024-06-24 15:23 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs, kernel-team
On Mon, Jun 17, 2024 at 04:11:17PM -0700, Boris Burkov wrote:
> Periodic reclaim runs the risk of getting stuck in a state where it
> keeps reclaiming the same block group over and over. This can happen if
> 1. reclaiming that block_group fails
> 2. reclaiming that block_group fails to move any extents into existing
> block_groups and just allocates a fresh chunk and moves everything.
>
> Currently, 1. is a very tight loop inside the reclaim worker. That is
> critical for edge triggered reclaim or else we risk forgetting about a
> reclaimable group. On the other hand, with level triggered reclaim we
> can break out of that loop and get it later.
>
> With that fixed, 2. applies to both failures and "successes" with no
> progress. If we have done a periodic reclaim on a space_info and nothing
> has changed in that space_info, there is not much point in trying again,
> so don't until enough space is freed, which we capture with a
> heuristic of needing to net-free one chunk.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
> fs/btrfs/block-group.c | 12 ++++++---
> fs/btrfs/space-info.c | 56 ++++++++++++++++++++++++++++++++++++------
> fs/btrfs/space-info.h | 14 +++++++++++
> 3 files changed, 71 insertions(+), 11 deletions(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 6bcf24f2ac79..ba9afb94e7ce 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1933,6 +1933,8 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> reclaimed = 0;
> spin_lock(&space_info->lock);
> space_info->reclaim_errors++;
> + if (READ_ONCE(space_info->periodic_reclaim))
> + space_info->periodic_reclaim_ready = false;
> spin_unlock(&space_info->lock);
> }
> spin_lock(&space_info->lock);
> @@ -1941,7 +1943,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> spin_unlock(&space_info->lock);
>
> next:
> - if (ret) {
> + if (ret && !READ_ONCE(space_info->periodic_reclaim)) {
> /* Refcount held by the reclaim_bgs list after splice. */
> btrfs_get_block_group(bg);
> list_add_tail(&bg->bg_list, &retry_list);
> @@ -3677,6 +3679,8 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> space_info->bytes_reserved -= num_bytes;
> space_info->bytes_used += num_bytes;
> space_info->disk_used += num_bytes * factor;
> + if (READ_ONCE(space_info->periodic_reclaim))
> + btrfs_space_info_update_reclaimable(space_info, -num_bytes);
> spin_unlock(&cache->lock);
> spin_unlock(&space_info->lock);
> } else {
> @@ -3686,8 +3690,10 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> btrfs_space_info_update_bytes_pinned(info, space_info, num_bytes);
> space_info->bytes_used -= num_bytes;
> space_info->disk_used -= num_bytes * factor;
> -
> - reclaim = should_reclaim_block_group(cache, num_bytes);
> + if (READ_ONCE(space_info->periodic_reclaim))
> + btrfs_space_info_update_reclaimable(space_info, num_bytes);
> + else
> + reclaim = should_reclaim_block_group(cache, num_bytes);
>
> spin_unlock(&cache->lock);
> spin_unlock(&space_info->lock);
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index ff92ad26ffa2..e7a2aa751f8f 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -1,5 +1,6 @@
> // SPDX-License-Identifier: GPL-2.0
>
> +#include "linux/spinlock.h"
> #include <linux/minmax.h>
> #include "misc.h"
> #include "ctree.h"
> @@ -1899,7 +1900,9 @@ static u64 calc_pct_ratio(u64 x, u64 y)
> */
> static u64 calc_unalloc_target(struct btrfs_fs_info *fs_info)
> {
> - return BTRFS_UNALLOC_BLOCK_GROUP_TARGET * calc_effective_data_chunk_size(fs_info);
> + u64 chunk_sz = calc_effective_data_chunk_size(fs_info);
> +
> + return BTRFS_UNALLOC_BLOCK_GROUP_TARGET * chunk_sz;
> }
>
> /*
> @@ -1935,14 +1938,13 @@ static int calc_dynamic_reclaim_threshold(struct btrfs_space_info *space_info)
> u64 unused = alloc - used;
> u64 want = target > unalloc ? target - unalloc : 0;
> u64 data_chunk_size = calc_effective_data_chunk_size(fs_info);
> - /* Cast to int is OK because want <= target */
> - int ratio = calc_pct_ratio(want, target);
>
> - /* If we have no unused space, don't bother, it won't work anyway */
> + /* If we have no unused space, don't bother, it won't work anyway. */
> if (unused < data_chunk_size)
> return 0;
>
> - return ratio;
> + /* Cast to int is OK because want <= target. */
> + return calc_pct_ratio(want, target);
> }
>
> int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info)
> @@ -1984,6 +1986,46 @@ static int do_reclaim_sweep(struct btrfs_fs_info *fs_info,
> return 0;
> }
>
> +void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes)
> +{
> + u64 chunk_sz = calc_effective_data_chunk_size(space_info->fs_info);
> +
> + assert_spin_locked(&space_info->lock);
> + space_info->reclaimable_bytes += bytes;
> +
> + if (space_info->reclaimable_bytes >= chunk_sz)
> + btrfs_set_periodic_reclaim_ready(space_info, true);
> +}
> +
> +void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready)
> +{
> + assert_spin_locked(&space_info->lock);
This is essentially
BUG_ON(!locked(spin_lock));
instead use
lockdep_assert_held()
which will just yell at us so we can fix it. Thanks,
Josef
* Re: [PATCH v2 0/6] btrfs: dynamic and periodic block_group reclaim
2024-06-17 23:11 [PATCH v2 0/6] btrfs: dynamic and periodic block_group reclaim Boris Burkov
` (5 preceding siblings ...)
2024-06-17 23:11 ` [PATCH v2 6/6] btrfs: urgent periodic reclaim pass Boris Burkov
@ 2024-06-24 15:25 ` Josef Bacik
6 siblings, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2024-06-24 15:25 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs, kernel-team
On Mon, Jun 17, 2024 at 04:11:12PM -0700, Boris Burkov wrote:
> Btrfs's block_group allocator suffers from a well known problem, that
> it is capable of eagerly allocating too much space to either data or
> metadata (most often data, absent bugs) and then later be unable to
> allocate more space for the other, when needed. When data starves
> metadata, this can extra painfully result in read only filesystems that
> need careful manual balancing to fix.
>
> This can be worked around by:
> - enabling automatic reclaim
> - periodically running balance
>
> The latter is widely deployed via btrfsmaintenance
> (https://github.com/kdave/btrfsmaintenance) and the former is used at
> scale at Meta with good results. However, neither of those solutions is
> perfect, as they both currently use a fixed threshold. A fixed threshold
> is vulnerable to workloads that trigger high amounts of reclaim. This
> has led to btrfsmaintenance setting very conservative thresholds of 5
> and 10 percent of data block groups.
> (https://github.com/kdave/btrfsmaintenance/commit/edbbfffe592f47c2849a8825f523e2ccc38b15f5)
> At Meta, we deal with an elevated level of reclaim which would be
> desirable to reduce.
>
> This patch set expands on automatic reclaim, adding the ability to set a
> dynamic reclaim threshold that appropriately scales with the global file
> system allocation conditions as well as periodic reclaim which runs that
> reclaim sweep in the cleaner thread. Together, I believe they constitute
> a robust and general automatic reclaim system that should avoid
> unfortunate read only filesystems in all but extreme conditions, where
> space is running quite low anyway and failure is more reasonable.
>
> At a very high level, the dynamic threshold's strategy is to set a fixed
> target of unallocated block groups (10 block groups) and linearly scale
> its aggression the further we are from that target. That way we do no
> automatic reclaim until we actually press against the unallocated
> target, allowing the allocator to gradually fill fragmented space with
> new extents, but do claw back space after workloads that use and free a
> bunch of space, perhaps with fragmentation.
>
> I ran it on three workloads (described in detail on the dynamic reclaim
> patch) but they are:
> 1. bounce allocations around X% full.
> 2. fill up all the way and introduce full fragmentation.
> 3. write in a fragmented way until the filesystem is just about full.
> script can be found here:
> https://github.com/boryas/scripts/tree/main/fio/reclaim
>
> The important results can be seen here (full results explorable at
> https://bur.io/dyn-rec/)
>
> bounce at 30%, higher relocations with a fixed threshold:
> https://bur.io/dyn-rec/bounce/reclaims.png
> https://bur.io/dyn-rec/bounce/reclaim_bytes.png
> https://bur.io/dyn-rec/bounce/unalloc_bytes.png
>
> hard 30% fragmentation, dynamic actually reclaims, relocs not crazy:
> https://bur.io/dyn-rec/strict_frag/reclaims.png
> https://bur.io/dyn-rec/strict_frag/reclaim_bytes.png
> https://bur.io/dyn-rec/strict_frag/unalloc_bytes.png
>
> fill it all the way up in a fragmented way, then keep making
> allocations:
> https://bur.io/dyn-rec/last_gig/reclaims.png
> https://bur.io/dyn-rec/last_gig/reclaim_bytes.png
> https://bur.io/dyn-rec/last_gig/unalloc_bytes.png
These results are great, once you fix up the one comment I had you can add
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
to the whole series. Thanks,
Josef
* Re: [PATCH v2 5/6] btrfs: prevent pathological periodic reclaim loops
2024-06-24 15:23 ` Josef Bacik
@ 2024-06-24 16:05 ` David Sterba
0 siblings, 0 replies; 13+ messages in thread
From: David Sterba @ 2024-06-24 16:05 UTC (permalink / raw)
To: Josef Bacik; +Cc: Boris Burkov, linux-btrfs, kernel-team
On Mon, Jun 24, 2024 at 11:23:00AM -0400, Josef Bacik wrote:
> On Mon, Jun 17, 2024 at 04:11:17PM -0700, Boris Burkov wrote:
> > Periodic reclaim runs the risk of getting stuck in a state where it
> > keeps reclaiming the same block group over and over. This can happen if
> > 1. reclaiming that block_group fails
> > 2. reclaiming that block_group fails to move any extents into existing
> > block_groups and just allocates a fresh chunk and moves everything.
> >
> > Currently, 1. is a very tight loop inside the reclaim worker. That is
> > critical for edge triggered reclaim or else we risk forgetting about a
> > reclaimable group. On the other hand, with level triggered reclaim we
> > can break out of that loop and get it later.
> >
> > With that fixed, 2. applies to both failures and "successes" with no
> > progress. If we have done a periodic reclaim on a space_info and nothing
> > has changed in that space_info, there is not much point in trying again,
> > so don't until enough space is freed, which we capture with a
> > heuristic of needing to net-free one chunk.
> >
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> > fs/btrfs/block-group.c | 12 ++++++---
> > fs/btrfs/space-info.c | 56 ++++++++++++++++++++++++++++++++++++------
> > fs/btrfs/space-info.h | 14 +++++++++++
> > 3 files changed, 71 insertions(+), 11 deletions(-)
> >
> > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > index 6bcf24f2ac79..ba9afb94e7ce 100644
> > --- a/fs/btrfs/block-group.c
> > +++ b/fs/btrfs/block-group.c
> > @@ -1933,6 +1933,8 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> > reclaimed = 0;
> > spin_lock(&space_info->lock);
> > space_info->reclaim_errors++;
> > + if (READ_ONCE(space_info->periodic_reclaim))
> > + space_info->periodic_reclaim_ready = false;
> > spin_unlock(&space_info->lock);
> > }
> > spin_lock(&space_info->lock);
> > @@ -1941,7 +1943,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> > spin_unlock(&space_info->lock);
> >
> > next:
> > - if (ret) {
> > + if (ret && !READ_ONCE(space_info->periodic_reclaim)) {
> > /* Refcount held by the reclaim_bgs list after splice. */
> > btrfs_get_block_group(bg);
> > list_add_tail(&bg->bg_list, &retry_list);
> > @@ -3677,6 +3679,8 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> > space_info->bytes_reserved -= num_bytes;
> > space_info->bytes_used += num_bytes;
> > space_info->disk_used += num_bytes * factor;
> > + if (READ_ONCE(space_info->periodic_reclaim))
> > + btrfs_space_info_update_reclaimable(space_info, -num_bytes);
> > spin_unlock(&cache->lock);
> > spin_unlock(&space_info->lock);
> > } else {
> > @@ -3686,8 +3690,10 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> > btrfs_space_info_update_bytes_pinned(info, space_info, num_bytes);
> > space_info->bytes_used -= num_bytes;
> > space_info->disk_used -= num_bytes * factor;
> > -
> > - reclaim = should_reclaim_block_group(cache, num_bytes);
> > + if (READ_ONCE(space_info->periodic_reclaim))
> > + btrfs_space_info_update_reclaimable(space_info, num_bytes);
> > + else
> > + reclaim = should_reclaim_block_group(cache, num_bytes);
> >
> > spin_unlock(&cache->lock);
> > spin_unlock(&space_info->lock);
> > diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> > index ff92ad26ffa2..e7a2aa751f8f 100644
> > --- a/fs/btrfs/space-info.c
> > +++ b/fs/btrfs/space-info.c
> > @@ -1,5 +1,6 @@
> > // SPDX-License-Identifier: GPL-2.0
> >
> > +#include "linux/spinlock.h"
> > #include <linux/minmax.h>
> > #include "misc.h"
> > #include "ctree.h"
> > @@ -1899,7 +1900,9 @@ static u64 calc_pct_ratio(u64 x, u64 y)
> > */
> > static u64 calc_unalloc_target(struct btrfs_fs_info *fs_info)
> > {
> > - return BTRFS_UNALLOC_BLOCK_GROUP_TARGET * calc_effective_data_chunk_size(fs_info);
> > + u64 chunk_sz = calc_effective_data_chunk_size(fs_info);
> > +
> > + return BTRFS_UNALLOC_BLOCK_GROUP_TARGET * chunk_sz;
> > }
> >
> > /*
> > @@ -1935,14 +1938,13 @@ static int calc_dynamic_reclaim_threshold(struct btrfs_space_info *space_info)
> > u64 unused = alloc - used;
> > u64 want = target > unalloc ? target - unalloc : 0;
> > u64 data_chunk_size = calc_effective_data_chunk_size(fs_info);
> > - /* Cast to int is OK because want <= target */
> > - int ratio = calc_pct_ratio(want, target);
> >
> > - /* If we have no unused space, don't bother, it won't work anyway */
> > + /* If we have no unused space, don't bother, it won't work anyway. */
> > if (unused < data_chunk_size)
> > return 0;
> >
> > - return ratio;
> > + /* Cast to int is OK because want <= target. */
> > + return calc_pct_ratio(want, target);
> > }
> >
> > int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info)
> > @@ -1984,6 +1986,46 @@ static int do_reclaim_sweep(struct btrfs_fs_info *fs_info,
> > return 0;
> > }
> >
> > +void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes)
> > +{
> > + u64 chunk_sz = calc_effective_data_chunk_size(space_info->fs_info);
> > +
> > + assert_spin_locked(&space_info->lock);
> > + space_info->reclaimable_bytes += bytes;
> > +
> > + if (space_info->reclaimable_bytes >= chunk_sz)
> > + btrfs_set_periodic_reclaim_ready(space_info, true);
> > +}
> > +
> > +void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready)
> > +{
> > + assert_spin_locked(&space_info->lock);
>
> This is essentially
>
> BUG_ON(!locked(spin_lock));
>
> instead use
>
> lockdep_assert_held()
>
> which will just yell at us so we can fix it. Thanks,
Also documented
https://btrfs.readthedocs.io/en/latest/dev/Development-notes.html#locking
* Re: [PATCH v2 3/6] btrfs: dynamic block_group reclaim threshold
2024-06-17 23:11 ` [PATCH v2 3/6] btrfs: dynamic block_group reclaim threshold Boris Burkov
@ 2024-06-25 13:40 ` Naohiro Aota
0 siblings, 0 replies; 13+ messages in thread
From: Naohiro Aota @ 2024-06-25 13:40 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
On Mon, Jun 17, 2024 at 04:11:15PM GMT, Boris Burkov wrote:
> We can currently recover allocated block_groups by:
> - explicitly starting balance operations
> - "auto reclaim" via bg_reclaim_threshold
>
> The latter works by checking against a fixed threshold on frees. If we
> pass from above the threshold to below, relocation triggers and the
> block group will get reclaimed by the cleaner thread (assuming it is
> still eligible)
>
> Picking a threshold is challenging. Too high, and you end up trying to
> reclaim very full block_groups which is quite costly, and you don't do
> reclaim on block_groups that don't get quite THAT full, but could still
> be quite fragmented and stranding a lot of space. Too low, and you
> similarly miss out on reclaim even if you badly need it to avoid running
> out of unallocated space, if you have heavily fragmented block groups
> living above the threshold.
>
> No matter the threshold, it suffers from a workload that happens to
> bounce around that threshold, which can introduce arbitrary amounts of
> reclaim waste.
>
> To improve this situation, introduce a dynamic threshold. The basic idea
> behind this threshold is that it should be very lax when there is plenty
> of unallocated space, and increasingly aggressive as we approach zero
> unallocated space. To that end, it sets a target for unallocated space
> (10 chunks) and then linearly increases the threshold as the shortfall
> from that target grows. The formula is:
> (target - unalloc) / target
>
> I tested this by running it on three interesting workloads:
> 1. bounce allocations around X% full.
> 2. fill up all the way and introduce full fragmentation.
> 3. write in a fragmented way until the filesystem is just about full.
>
> 1. and 2. attack the weaknesses of a fixed threshold; fixed either works
> perfectly or fully falls apart, depending on the threshold. Dynamic
> always handles these cases well.
>
> 3. attacks dynamic by checking whether it is too zealous to reclaim in
> conditions with low unallocated and low unused. It tends to claw back
> 1GiB of unallocated fairly aggressively, but not much more. Early
> versions of dynamic threshold struggled on this test.
>
> Additional work could be done to intelligently ratchet up the urgency of
> reclaim in very low unallocated conditions. Existing mechanisms are
> already useless in that case anyway.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
> fs/btrfs/block-group.c | 18 ++++---
> fs/btrfs/space-info.c | 115 +++++++++++++++++++++++++++++++++++++----
> fs/btrfs/space-info.h | 8 +++
> fs/btrfs/sysfs.c | 43 ++++++++++++++-
> 4 files changed, 164 insertions(+), 20 deletions(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 824fd229d129..c3313697475f 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1764,24 +1764,21 @@ static inline bool btrfs_should_reclaim(struct btrfs_fs_info *fs_info)
>
> static bool should_reclaim_block_group(struct btrfs_block_group *bg, u64 bytes_freed)
> {
> - const struct btrfs_space_info *space_info = bg->space_info;
> - const int reclaim_thresh = READ_ONCE(space_info->bg_reclaim_threshold);
> + const int thresh_pct = btrfs_calc_reclaim_threshold(bg->space_info);
> + u64 thresh_bytes = mult_perc(bg->length, thresh_pct);
> const u64 new_val = bg->used;
> const u64 old_val = new_val + bytes_freed;
> - u64 thresh;
>
> - if (reclaim_thresh == 0)
> + if (thresh_bytes == 0)
> return false;
>
> - thresh = mult_perc(bg->length, reclaim_thresh);
> -
> /*
> * If we were below the threshold before don't reclaim, we are likely a
> * brand new block group and we don't want to relocate new block groups.
> */
> - if (old_val < thresh)
> + if (old_val < thresh_bytes)
> return false;
> - if (new_val >= thresh)
> + if (new_val >= thresh_bytes)
> return false;
> return true;
> }
> @@ -1843,6 +1840,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> /* Don't race with allocators so take the groups_sem */
> down_write(&space_info->groups_sem);
>
> + spin_lock(&space_info->lock);
> spin_lock(&bg->lock);
> if (bg->reserved || bg->pinned || bg->ro) {
> /*
> @@ -1852,6 +1850,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> * this block group.
> */
> spin_unlock(&bg->lock);
> + spin_unlock(&space_info->lock);
> up_write(&space_info->groups_sem);
> goto next;
> }
> @@ -1870,6 +1869,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> if (!btrfs_test_opt(fs_info, DISCARD_ASYNC))
> btrfs_mark_bg_unused(bg);
> spin_unlock(&bg->lock);
> + spin_unlock(&space_info->lock);
> up_write(&space_info->groups_sem);
> goto next;
>
> @@ -1886,10 +1886,12 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> */
> if (!should_reclaim_block_group(bg, bg->length)) {
> spin_unlock(&bg->lock);
> + spin_unlock(&space_info->lock);
> up_write(&space_info->groups_sem);
> goto next;
> }
> spin_unlock(&bg->lock);
> + spin_unlock(&space_info->lock);
>
> /*
> * Get out fast, in case we're read-only or unmounting the
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 7384286c5058..0d13282dac05 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -1,5 +1,6 @@
> // SPDX-License-Identifier: GPL-2.0
>
> +#include <linux/minmax.h>
> #include "misc.h"
> #include "ctree.h"
> #include "space-info.h"
> @@ -190,6 +191,8 @@ void btrfs_clear_space_info_full(struct btrfs_fs_info *info)
> */
> #define BTRFS_DEFAULT_ZONED_RECLAIM_THRESH (75)
>
> +#define BTRFS_UNALLOC_BLOCK_GROUP_TARGET (10ULL)
> +
> /*
> * Calculate chunk size depending on volume type (regular or zoned).
> */
> @@ -341,11 +344,27 @@ struct btrfs_space_info *btrfs_find_space_info(struct btrfs_fs_info *info,
> return NULL;
> }
>
> +static u64 calc_effective_data_chunk_size(struct btrfs_fs_info *fs_info)
> +{
> + struct btrfs_space_info *data_sinfo;
> + u64 data_chunk_size;
> + /*
> + * Calculate the data_chunk_size, space_info->chunk_size is the
> + * "optimal" chunk size based on the fs size. However when we actually
> + * allocate the chunk we will strip this down further, making it no more
> + * than 10% of the disk or 1G, whichever is smaller.
> + */
> + data_sinfo = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_DATA);
> + data_chunk_size = min(data_sinfo->chunk_size,
> + mult_perc(fs_info->fs_devices->total_rw_bytes, 10));
> + return min_t(u64, data_chunk_size, SZ_1G);
> +
> +}
I know this is copied from the previous code, but the logic is wrong in
zoned mode: we always use data_sinfo->chunk_size, which is the zone_size.
I was working on a fix for the old code, and a patch is almost ready. Since
the fix should be backported, it would be easier if I could put my fix patch
before this series in the for-next branch.
* Re: [PATCH v2 5/6] btrfs: prevent pathological periodic reclaim loops
2024-06-17 23:11 ` [PATCH v2 5/6] btrfs: prevent pathological periodic reclaim loops Boris Burkov
2024-06-24 15:23 ` Josef Bacik
@ 2025-12-26 4:18 ` Sun Yangkai
2025-12-29 23:54 ` Boris Burkov
1 sibling, 1 reply; 13+ messages in thread
From: Sun Yangkai @ 2025-12-26 4:18 UTC (permalink / raw)
To: boris; +Cc: kernel-team, linux-btrfs
> Periodic reclaim runs the risk of getting stuck in a state where it
> keeps reclaiming the same block group over and over. This can happen if
> 1. reclaiming that block_group fails
> 2. reclaiming that block_group fails to move any extents into existing
> block_groups and just allocates a fresh chunk and moves everything.
>
> Currently, 1. is a very tight loop inside the reclaim worker. That is
> critical for edge triggered reclaim or else we risk forgetting about a
> reclaimable group. On the other hand, with level triggered reclaim we
> can break out of that loop and get it later.
>
> With that fixed, 2. applies to both failures and "successes" with no
> progress. If we have done a periodic reclaim on a space_info and nothing
> has changed in that space_info, there is not much point to trying again,
> so don't, until enough space gets free, which we capture with a
> heuristic of needing to net free 1 chunk.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
> fs/btrfs/block-group.c | 12 ++++++---
> fs/btrfs/space-info.c | 56 ++++++++++++++++++++++++++++++++++++------
> fs/btrfs/space-info.h | 14 +++++++++++
> 3 files changed, 71 insertions(+), 11 deletions(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 6bcf24f2ac79..ba9afb94e7ce 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1933,6 +1933,8 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> reclaimed = 0;
> spin_lock(&space_info->lock);
> space_info->reclaim_errors++;
> + if (READ_ONCE(space_info->periodic_reclaim))
> + space_info->periodic_reclaim_ready = false;
I wonder why we're not clearing the reclaimable_bytes count here.
> spin_unlock(&space_info->lock);
> }
> spin_lock(&space_info->lock);
> @@ -1941,7 +1943,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> spin_unlock(&space_info->lock);
>
> next:
> - if (ret) {
> + if (ret && !READ_ONCE(space_info->periodic_reclaim)) {
> /* Refcount held by the reclaim_bgs list after splice. */
> btrfs_get_block_group(bg);
> list_add_tail(&bg->bg_list, &retry_list);
> @@ -3677,6 +3679,8 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> space_info->bytes_reserved -= num_bytes;
> space_info->bytes_used += num_bytes;
> space_info->disk_used += num_bytes * factor;
> + if (READ_ONCE(space_info->periodic_reclaim))
> + btrfs_space_info_update_reclaimable(space_info, -num_bytes);
> spin_unlock(&cache->lock);
> spin_unlock(&space_info->lock);
> } else {
> @@ -3686,8 +3690,10 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> btrfs_space_info_update_bytes_pinned(info, space_info, num_bytes);
> space_info->bytes_used -= num_bytes;
> space_info->disk_used -= num_bytes * factor;
> -
> - reclaim = should_reclaim_block_group(cache, num_bytes);
> + if (READ_ONCE(space_info->periodic_reclaim))
> + btrfs_space_info_update_reclaimable(space_info, num_bytes);
> + else
> + reclaim = should_reclaim_block_group(cache, num_bytes);
>
> spin_unlock(&cache->lock);
> spin_unlock(&space_info->lock);
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index ff92ad26ffa2..e7a2aa751f8f 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -1,5 +1,6 @@
> // SPDX-License-Identifier: GPL-2.0
>
> +#include "linux/spinlock.h"
> #include <linux/minmax.h>
> #include "misc.h"
> #include "ctree.h"
> @@ -1899,7 +1900,9 @@ static u64 calc_pct_ratio(u64 x, u64 y)
> */
> static u64 calc_unalloc_target(struct btrfs_fs_info *fs_info)
> {
> - return BTRFS_UNALLOC_BLOCK_GROUP_TARGET * calc_effective_data_chunk_size(fs_info);
> + u64 chunk_sz = calc_effective_data_chunk_size(fs_info);
> +
> + return BTRFS_UNALLOC_BLOCK_GROUP_TARGET * chunk_sz;
> }
>
> /*
> @@ -1935,14 +1938,13 @@ static int calc_dynamic_reclaim_threshold(struct btrfs_space_info *space_info)
> u64 unused = alloc - used;
> u64 want = target > unalloc ? target - unalloc : 0;
> u64 data_chunk_size = calc_effective_data_chunk_size(fs_info);
> - /* Cast to int is OK because want <= target */
> - int ratio = calc_pct_ratio(want, target);
>
> - /* If we have no unused space, don't bother, it won't work anyway */
> + /* If we have no unused space, don't bother, it won't work anyway. */
> if (unused < data_chunk_size)
> return 0;
>
> - return ratio;
> + /* Cast to int is OK because want <= target. */
> + return calc_pct_ratio(want, target);
> }
>
> int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info)
> @@ -1984,6 +1986,46 @@ static int do_reclaim_sweep(struct btrfs_fs_info *fs_info,
> return 0;
> }
>
> +void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes)
> +{
> + u64 chunk_sz = calc_effective_data_chunk_size(space_info->fs_info);
> +
> + assert_spin_locked(&space_info->lock);
> + space_info->reclaimable_bytes += bytes;
> +
> + if (space_info->reclaimable_bytes >= chunk_sz)
We're comparing an s64 with a u64 here, so it won't work as expected.
Even after fixing this by changing chunk_sz to s64, it will not work as
expected in the following case:
- We're filling the disk, so reclaimable_bytes is always negative.
- There's less than 10G unallocated, so dynamic reclaim kicked in.
- periodic_reclaim will never run since reclaimable_bytes is always negative.
> + btrfs_set_periodic_reclaim_ready(space_info, true);
> +}
> +
> +void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready)
> +{
> + assert_spin_locked(&space_info->lock);
> + if (!READ_ONCE(space_info->periodic_reclaim))
> + return;
> + if (ready != space_info->periodic_reclaim_ready) {
> + space_info->periodic_reclaim_ready = ready;
> + if (!ready)
> + space_info->reclaimable_bytes = 0;
> + }
> +}
> +
> +bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info)
> +{
> + bool ret;
> +
> + if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
> + return false;
> + if (!READ_ONCE(space_info->periodic_reclaim))
> + return false;
> +
> + spin_lock(&space_info->lock);
> + ret = space_info->periodic_reclaim_ready;
> + btrfs_set_periodic_reclaim_ready(space_info, false);
> + spin_unlock(&space_info->lock);
> +
> + return ret;
> +}
> +
> int btrfs_reclaim_sweep(struct btrfs_fs_info *fs_info)
> {
> int ret;
> @@ -1991,9 +2033,7 @@ int btrfs_reclaim_sweep(struct btrfs_fs_info *fs_info)
> struct btrfs_space_info *space_info;
>
> list_for_each_entry(space_info, &fs_info->space_info, list) {
> - if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
> - continue;
> - if (!READ_ONCE(space_info->periodic_reclaim))
> + if (!btrfs_should_periodic_reclaim(space_info))
> continue;
> for (raid = 0; raid < BTRFS_NR_RAID_TYPES; raid++) {
> ret = do_reclaim_sweep(fs_info, space_info, raid);
> diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
> index ae4a1f7d5856..4db8a0267c16 100644
> --- a/fs/btrfs/space-info.h
> +++ b/fs/btrfs/space-info.h
> @@ -196,6 +196,17 @@ struct btrfs_space_info {
> * threshold in the cleaner thread.
> */
> bool periodic_reclaim;
> +
> + /*
> + * Periodic reclaim should be a no-op if a space_info hasn't
> + * freed any space since the last time we tried.
> + */
> + bool periodic_reclaim_ready;
Also, I wonder why we need this bool flag. I think we care more about
whether reclaimable_bytes is more than 1G at the moment
btrfs_should_periodic_reclaim() is called, rather than whether it has
exceeded 1G at some point between two calls of
btrfs_should_periodic_reclaim().
Thanks,
Sun YangKai
> +
> + /*
> + * Net bytes freed or allocated since the last reclaim pass.
> + */
> + s64 reclaimable_bytes;
> };
>
> struct reserve_ticket {
> @@ -278,6 +289,9 @@ void btrfs_dump_space_info_for_trans_abort(struct btrfs_fs_info *fs_info);
> void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info);
> u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
>
> +void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes);
> +void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready);
> +bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info);
> int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info);
> int btrfs_reclaim_sweep(struct btrfs_fs_info *fs_info);
>
> --
> 2.45.2
* Re: [PATCH v2 5/6] btrfs: prevent pathological periodic reclaim loops
2025-12-26 4:18 ` Sun Yangkai
@ 2025-12-29 23:54 ` Boris Burkov
0 siblings, 0 replies; 13+ messages in thread
From: Boris Burkov @ 2025-12-29 23:54 UTC (permalink / raw)
To: Sun Yangkai; +Cc: kernel-team, linux-btrfs
On Fri, Dec 26, 2025 at 12:18:51PM +0800, Sun Yangkai wrote:
> > Periodic reclaim runs the risk of getting stuck in a state where it
> > keeps reclaiming the same block group over and over. This can happen if
> > 1. reclaiming that block_group fails
> > 2. reclaiming that block_group fails to move any extents into existing
> > block_groups and just allocates a fresh chunk and moves everything.
> >
> > Currently, 1. is a very tight loop inside the reclaim worker. That is
> > critical for edge triggered reclaim or else we risk forgetting about a
> > reclaimable group. On the other hand, with level triggered reclaim we
> > can break out of that loop and get it later.
> >
> > With that fixed, 2. applies to both failures and "successes" with no
> > progress. If we have done a periodic reclaim on a space_info and nothing
> > has changed in that space_info, there is not much point to trying again,
> > so don't, until enough space gets free, which we capture with a
> > heuristic of needing to net free 1 chunk.
> >
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> > fs/btrfs/block-group.c | 12 ++++++---
> > fs/btrfs/space-info.c | 56 ++++++++++++++++++++++++++++++++++++------
> > fs/btrfs/space-info.h | 14 +++++++++++
> > 3 files changed, 71 insertions(+), 11 deletions(-)
> >
> > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > index 6bcf24f2ac79..ba9afb94e7ce 100644
> > --- a/fs/btrfs/block-group.c
> > +++ b/fs/btrfs/block-group.c
> > @@ -1933,6 +1933,8 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> > reclaimed = 0;
> > spin_lock(&space_info->lock);
> > space_info->reclaim_errors++;
> > + if (READ_ONCE(space_info->periodic_reclaim))
> > + space_info->periodic_reclaim_ready = false;
>
> I wonder why we're not clearing the reclaimable_bytes count here.
>
As far as I can tell, it's an oversight. I think it ought to use
btrfs_set_periodic_reclaim_ready(space_info, false) here. However, FWIW,
reclaimable_bytes already got reset when we checked whether reclaim was
ready, so this would just redundantly clear it again. It honestly might
make the most sense to get rid of this logic here entirely, as it's
pretty redundant with setting ready=false at the start of each
invocation of periodic reclaim.
> > spin_unlock(&space_info->lock);
> > }
> > spin_lock(&space_info->lock);
> > @@ -1941,7 +1943,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> > spin_unlock(&space_info->lock);
> >
> > next:
> > - if (ret) {
> > + if (ret && !READ_ONCE(space_info->periodic_reclaim)) {
> > /* Refcount held by the reclaim_bgs list after splice. */
> > btrfs_get_block_group(bg);
> > list_add_tail(&bg->bg_list, &retry_list);
> > @@ -3677,6 +3679,8 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> > space_info->bytes_reserved -= num_bytes;
> > space_info->bytes_used += num_bytes;
> > space_info->disk_used += num_bytes * factor;
> > + if (READ_ONCE(space_info->periodic_reclaim))
> > + btrfs_space_info_update_reclaimable(space_info, -num_bytes);
> > spin_unlock(&cache->lock);
> > spin_unlock(&space_info->lock);
> > } else {
> > @@ -3686,8 +3690,10 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> > btrfs_space_info_update_bytes_pinned(info, space_info, num_bytes);
> > space_info->bytes_used -= num_bytes;
> > space_info->disk_used -= num_bytes * factor;
> > -
> > - reclaim = should_reclaim_block_group(cache, num_bytes);
> > + if (READ_ONCE(space_info->periodic_reclaim))
> > + btrfs_space_info_update_reclaimable(space_info, num_bytes);
> > + else
> > + reclaim = should_reclaim_block_group(cache, num_bytes);
> >
> > spin_unlock(&cache->lock);
> > spin_unlock(&space_info->lock);
> > diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> > index ff92ad26ffa2..e7a2aa751f8f 100644
> > --- a/fs/btrfs/space-info.c
> > +++ b/fs/btrfs/space-info.c
> > @@ -1,5 +1,6 @@
> > // SPDX-License-Identifier: GPL-2.0
> >
> > +#include "linux/spinlock.h"
> > #include <linux/minmax.h>
> > #include "misc.h"
> > #include "ctree.h"
> > @@ -1899,7 +1900,9 @@ static u64 calc_pct_ratio(u64 x, u64 y)
> > */
> > static u64 calc_unalloc_target(struct btrfs_fs_info *fs_info)
> > {
> > - return BTRFS_UNALLOC_BLOCK_GROUP_TARGET * calc_effective_data_chunk_size(fs_info);
> > + u64 chunk_sz = calc_effective_data_chunk_size(fs_info);
> > +
> > + return BTRFS_UNALLOC_BLOCK_GROUP_TARGET * chunk_sz;
> > }
> >
> > /*
> > @@ -1935,14 +1938,13 @@ static int calc_dynamic_reclaim_threshold(struct btrfs_space_info *space_info)
> > u64 unused = alloc - used;
> > u64 want = target > unalloc ? target - unalloc : 0;
> > u64 data_chunk_size = calc_effective_data_chunk_size(fs_info);
> > - /* Cast to int is OK because want <= target */
> > - int ratio = calc_pct_ratio(want, target);
> >
> > - /* If we have no unused space, don't bother, it won't work anyway */
> > + /* If we have no unused space, don't bother, it won't work anyway. */
> > if (unused < data_chunk_size)
> > return 0;
> >
> > - return ratio;
> > + /* Cast to int is OK because want <= target. */
> > + return calc_pct_ratio(want, target);
> > }
> >
> > int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info)
> > @@ -1984,6 +1986,46 @@ static int do_reclaim_sweep(struct btrfs_fs_info *fs_info,
> > return 0;
> > }
> >
> > +void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes)
> > +{
> > + u64 chunk_sz = calc_effective_data_chunk_size(space_info->fs_info);
> > +
> > + assert_spin_locked(&space_info->lock);
> > + space_info->reclaimable_bytes += bytes;
> > +
> > + if (space_info->reclaimable_bytes >= chunk_sz)
>
> We're comparing s64 with u64 here, and it won't work as expected.
Good catch. Since we could do a bunch of allocation and drive
reclaimable_bytes negative, we need to check for a negative
reclaimable_bytes first: the comparison coerces the s64 into a giant
u64, which is actually quite possible in practice.
>
> Even after fixing this by changing chunk_sz to s64, it will not work as expected
> in the following case:
>
> - We're filling the disk, so reclaimable_bytes is always negative.
> - There's less than 10G unallocated, so dynamic reclaim kicked in.
> - periodic_reclaim will never run since reclaimable_bytes is always negative.
>
Yes, I agree that this case will thrash the reclaim if we keep
allocating constantly (without going ENOSPC).
Good catch! Hopefully that's what is happening in the report from your
other email.
> > + btrfs_set_periodic_reclaim_ready(space_info, true);
> > +}
> > +
> > +void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready)
> > +{
> > + assert_spin_locked(&space_info->lock);
> > + if (!READ_ONCE(space_info->periodic_reclaim))
> > + return;
> > + if (ready != space_info->periodic_reclaim_ready) {
> > + space_info->periodic_reclaim_ready = ready;
> > + if (!ready)
> > + space_info->reclaimable_bytes = 0;
> > + }
> > +}
> > +
> > +bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info)
> > +{
> > + bool ret;
> > +
> > + if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
> > + return false;
> > + if (!READ_ONCE(space_info->periodic_reclaim))
> > + return false;
> > +
> > + spin_lock(&space_info->lock);
> > + ret = space_info->periodic_reclaim_ready;
> > + btrfs_set_periodic_reclaim_ready(space_info, false);
> > + spin_unlock(&space_info->lock);
> > +
> > + return ret;
> > +}
> > +
> > int btrfs_reclaim_sweep(struct btrfs_fs_info *fs_info)
> > {
> > int ret;
> > @@ -1991,9 +2033,7 @@ int btrfs_reclaim_sweep(struct btrfs_fs_info *fs_info)
> > struct btrfs_space_info *space_info;
> >
> > list_for_each_entry(space_info, &fs_info->space_info, list) {
> > - if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
> > - continue;
> > - if (!READ_ONCE(space_info->periodic_reclaim))
> > + if (!btrfs_should_periodic_reclaim(space_info))
> > continue;
> > for (raid = 0; raid < BTRFS_NR_RAID_TYPES; raid++) {
> > ret = do_reclaim_sweep(fs_info, space_info, raid);
> > diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
> > index ae4a1f7d5856..4db8a0267c16 100644
> > --- a/fs/btrfs/space-info.h
> > +++ b/fs/btrfs/space-info.h
> > @@ -196,6 +196,17 @@ struct btrfs_space_info {
> > * threshold in the cleaner thread.
> > */
> > bool periodic_reclaim;
> > +
> > + /*
> > + * Periodic reclaim should be a no-op if a space_info hasn't
> > + * freed any space since the last time we tried.
> > + */
> > + bool periodic_reclaim_ready;
>
> Also, I wonder why we need this bool flag. I think we care more about
> whether reclaimable_bytes is more than 1G at the moment
> btrfs_should_periodic_reclaim() is called, rather than whether it has
> exceeded 1G at some point between two calls of
> btrfs_should_periodic_reclaim().
This is probably an accident of how the logic evolved as I iterated on
the design. The first version was to set ready to false on a failure,
then set it to true on a deallocation; later I "enhanced" it with the
reclaimable_bytes logic to make it more conservative.
I think getting rid of periodic_reclaim_ready and deciding directly off
of reclaimable_bytes (which still gets reset after every reclaim
attempt) would make sense.
Thanks,
Boris
>
> Thanks,
> Sun YangKai
>
> > +
> > + /*
> > + * Net bytes freed or allocated since the last reclaim pass.
> > + */
> > + s64 reclaimable_bytes;
> > };
> >
> > struct reserve_ticket {
> > @@ -278,6 +289,9 @@ void btrfs_dump_space_info_for_trans_abort(struct btrfs_fs_info *fs_info);
> > void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info);
> > u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
> >
> > +void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes);
> > +void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready);
> > +bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info);
> > int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info);
> > int btrfs_reclaim_sweep(struct btrfs_fs_info *fs_info);
> >
> > --
> > 2.45.2
>
2024-06-17 23:11 [PATCH v2 0/6] btrfs: dynamic and periodic block_group reclaim Boris Burkov
2024-06-17 23:11 ` [PATCH v2 1/6] btrfs: report reclaim stats in sysfs Boris Burkov
2024-06-17 23:11 ` [PATCH v2 2/6] btrfs: store fs_info on space_info Boris Burkov
2024-06-17 23:11 ` [PATCH v2 3/6] btrfs: dynamic block_group reclaim threshold Boris Burkov
2024-06-25 13:40 ` Naohiro Aota
2024-06-17 23:11 ` [PATCH v2 4/6] btrfs: periodic block_group reclaim Boris Burkov
2024-06-17 23:11 ` [PATCH v2 5/6] btrfs: prevent pathological periodic reclaim loops Boris Burkov
2024-06-24 15:23 ` Josef Bacik
2024-06-24 16:05 ` David Sterba
2025-12-26 4:18 ` Sun Yangkai
2025-12-29 23:54 ` Boris Burkov
2024-06-17 23:11 ` [PATCH v2 6/6] btrfs: urgent periodic reclaim pass Boris Burkov
2024-06-24 15:25 ` [PATCH v2 0/6] btrfs: dynamic and periodic block_group reclaim Josef Bacik