public inbox for linux-btrfs@vger.kernel.org
* [RFC PATCH 0/6] btrfs: dynamic and periodic block_group reclaim
@ 2024-02-02 23:12 Boris Burkov
  2024-02-02 23:12 ` [PATCH 1/6] btrfs: report reclaim count in sysfs Boris Burkov
                   ` (6 more replies)
  0 siblings, 7 replies; 11+ messages in thread
From: Boris Burkov @ 2024-02-02 23:12 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Btrfs's block_group allocator suffers from a well-known problem: it is
capable of eagerly allocating too much space to either data or
metadata (most often data, absent bugs) and then later being unable to
allocate more space for the other, when needed. When data starves
metadata, this can, extra painfully, result in read-only filesystems
that need careful manual balancing to fix.

This can be worked around by:
- enabling automatic reclaim
- periodically running balance
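
For reference, the existing mechanisms look something like this (the
threshold value is illustrative):

  # fixed-threshold auto reclaim via the existing sysfs knob
  echo 50 > /sys/fs/btrfs/<uuid>/allocation/data/bg_reclaim_threshold
  # or a scheduled balance of mostly-empty data block groups
  btrfs balance start -dusage=50 /mnt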

Neither of these enjoys widespread use, as far as I know, though the
former is used at scale at Meta with good results.

This patch set expands on automatic reclaim, adding the ability to set a
dynamic reclaim threshold that appropriately scales with the global
filesystem allocation conditions, as well as periodic reclaim, which runs
the reclaim sweep in the cleaner thread. Together, I believe they
constitute a robust and general automatic reclaim system that should
avoid unfortunate read-only filesystems in all but extreme conditions,
where space is running quite low anyway and failure is more reasonable.
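
With the series applied, the new behavior is toggled per space_info via
sysfs, roughly like so (a sketch; the paths follow the attributes added
in these patches):

  echo 1 > /sys/fs/btrfs/<uuid>/allocation/data/dynamic_reclaim
  echo 1 > /sys/fs/btrfs/<uuid>/allocation/data/periodic_reclaim
  cat /sys/fs/btrfs/<uuid>/allocation/data/reclaim_count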

I ran it on three workloads (described in detail in the dynamic reclaim
patch); briefly, they are:
1. bounce allocations around X% full.
2. fill up all the way and introduce full fragmentation.
3. write in a fragmented way until the filesystem is just about full.
The script can be found here:
https://github.com/boryas/scripts/tree/main/fio/reclaim

The important results can be seen here (full results explorable at
bur.io/dyn-rec/)

bounce at 30%, much higher relocations with a fixed threshold:
https://bur.io/dyn-rec/bounce-30/relocs.png

hard 30% fragmentation, dynamic actually reclaims, relocs not crazy:
https://bur.io/dyn-rec/strict_frag-30/unalloc_bytes.png
https://bur.io/dyn-rec/strict_frag-30/relocs.png

fill it all the way up, not crazy churn, but saving a buffer:
https://bur.io/dyn-rec/last_gig/unalloc_bytes.png
https://bur.io/dyn-rec/last_gig/relocs.png
https://bur.io/dyn-rec/last_gig/thresh.png

Boris Burkov (6):
  btrfs: report reclaim count in sysfs
  btrfs: store fs_info on space_info
  btrfs: dynamic block_group reclaim threshold
  btrfs: periodic block_group reclaim
  btrfs: urgent periodic reclaim pass
  btrfs: prevent pathological periodic reclaim loops

 fs/btrfs/block-group.c |  26 ++++---
 fs/btrfs/block-group.h |   1 +
 fs/btrfs/space-info.c  | 165 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/space-info.h  |  28 +++++++
 fs/btrfs/sysfs.c       |  79 +++++++++++++++++++-
 5 files changed, 289 insertions(+), 10 deletions(-)

-- 
2.43.0



* [PATCH 1/6] btrfs: report reclaim count in sysfs
  2024-02-02 23:12 [RFC PATCH 0/6] btrfs: dynamic and periodic block_group reclaim Boris Burkov
@ 2024-02-02 23:12 ` Boris Burkov
  2024-02-02 23:12 ` [PATCH 2/6] btrfs: store fs_info on space_info Boris Burkov
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Boris Burkov @ 2024-02-02 23:12 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

When evaluating various reclaim strategies/thresholds against each
other, it is useful to collect data about the amount of reclaim
happening. Expose it via sysfs per space_info.
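
For example (illustrative path and output):

  $ cat /sys/fs/btrfs/<uuid>/allocation/metadata/reclaim_count
  3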

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/block-group.c | 3 +++
 fs/btrfs/space-info.h  | 6 ++++++
 fs/btrfs/sysfs.c       | 2 ++
 3 files changed, 11 insertions(+)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index a9be9ac99222..7f05fdcee199 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1858,6 +1858,9 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
 				div64_u64(bg->used * 100, bg->length),
 				div64_u64(zone_unusable * 100, bg->length));
 		trace_btrfs_reclaim_block_group(bg);
+		spin_lock(&space_info->lock);
+		space_info->reclaim_count++;
+		spin_unlock(&space_info->lock);
 		ret = btrfs_relocate_chunk(fs_info, bg->start);
 		if (ret) {
 			btrfs_dec_block_group_ro(bg);
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index 92c595fed1b0..da3e68612d5c 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -156,6 +156,12 @@ struct btrfs_space_info {
 
 	struct kobject kobj;
 	struct kobject *block_group_kobjs[BTRFS_NR_RAID_TYPES];
+
+	/*
+	 * Monotonically increasing counter of relocated block groups.
+	 * Exposed in /sys/fs/btrfs/<uuid>/allocation/<type>/reclaim_count
+	 */
+	u64 reclaim_count;
 };
 
 struct reserve_ticket {
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 84c05246ffd8..1b866b2a01ce 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -894,6 +894,7 @@ SPACE_INFO_ATTR(bytes_readonly);
 SPACE_INFO_ATTR(bytes_zone_unusable);
 SPACE_INFO_ATTR(disk_used);
 SPACE_INFO_ATTR(disk_total);
+SPACE_INFO_ATTR(reclaim_count);
 BTRFS_ATTR_RW(space_info, chunk_size, btrfs_chunk_size_show, btrfs_chunk_size_store);
 BTRFS_ATTR(space_info, size_classes, btrfs_size_classes_show);
 
@@ -949,6 +950,7 @@ static struct attribute *space_info_attrs[] = {
 	BTRFS_ATTR_PTR(space_info, bg_reclaim_threshold),
 	BTRFS_ATTR_PTR(space_info, chunk_size),
 	BTRFS_ATTR_PTR(space_info, size_classes),
+	BTRFS_ATTR_PTR(space_info, reclaim_count),
 #ifdef CONFIG_BTRFS_DEBUG
 	BTRFS_ATTR_PTR(space_info, force_chunk_alloc),
 #endif
-- 
2.43.0



* [PATCH 2/6] btrfs: store fs_info on space_info
  2024-02-02 23:12 [RFC PATCH 0/6] btrfs: dynamic and periodic block_group reclaim Boris Burkov
  2024-02-02 23:12 ` [PATCH 1/6] btrfs: report reclaim count in sysfs Boris Burkov
@ 2024-02-02 23:12 ` Boris Burkov
  2024-02-02 23:12 ` [PATCH 3/6] btrfs: dynamic block_group reclaim threshold Boris Burkov
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Boris Burkov @ 2024-02-02 23:12 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

This is handy when computing space_info dynamic reclaim thresholds where
we do not have access to a block group. We could add it to the various
functions as a parameter, but it seems reasonable for space_info to have
an fs_info pointer.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/space-info.c | 1 +
 fs/btrfs/space-info.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 571bb13587d5..f4a1e6341ca6 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -233,6 +233,7 @@ static int create_space_info(struct btrfs_fs_info *info, u64 flags)
 	if (!space_info)
 		return -ENOMEM;
 
+	space_info->fs_info = info;
 	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
 		INIT_LIST_HEAD(&space_info->block_groups[i]);
 	init_rwsem(&space_info->groups_sem);
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index da3e68612d5c..1cc4ef8dca38 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -85,6 +85,7 @@ enum btrfs_flush_state {
 };
 
 struct btrfs_space_info {
+	struct btrfs_fs_info *fs_info;
 	spinlock_t lock;
 
 	u64 total_bytes;	/* total bytes in the space,
-- 
2.43.0



* [PATCH 3/6] btrfs: dynamic block_group reclaim threshold
  2024-02-02 23:12 [RFC PATCH 0/6] btrfs: dynamic and periodic block_group reclaim Boris Burkov
  2024-02-02 23:12 ` [PATCH 1/6] btrfs: report reclaim count in sysfs Boris Burkov
  2024-02-02 23:12 ` [PATCH 2/6] btrfs: store fs_info on space_info Boris Burkov
@ 2024-02-02 23:12 ` Boris Burkov
  2024-02-02 23:12 ` [PATCH 4/6] btrfs: periodic block_group reclaim Boris Burkov
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Boris Burkov @ 2024-02-02 23:12 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

We can currently recover allocated block_groups by:
- explicitly starting balance operations
- "auto reclaim" via bg_reclaim_threshold

The latter works by checking usage against a fixed threshold on frees.
If a free takes the block group from above the threshold to below it,
relocation triggers and the block group will get reclaimed by the
cleaner thread (assuming it is still eligible).

Picking a threshold is challenging. Too high, and you end up trying to
reclaim very full block_groups, which is quite costly, while never
reclaiming block_groups that don't get quite THAT full but are still
fragmented enough to strand a lot of space. Too low, and you similarly
miss out on reclaim even when you badly need it to avoid running out of
unallocated space, because heavily fragmented block groups live above
the threshold.

Whatever the value, a fixed threshold suffers from workloads that
happen to bounce around it, which can introduce arbitrary amounts of
reclaim waste.
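
(Concretely: with a fixed 30% threshold, a block group that repeatedly
fills to ~35% and then frees back down to ~25% crosses the threshold on
every cycle, queueing a full relocation each time.)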

To improve this situation, introduce a dynamic threshold. The basic idea
behind this threshold is that it should be high when there is lots of
unused space, and little unallocated space, relative to fs size. OTOH,
when either unused is low or unallocated is high, reclaim is not that
important, so we can set a quite low threshold.

The formula to achieve this is:
(unused / allocated) * (unused / unallocated)
which is also clamped to 90%, as anything fuller than that is flat out
very challenging to reclaim, and means the filesystem is legitimately
quite full.
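
As a worked example (numbers made up for illustration): with 100GiB
allocated, 20GiB of that unused, and 50GiB unallocated, the threshold is
(20/100) * (20/50) = 20% * 40% = 8%, so reclaim stays quiet. Shrink
unallocated to 5GiB and it becomes 20% * 400% = 80%, biasing heavily
towards reclaim.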

I tested this by running it on three interesting workloads:
1. bounce allocations around X% full.
2. fill up all the way and introduce full fragmentation.
3. write in a fragmented way until the filesystem is just about full.

1. and 2. attack the weaknesses of a fixed threshold; fixed either works
perfectly or fully falls apart, depending on the threshold. Dynamic
always handles these cases well.

3. attacks dynamic by checking whether it is too zealous to reclaim in
conditions with low unallocated and low unused. It tends to claw back
1GiB of unallocated fairly aggressively, but not much more. Early
versions of dynamic threshold struggled on this test.

Additional work could be done to intelligently ratchet up the urgency of
reclaim in very low unallocated conditions. Existing mechanisms are
already useless in that case anyway.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/block-group.c | 18 +++++-----
 fs/btrfs/space-info.c  | 74 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/space-info.h  |  8 +++++
 fs/btrfs/sysfs.c       | 43 +++++++++++++++++++++++-
 4 files changed, 134 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 7f05fdcee199..6244c76f3584 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1702,24 +1702,21 @@ static inline bool btrfs_should_reclaim(struct btrfs_fs_info *fs_info)
 
 static bool should_reclaim_block_group(struct btrfs_block_group *bg, u64 bytes_freed)
 {
-	const struct btrfs_space_info *space_info = bg->space_info;
-	const int reclaim_thresh = READ_ONCE(space_info->bg_reclaim_threshold);
+	const int thresh_pct = btrfs_calc_reclaim_threshold(bg->space_info);
+	u64 thresh_bytes = mult_perc(bg->length, thresh_pct);
 	const u64 new_val = bg->used;
 	const u64 old_val = new_val + bytes_freed;
-	u64 thresh;
 
-	if (reclaim_thresh == 0)
+	if (thresh_bytes == 0)
 		return false;
 
-	thresh = mult_perc(bg->length, reclaim_thresh);
-
 	/*
 	 * If we were below the threshold before don't reclaim, we are likely a
 	 * brand new block group and we don't want to relocate new block groups.
 	 */
-	if (old_val < thresh)
+	if (old_val < thresh_bytes)
 		return false;
-	if (new_val >= thresh)
+	if (new_val >= thresh_bytes)
 		return false;
 	return true;
 }
@@ -1779,6 +1776,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
 		/* Don't race with allocators so take the groups_sem */
 		down_write(&space_info->groups_sem);
 
+		spin_lock(&space_info->lock);
 		spin_lock(&bg->lock);
 		if (bg->reserved || bg->pinned || bg->ro) {
 			/*
@@ -1788,6 +1786,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
 			 * this block group.
 			 */
 			spin_unlock(&bg->lock);
+			spin_unlock(&space_info->lock);
 			up_write(&space_info->groups_sem);
 			goto next;
 		}
@@ -1806,6 +1805,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
 			if (!btrfs_test_opt(fs_info, DISCARD_ASYNC))
 				btrfs_mark_bg_unused(bg);
 			spin_unlock(&bg->lock);
+			spin_unlock(&space_info->lock);
 			up_write(&space_info->groups_sem);
 			goto next;
 
@@ -1822,10 +1822,12 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
 		 */
 		if (!should_reclaim_block_group(bg, bg->length)) {
 			spin_unlock(&bg->lock);
+			spin_unlock(&space_info->lock);
 			up_write(&space_info->groups_sem);
 			goto next;
 		}
 		spin_unlock(&bg->lock);
+		spin_unlock(&space_info->lock);
 
 		/*
 		 * Get out fast, in case we're read-only or unmounting the
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index f4a1e6341ca6..86a87501af08 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#include <linux/minmax.h>
 #include "misc.h"
 #include "ctree.h"
 #include "space-info.h"
@@ -191,6 +192,8 @@ void btrfs_clear_space_info_full(struct btrfs_fs_info *info)
  */
 #define BTRFS_DEFAULT_ZONED_RECLAIM_THRESH			(75)
 
+#define BTRFS_DYNAMIC_RECLAIM_THRESH_MAX			(90)
+
 /*
  * Calculate chunk size depending on volume type (regular or zoned).
  */
@@ -1870,3 +1873,74 @@ u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo)
 
 	return free_bytes;
 }
+
+static u64 calc_pct_ratio(u64 x, u64 y)
+{
+	int err;
+
+	if (!y)
+		return 0;
+again:
+	err = check_mul_overflow(100, x, &x);
+	if (err)
+		goto lose_precision;
+	return div64_u64(x, y);
+lose_precision:
+	x >>= 10;
+	y >>= 10;
+	if (!y)
+		y = 1;
+	goto again;
+}
+
+/*
+ * The dynamic threshold formula is:
+ * (unused / allocated) * (unused / unallocated) or equivalently
+ * unused^2 / (allocated * unallocated)
+ *
+ * The fundamental goal of automatic reclaim is to protect the filesystem's
+ * unallocated space and thus minimize the probability of the filesystem going
+ * read only when a metadata allocation failure causes a transaction abort.
+ *
+ * However, relocations happen into the space_info's unused space, therefore
+ * automatic reclaim must also back off as that space runs low. There is no
+ * value in doing trivial "relocations" of re-writing the same block group
+ * into a fresh one.
+ *
+ * unused / allocated sets a baseline, very conservative threshold which
+ * properly goes to 0 as unused goes to a small portion of the allocated space.
+ *
+ * On its own, this would likely do very little reclaim, so include
+ * unused / unallocated (which can be greatly in excess of 100%) to bias heavily
+ * towards reclaim when unallocated goes low or unused goes high.
+ */
+
+static int calc_dynamic_reclaim_threshold(struct btrfs_space_info *space_info)
+{
+	struct btrfs_fs_info *fs_info = space_info->fs_info;
+	u64 unalloc = atomic64_read(&fs_info->free_chunk_space);
+	u64 alloc = space_info->total_bytes;
+	u64 used = btrfs_space_info_used(space_info, false);
+	u64 unused = alloc - used;
+	/* unused <= alloc; clamped to 100 */
+	int unused_pct = calc_pct_ratio(unused, alloc);
+	u64 unused_unalloc_ratio = calc_pct_ratio(unused, unalloc);
+	int err;
+	u64 thresh;
+
+	err = check_mul_overflow(unused_pct, unused_unalloc_ratio, &thresh);
+	if (err)
+		return BTRFS_DYNAMIC_RECLAIM_THRESH_MAX;
+	/* Both quantities are percentages; remove the squared factor of 100. */
+	thresh = div64_u64(thresh, 100);
+	return clamp_val(thresh, 0, BTRFS_DYNAMIC_RECLAIM_THRESH_MAX);
+}
+
+int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info)
+{
+	lockdep_assert_held(&space_info->lock);
+
+	if (READ_ONCE(space_info->dynamic_reclaim))
+		return calc_dynamic_reclaim_threshold(space_info);
+	return READ_ONCE(space_info->bg_reclaim_threshold);
+}
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index 1cc4ef8dca38..2f4c00525a08 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -163,6 +163,12 @@ struct btrfs_space_info {
 	 * Exposed in /sys/fs/btrfs/<uuid>/allocation/<type>/reclaim_count
 	 */
 	u64 reclaim_count;
+
+	/*
+	 * If true, use the dynamic relocation threshold, instead of the
+	 * fixed bg_reclaim_threshold.
+	 */
+	bool dynamic_reclaim;
 };
 
 struct reserve_ticket {
@@ -245,4 +251,6 @@ void btrfs_dump_space_info_for_trans_abort(struct btrfs_fs_info *fs_info);
 void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info);
 u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
 
+int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info);
+
 #endif /* BTRFS_SPACE_INFO_H */
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 1b866b2a01ce..0683a23e5254 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -903,8 +903,12 @@ static ssize_t btrfs_sinfo_bg_reclaim_threshold_show(struct kobject *kobj,
 						     char *buf)
 {
 	struct btrfs_space_info *space_info = to_space_info(kobj);
+	ssize_t ret;
 
-	return sysfs_emit(buf, "%d\n", READ_ONCE(space_info->bg_reclaim_threshold));
+	spin_lock(&space_info->lock);
+	ret = sysfs_emit(buf, "%d\n", btrfs_calc_reclaim_threshold(space_info));
+	spin_unlock(&space_info->lock);
+	return ret;
 }
 
 static ssize_t btrfs_sinfo_bg_reclaim_threshold_store(struct kobject *kobj,
@@ -915,6 +919,9 @@ static ssize_t btrfs_sinfo_bg_reclaim_threshold_store(struct kobject *kobj,
 	int thresh;
 	int ret;
 
+	if (READ_ONCE(space_info->dynamic_reclaim))
+		return -EINVAL;
+
 	ret = kstrtoint(buf, 10, &thresh);
 	if (ret)
 		return ret;
@@ -931,6 +938,39 @@ BTRFS_ATTR_RW(space_info, bg_reclaim_threshold,
 	      btrfs_sinfo_bg_reclaim_threshold_show,
 	      btrfs_sinfo_bg_reclaim_threshold_store);
 
+static ssize_t btrfs_sinfo_dynamic_reclaim_show(struct kobject *kobj,
+						struct kobj_attribute *a,
+						char *buf)
+{
+	struct btrfs_space_info *space_info = to_space_info(kobj);
+
+	return sysfs_emit(buf, "%d\n", READ_ONCE(space_info->dynamic_reclaim));
+}
+
+static ssize_t btrfs_sinfo_dynamic_reclaim_store(struct kobject *kobj,
+						 struct kobj_attribute *a,
+						 const char *buf, size_t len)
+{
+	struct btrfs_space_info *space_info = to_space_info(kobj);
+	int dynamic_reclaim;
+	int ret;
+
+	ret = kstrtoint(buf, 10, &dynamic_reclaim);
+	if (ret)
+		return ret;
+
+	if (dynamic_reclaim < 0)
+		return -EINVAL;
+
+	WRITE_ONCE(space_info->dynamic_reclaim, dynamic_reclaim != 0);
+
+	return len;
+}
+
+BTRFS_ATTR_RW(space_info, dynamic_reclaim,
+	      btrfs_sinfo_dynamic_reclaim_show,
+	      btrfs_sinfo_dynamic_reclaim_store);
+
 /*
  * Allocation information about block group types.
  *
@@ -948,6 +988,7 @@ static struct attribute *space_info_attrs[] = {
 	BTRFS_ATTR_PTR(space_info, disk_used),
 	BTRFS_ATTR_PTR(space_info, disk_total),
 	BTRFS_ATTR_PTR(space_info, bg_reclaim_threshold),
+	BTRFS_ATTR_PTR(space_info, dynamic_reclaim),
 	BTRFS_ATTR_PTR(space_info, chunk_size),
 	BTRFS_ATTR_PTR(space_info, size_classes),
 	BTRFS_ATTR_PTR(space_info, reclaim_count),
-- 
2.43.0



* [PATCH 4/6] btrfs: periodic block_group reclaim
  2024-02-02 23:12 [RFC PATCH 0/6] btrfs: dynamic and periodic block_group reclaim Boris Burkov
                   ` (2 preceding siblings ...)
  2024-02-02 23:12 ` [PATCH 3/6] btrfs: dynamic block_group reclaim threshold Boris Burkov
@ 2024-02-02 23:12 ` Boris Burkov
  2024-02-04 18:19   ` kernel test robot
  2024-02-02 23:12 ` [PATCH 5/6] btrfs: urgent periodic reclaim pass Boris Burkov
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 11+ messages in thread
From: Boris Burkov @ 2024-02-02 23:12 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

We currently employ an edge-triggered block group reclaim strategy which
marks block groups for reclaim as they free down past a threshold.

With a dynamic threshold, this is worse than doing it in a
level-triggered fashion periodically. That is because the reclaim
itself happens periodically, so the threshold at that point in time is
what really matters, not the threshold at freeing time. If we mark the
reclaim in a big pass, then sort by usage and do reclaim, we also
benefit from a negative feedback loop preventing unnecessary reclaims as
we crunch through the "best" candidates.

Since this is quite a different model, it requires some additional
support. The edge-triggered reclaim has a good heuristic for not
reclaiming fresh block groups, so we need to replace that with a typical
GC sweep mark which skips block groups that have seen an allocation
since the last sweep.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/block-group.c |  2 ++
 fs/btrfs/block-group.h |  1 +
 fs/btrfs/space-info.c  | 51 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/space-info.h  |  7 ++++++
 fs/btrfs/sysfs.c       | 34 ++++++++++++++++++++++++++++
 5 files changed, 95 insertions(+)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 6244c76f3584..1a752a8a1bea 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1898,6 +1898,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
 
 void btrfs_reclaim_bgs(struct btrfs_fs_info *fs_info)
 {
+	btrfs_reclaim_sweep(fs_info);
 	spin_lock(&fs_info->unused_bgs_lock);
 	if (!list_empty(&fs_info->reclaim_bgs))
 		queue_work(system_unbound_wq, &fs_info->reclaim_bgs_work);
@@ -3565,6 +3566,7 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
 		old_val += num_bytes;
 		cache->used = old_val;
 		cache->reserved -= num_bytes;
+		cache->reclaim_mark = 0;
 		space_info->bytes_reserved -= num_bytes;
 		space_info->bytes_used += num_bytes;
 		space_info->disk_used += num_bytes * factor;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index c4a1f01cc1c2..24b576b7a88c 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -250,6 +250,7 @@ struct btrfs_block_group {
 	struct work_struct zone_finish_work;
 	struct extent_buffer *last_eb;
 	enum btrfs_block_group_size_class size_class;
+	u64 reclaim_mark;
 };
 
 static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 86a87501af08..fc4e307669ef 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1944,3 +1944,54 @@ int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info)
 		return calc_dynamic_reclaim_threshold(space_info);
 	return READ_ONCE(space_info->bg_reclaim_threshold);
 }
+
+static int do_reclaim_sweep(struct btrfs_fs_info *fs_info,
+			    struct btrfs_space_info *space_info, int raid)
+{
+	struct btrfs_block_group *bg;
+	int thresh_pct;
+
+	spin_lock(&space_info->lock);
+	thresh_pct = btrfs_calc_reclaim_threshold(space_info);
+	spin_unlock(&space_info->lock);
+
+	down_read(&space_info->groups_sem);
+	list_for_each_entry(bg, &space_info->block_groups[raid], list) {
+		u64 thresh;
+		bool reclaim = false;
+
+		btrfs_get_block_group(bg);
+		spin_lock(&bg->lock);
+		thresh = mult_perc(bg->length, thresh_pct);
+		if (bg->used < thresh && bg->reclaim_mark)
+			reclaim = true;
+		bg->reclaim_mark++;
+		spin_unlock(&bg->lock);
+		if (reclaim)
+			btrfs_mark_bg_to_reclaim(bg);
+		btrfs_put_block_group(bg);
+	}
+	up_read(&space_info->groups_sem);
+	return 0;
+}
+
+int btrfs_reclaim_sweep(struct btrfs_fs_info *fs_info)
+{
+	int ret;
+	int raid;
+	struct btrfs_space_info *space_info;
+
+	list_for_each_entry(space_info, &fs_info->space_info, list) {
+		if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
+			continue;
+		if (!READ_ONCE(space_info->periodic_reclaim))
+			continue;
+		for (raid = 0; raid < BTRFS_NR_RAID_TYPES; raid++) {
+			ret = do_reclaim_sweep(fs_info, space_info, raid);
+			if (ret)
+				return ret;
+		}
+	}
+
+	return ret;
+}
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index 2f4c00525a08..2917bc4247db 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -169,6 +169,12 @@ struct btrfs_space_info {
 	 * fixed bg_reclaim_threshold.
 	 */
 	bool dynamic_reclaim;
+
+	/*
+	 * Periodically check all block groups against the reclaim
+	 * threshold in the cleaner thread.
+	 */
+	bool periodic_reclaim;
 };
 
 struct reserve_ticket {
@@ -252,5 +258,6 @@ void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info);
 u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
 
 int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info);
+int btrfs_reclaim_sweep(struct btrfs_fs_info *fs_info);
 
 #endif /* BTRFS_SPACE_INFO_H */
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 0683a23e5254..98bd8efaa2dc 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -971,6 +971,39 @@ BTRFS_ATTR_RW(space_info, dynamic_reclaim,
 	      btrfs_sinfo_dynamic_reclaim_show,
 	      btrfs_sinfo_dynamic_reclaim_store);
 
+static ssize_t btrfs_sinfo_periodic_reclaim_show(struct kobject *kobj,
+						struct kobj_attribute *a,
+						char *buf)
+{
+	struct btrfs_space_info *space_info = to_space_info(kobj);
+
+	return sysfs_emit(buf, "%d\n", READ_ONCE(space_info->periodic_reclaim));
+}
+
+static ssize_t btrfs_sinfo_periodic_reclaim_store(struct kobject *kobj,
+						 struct kobj_attribute *a,
+						 const char *buf, size_t len)
+{
+	struct btrfs_space_info *space_info = to_space_info(kobj);
+	int periodic_reclaim;
+	int ret;
+
+	ret = kstrtoint(buf, 10, &periodic_reclaim);
+	if (ret)
+		return ret;
+
+	if (periodic_reclaim < 0)
+		return -EINVAL;
+
+	WRITE_ONCE(space_info->periodic_reclaim, periodic_reclaim != 0);
+
+	return len;
+}
+
+BTRFS_ATTR_RW(space_info, periodic_reclaim,
+	      btrfs_sinfo_periodic_reclaim_show,
+	      btrfs_sinfo_periodic_reclaim_store);
+
 /*
  * Allocation information about block group types.
  *
@@ -992,6 +1025,7 @@ static struct attribute *space_info_attrs[] = {
 	BTRFS_ATTR_PTR(space_info, chunk_size),
 	BTRFS_ATTR_PTR(space_info, size_classes),
 	BTRFS_ATTR_PTR(space_info, reclaim_count),
+	BTRFS_ATTR_PTR(space_info, periodic_reclaim),
 #ifdef CONFIG_BTRFS_DEBUG
 	BTRFS_ATTR_PTR(space_info, force_chunk_alloc),
 #endif
-- 
2.43.0



* [PATCH 5/6] btrfs: urgent periodic reclaim pass
  2024-02-02 23:12 [RFC PATCH 0/6] btrfs: dynamic and periodic block_group reclaim Boris Burkov
                   ` (3 preceding siblings ...)
  2024-02-02 23:12 ` [PATCH 4/6] btrfs: periodic block_group reclaim Boris Burkov
@ 2024-02-02 23:12 ` Boris Burkov
  2024-02-02 23:12 ` [PATCH 6/6] btrfs: prevent pathological periodic reclaim loops Boris Burkov
  2024-02-06 14:55 ` [RFC PATCH 0/6] btrfs: dynamic and periodic block_group reclaim David Sterba
  6 siblings, 0 replies; 11+ messages in thread
From: Boris Burkov @ 2024-02-02 23:12 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Periodic reclaim attempts to avoid reclaiming block_groups that see
active use, via a sweep mark that gets cleared on allocation and set on
each sweep. In urgent
conditions where we have very little unallocated space, we want to be
able to override this mechanism.

Introduce a second pass that only happens if we fail to find a reclaim
candidate and reclaim is urgent. In that case, do a second pass where
all block groups are eligible.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/space-info.c | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index fc4e307669ef..7ec775979637 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1945,17 +1945,35 @@ int btrfs_calc_reclaim_threshold(struct btrfs_space_info *space_info)
 	return READ_ONCE(space_info->bg_reclaim_threshold);
 }
 
+/*
+ * Under "urgent" reclaim, we will reclaim even fresh block groups that have
+ * recently seen successful allocations, as we are desperate to reclaim
+ * whatever we can to avoid ENOSPC in a transaction leading to a readonly fs.
+ */
+static bool is_reclaim_urgent(struct btrfs_space_info *space_info)
+{
+	struct btrfs_fs_info *fs_info = space_info->fs_info;
+	u64 unalloc = atomic64_read(&fs_info->free_chunk_space);
+	u64 chunk_size = min(READ_ONCE(space_info->chunk_size), SZ_1G);
+
+	return unalloc < chunk_size;
+}
+
 static int do_reclaim_sweep(struct btrfs_fs_info *fs_info,
 			    struct btrfs_space_info *space_info, int raid)
 {
 	struct btrfs_block_group *bg;
 	int thresh_pct;
+	bool try_again = true;
+	bool urgent;
 
 	spin_lock(&space_info->lock);
+	urgent = is_reclaim_urgent(space_info);
 	thresh_pct = btrfs_calc_reclaim_threshold(space_info);
 	spin_unlock(&space_info->lock);
 
 	down_read(&space_info->groups_sem);
+again:
 	list_for_each_entry(bg, &space_info->block_groups[raid], list) {
 		u64 thresh;
 		bool reclaim = false;
@@ -1963,14 +1981,29 @@ static int do_reclaim_sweep(struct btrfs_fs_info *fs_info,
 		btrfs_get_block_group(bg);
 		spin_lock(&bg->lock);
 		thresh = mult_perc(bg->length, thresh_pct);
-		if (bg->used < thresh && bg->reclaim_mark)
+		if (bg->used < thresh && bg->reclaim_mark) {
+			try_again = false;
 			reclaim = true;
+		}
 		bg->reclaim_mark++;
 		spin_unlock(&bg->lock);
 		if (reclaim)
 			btrfs_mark_bg_to_reclaim(bg);
 		btrfs_put_block_group(bg);
 	}
+
+	/*
+	 * In situations where we are very motivated to reclaim (low unalloc)
+	 * use two passes to make the reclaim mark check best effort.
+	 *
+	 * If we have any staler groups, we don't touch the fresher ones, but if we
+	 * really need a block group, do take a fresh one.
+	 */
+	if (try_again && urgent) {
+		try_again = false;
+		goto again;
+	}
+
 	up_read(&space_info->groups_sem);
 	return 0;
 }
-- 
2.43.0



* [PATCH 6/6] btrfs: prevent pathological periodic reclaim loops
  2024-02-02 23:12 [RFC PATCH 0/6] btrfs: dynamic and periodic block_group reclaim Boris Burkov
                   ` (4 preceding siblings ...)
  2024-02-02 23:12 ` [PATCH 5/6] btrfs: urgent periodic reclaim pass Boris Burkov
@ 2024-02-02 23:12 ` Boris Burkov
  2024-02-06 14:55 ` [RFC PATCH 0/6] btrfs: dynamic and periodic block_group reclaim David Sterba
  6 siblings, 0 replies; 11+ messages in thread
From: Boris Burkov @ 2024-02-02 23:12 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Periodic reclaim runs the risk of getting stuck in a state where it
keeps reclaiming the same block group over and over. This can happen if:
1. reclaiming that block_group fails
2. reclaiming that block_group fails to move any extents into existing
   block_groups and just allocates a fresh chunk and moves everything.

Currently, 1. is a very tight loop inside the reclaim worker. That is
critical for edge-triggered reclaim, or else we risk forgetting about a
reclaimable group. On the other hand, with level-triggered reclaim we
can break out of that loop and get to it later.

With that fixed, 2. applies to both failures and "successes" with no
progress. If we have done a periodic reclaim pass on a space_info and
nothing has changed in that space_info, there is not much point in
trying again, so don't, until some space gets freed.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/block-group.c | 3 ++-
 fs/btrfs/space-info.c  | 6 ++++++
 fs/btrfs/space-info.h  | 6 ++++++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 1a752a8a1bea..41b9320d3d3b 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1871,7 +1871,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
 		}
 
 next:
-		if (ret)
+		if (ret && !READ_ONCE(space_info->periodic_reclaim))
 			btrfs_mark_bg_to_reclaim(bg);
 		btrfs_put_block_group(bg);
 
@@ -3580,6 +3580,7 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
 		space_info->bytes_used -= num_bytes;
 		space_info->disk_used -= num_bytes * factor;
 
+		space_info->periodic_reclaim_ready = true;
 		reclaim = should_reclaim_block_group(cache, num_bytes);
 
 		spin_unlock(&cache->lock);
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 7ec775979637..bef4d29c07dd 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1968,6 +1968,12 @@ static int do_reclaim_sweep(struct btrfs_fs_info *fs_info,
 	bool urgent;
 
 	spin_lock(&space_info->lock);
+	if (space_info->periodic_reclaim_ready) {
+		space_info->periodic_reclaim_ready = false;
+	} else {
+		spin_unlock(&space_info->lock);
+		return 0;
+	}
 	urgent = is_reclaim_urgent(space_info);
 	thresh_pct = btrfs_calc_reclaim_threshold(space_info);
 	spin_unlock(&space_info->lock);
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index 2917bc4247db..e6e3f82c2409 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -175,6 +175,12 @@ struct btrfs_space_info {
 	 * threshold in the cleaner thread.
 	 */
 	bool periodic_reclaim;
+
+	/*
+	 * Periodic reclaim should be a no-op if a space_info hasn't
+	 * freed any space since the last time we tried.
+	 */
+	bool periodic_reclaim_ready;
 };
 
 struct reserve_ticket {
-- 
2.43.0



* Re: [PATCH 4/6] btrfs: periodic block_group reclaim
  2024-02-02 23:12 ` [PATCH 4/6] btrfs: periodic block_group reclaim Boris Burkov
@ 2024-02-04 18:19   ` kernel test robot
  0 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2024-02-04 18:19 UTC (permalink / raw)
  To: Boris Burkov, linux-btrfs, kernel-team; +Cc: oe-kbuild-all

Hi Boris,

kernel test robot noticed the following build warnings:

[auto build test WARNING on kdave/for-next]
[also build test WARNING on linus/master v6.8-rc2 next-20240202]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Boris-Burkov/btrfs-report-reclaim-count-in-sysfs/20240203-071516
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
patch link:    https://lore.kernel.org/r/1173e535ec7b46bda33ed2dc4219027502763902.1706914865.git.boris%40bur.io
patch subject: [PATCH 4/6] btrfs: periodic block_group reclaim
config: i386-randconfig-141-20240204 (https://download.01.org/0day-ci/archive/20240205/202402050244.pa6hJds3-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202402050244.pa6hJds3-lkp@intel.com/

smatch warnings:
fs/btrfs/space-info.c:1996 btrfs_reclaim_sweep() error: uninitialized symbol 'ret'.

vim +/ret +1996 fs/btrfs/space-info.c

  1977	
  1978	int btrfs_reclaim_sweep(struct btrfs_fs_info *fs_info)
  1979	{
  1980		int ret;
  1981		int raid;
  1982		struct btrfs_space_info *space_info;
  1983	
  1984		list_for_each_entry(space_info, &fs_info->space_info, list) {
  1985			if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
  1986				continue;
  1987			if (!READ_ONCE(space_info->periodic_reclaim))
  1988				continue;
  1989			for (raid = 0; raid < BTRFS_NR_RAID_TYPES; raid++) {
  1990				ret = do_reclaim_sweep(fs_info, space_info, raid);
  1991				if (ret)
  1992					return ret;
  1993			}
  1994		}
  1995	
> 1996		return ret;

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [RFC PATCH 0/6] btrfs: dynamic and periodic block_group reclaim
  2024-02-02 23:12 [RFC PATCH 0/6] btrfs: dynamic and periodic block_group reclaim Boris Burkov
                   ` (5 preceding siblings ...)
  2024-02-02 23:12 ` [PATCH 6/6] btrfs: prevent pathological periodic reclaim loops Boris Burkov
@ 2024-02-06 14:55 ` David Sterba
  2024-02-06 22:07   ` Boris Burkov
  6 siblings, 1 reply; 11+ messages in thread
From: David Sterba @ 2024-02-06 14:55 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Fri, Feb 02, 2024 at 03:12:42PM -0800, Boris Burkov wrote:
> Btrfs's block_group allocator suffers from a well-known problem: it is
> capable of eagerly allocating too much space to either data or
> metadata (most often data, absent bugs) and then later being unable to
> allocate more space for the other, when needed. When data starves
> metadata, this can, extra painfully, result in read-only filesystems
> that need careful manual balancing to fix.
> 
> This can be worked around by:
> - enabling automatic reclaim
> - periodically running balance
> 
> Neither of these enjoys widespread use, as far as I know, though the
> former is used at scale at Meta with good results.

https://github.com/kdave/btrfsmaintenance is to my knowledge widely used
and installed on distros.  (Also my most starred project on github.)

The idea is to make the balance separate from kernel, allowing users and
administrators to easily tweak the parameters and timing. We haven't
added automatic reclaim to kernel as it tends to start at the worst
time. The jobs from btrfsmaintenance are scheduled according to the
calendar events (systemd.timer).
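
For illustration, such a timer stanza might look like this (not
necessarily the unit as shipped):

  [Timer]
  OnCalendar=weekly
  Persistent=true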

Also, the jobs don't have to be run at all if the package is not installed.

The problem with balancing amount of data and metadata chunks is known
and there are only heuristics, we can't solve that without knowing the
exact usage pattern.

> This patch set expands on automatic reclaim, adding the ability to set a
> dynamic reclaim threshold that appropriately scales with the global
> filesystem allocation conditions, as well as periodic reclaim, which runs
> the reclaim sweep in the cleaner thread. Together, I believe they
> constitute a robust and general automatic reclaim system that should
> avoid unfortunate read-only filesystems in all but extreme conditions,
> where space is running quite low anyway and failure is more reasonable.
> 
> I ran it on three workloads (described in detail in the dynamic reclaim
> patch); briefly, they are:
> 1. bounce allocations around X% full.
> 2. fill up all the way and introduce full fragmentation.
> 3. write in a fragmented way until the filesystem is just about full.
> The script can be found here:
> https://github.com/boryas/scripts/tree/main/fio/reclaim

A common workload on distros is regular system update (rolling distro)
with snapshots (snapper) and cleanup. This can create a lot of
underused block groups, both data and metadata. Reclaiming those
periodically was one of the founding ideas for the btrfsmaintenance
project.

The reclaim is needed to make the space more compact as the randomly
removed unused extents create holes for new data so this is a good
example for either scripted or automatic reclaim.

However you can also find use case where this would harm performance or
just waste IO as the data are short lived and shuffling around unused
block groups does not help much.

The exact parameters of auto reclaim also depend on the storage type; an
NVMe would probably be fine with any amount of data, an HDD not so much.

From your description above I can't tell the estimated frequency
of the reclaim. I understand that the urgent reclaim would start as
needed, but otherwise the frequency of reclaim of say 30% used block
groups can stay fine for a few days, as there are usually more new data
than deletions.

Also with more block groups around it's more likely to find good
candidates for the size classes and then do the placement.

> The important results can be seen here (full results explorable at
> bur.io/dyn-rec/)
> 
> bounce at 30%, much higher relocations with a fixed threshold:
> https://bur.io/dyn-rec/bounce-30/relocs.png
> 
> hard 30% fragmentation, dynamic actually reclaims, relocs not crazy:
> https://bur.io/dyn-rec/strict_frag-30/unalloc_bytes.png
> https://bur.io/dyn-rec/strict_frag-30/relocs.png
> 
> fill it all the way up, not crazy churn, but saving a buffer:
> https://bur.io/dyn-rec/last_gig/unalloc_bytes.png
> https://bur.io/dyn-rec/last_gig/relocs.png
> https://bur.io/dyn-rec/last_gig/thresh.png
> 
> Boris Burkov (6):
>   btrfs: report reclaim count in sysfs
>   btrfs: store fs_info on space_info
>   btrfs: dynamic block_group reclaim threshold
>   btrfs: periodic block_group reclaim
>   btrfs: urgent periodic reclaim pass
>   btrfs: prevent pathological periodic reclaim loops

So one thing is to have the mechanism for the reclaim, I think that's
the easy part, the tuning will be interesting.


* Re: [RFC PATCH 0/6] btrfs: dynamic and periodic block_group reclaim
  2024-02-06 14:55 ` [RFC PATCH 0/6] btrfs: dynamic and periodic block_group reclaim David Sterba
@ 2024-02-06 22:07   ` Boris Burkov
  2024-02-19 19:38     ` David Sterba
  0 siblings, 1 reply; 11+ messages in thread
From: Boris Burkov @ 2024-02-06 22:07 UTC (permalink / raw)
  To: David Sterba; +Cc: linux-btrfs, kernel-team

On Tue, Feb 06, 2024 at 03:55:24PM +0100, David Sterba wrote:
> On Fri, Feb 02, 2024 at 03:12:42PM -0800, Boris Burkov wrote:
> > Btrfs's block_group allocator suffers from a well-known problem: it is
> > capable of eagerly allocating too much space to either data or
> > metadata (most often data, absent bugs) and then later being unable to
> > allocate more space for the other, when needed. When data starves
> > metadata, this can, extra painfully, result in read-only filesystems
> > that need careful manual balancing to fix.
> > 
> > This can be worked around by:
> > - enabling automatic reclaim
> > - periodically running balance
> > 
> > Neither of these enjoys widespread use, as far as I know, though the
> > former is used at scale at Meta with good results.
> 
> https://github.com/kdave/btrfsmaintenance is to my knowledge widely used
> and installed on distros.  (Also my most starred project on github.)

Oh, cool, I'm glad that is out there and being used. I'm sorry for my ignorance.

> 
> The idea is to make the balance separate from kernel, allowing users and
> administrators to easily tweak the parameters and timing. We haven't
> added automatic reclaim to kernel as it tends to start at the worst
> time. The jobs from btrfsmaintenance are scheduled according to the
> calendar events (systemd.timer).

Makes sense.

> 
> Also, the jobs don't have to be run at all if the package is not installed.
> 
> The problem with balancing amount of data and metadata chunks is known
> and there are only heuristics, we can't solve that without knowing the
> exact usage pattern.

Agreed.

> 
> > This patch set expands on automatic reclaim, adding the ability to set a
> > dynamic reclaim threshold that appropriately scales with the global
> > filesystem allocation conditions, as well as periodic reclaim, which runs
> > the reclaim sweep in the cleaner thread. Together, I believe they
> > constitute a robust and general automatic reclaim system that should
> > avoid unfortunate read-only filesystems in all but extreme conditions,
> > where space is running quite low anyway and failure is more reasonable.
> > 
> > I ran it on three workloads (described in detail in the dynamic reclaim
> > patch); briefly, they are:
> > 1. bounce allocations around X% full.
> > 2. fill up all the way and introduce full fragmentation.
> > 3. write in a fragmented way until the filesystem is just about full.
> > The script can be found here:
> > https://github.com/boryas/scripts/tree/main/fio/reclaim
> 
> A common workload on distros is regular system update (rolling distro)
> with snapshots (snapper) and cleanup. This can create a lot of
> underused block groups, both data and metadata. Reclaiming those
> periodically was one of the founding ideas for the btrfsmaintenance
> project.

I believe this is pretty similar to my workload 2 in spirit, except I
haven't done much with snapshots. I would love to run this workload so
I'll try to set it up with a VM. If you have a script for it already, or
even tips for setting it up, I would be quite grateful :)

I think that the "lots of random deletes leave empty block groups"
workload is the most interesting one in general for reclaim, and I
think it's cool that it happens in the real world :)

> 
> The reclaim is needed to make the space more compact as the randomly
> removed unused extents create holes for new data so this is a good
> example for either scripted or automatic reclaim.
> 
> However you can also find use case where this would harm performance or
> just waste IO as the data are short lived and shuffling around unused
> block groups does not help much.

+1, definitely trying to avoid this.

> 
> The exact parameters of auto reclaim also depend on the storage type; an
> NVMe would probably be fine with any amount of data, an HDD not so much.

Good point, have only tested on NVMe. Definitely needs to be tunable to
not abuse HDDs.

> 
> From your description above I can't tell the estimated frequency
> of the reclaim. I understand that the urgent reclaim would start as
> needed, but otherwise the frequency of reclaim of say 30% used block
> groups can stay fine for a few days, as there are usually more new data
> than deletions.
> 
> Also with more block groups around it's more likely to find good
> candidates for the size classes and then do the placement.

I think talking about my workload 2 here is helpful. Roughly, it writes
out 100G on a ~110G disk, then deletes 70G in perfectly fragmenting
stripes, so if we were way too aggressive, or used the current
autoreclaim with an unlucky threshold, we would reclaim all 100
block_groups. Dynamic reclaim's threshold spikes up to max and relocates
7 block groups, which is enough for the negative feedback loop to bring
it back down to a low threshold and stop further reclaim.

see https://bur.io/dyn-rec/strict_frag-30/thresh.png for the threshold
curve and https://bur.io/dyn-rec/strict_frag-30/relocs.png for the
reclaim counts. (I didn't hack it up perfectly evilly to make the 30%
threshold config relocate 100 block groups in that graph, FWIW)

I will try to more systematically plot threshold curves to get a better
sense for how to cause the most reclaims possible for a worst case
estimate.

In case you were asking more about the period it runs at:
As written right now, it runs with every cleaner thread run, but skips
block_groups that got an allocation since the last cleaner thread run. I
think you make an excellent point that the rate would be much better as
something like "daily" or "weekly" rather than "minutely". That gives
more time to
reach a quiescent state, fill in gaps with small writes, etc. At the
minimum, I think the periodic reclaim ought to have a configurable period
with a relatively long default (this should help with HDD too?)

> 
> > The important results can be seen here (full results explorable at
> > bur.io/dyn-rec/)
> > 
> > bounce at 30%, much higher relocations with a fixed threshold:
> > https://bur.io/dyn-rec/bounce-30/relocs.png
> > 
> > hard 30% fragmentation, dynamic actually reclaims, relocs not crazy:
> > https://bur.io/dyn-rec/strict_frag-30/unalloc_bytes.png
> > https://bur.io/dyn-rec/strict_frag-30/relocs.png
> > 
> > fill it all the way up, not crazy churn, but saving a buffer:
> > https://bur.io/dyn-rec/last_gig/unalloc_bytes.png
> > https://bur.io/dyn-rec/last_gig/relocs.png
> > https://bur.io/dyn-rec/last_gig/thresh.png
> > 
> > Boris Burkov (6):
> >   btrfs: report reclaim count in sysfs
> >   btrfs: store fs_info on space_info
> >   btrfs: dynamic block_group reclaim threshold
> >   btrfs: periodic block_group reclaim
> >   btrfs: urgent periodic reclaim pass
> >   btrfs: prevent pathological periodic reclaim loops
> 
> So one thing is to have the mechanism for the reclaim, I think that's
> the easy part, the tuning will be interesting.

My 2c based on what I learned from this effort, and from your feedback:

Our two goals should be:
1. Avoid unnecessary reclaim, it wastes user resources and can hurt
   their system's performance.
2. Prevent unallocated=1MiB before it's too late.

I think anything with a fixed threshold is unlikely to fully achieve
either goal, as unlucky workloads will either operate below the
threshold and reclaim too much or above it and never reclaim.

I believe the dynamic threshold with a negative feedback loop is the
right sort of idea for achieving both goals. Ultimately, it is a
continuous function that encodes "reclaim at all costs when it's really
bad, don't reclaim much otherwise". I think it could also work to get
rid of the extra distraction from modelling it with a continuous
function and trying to encode the two goals more discretely/directly.

i.e.,
Very long period, low threshold periodic maintenance (basically exactly
btrfsmaintenance, doesn't need to be in kernel) and the kernel having "urgent"
conditions where it reclaims more aggressively in a limited way, just to get us
back to a few gigs of unalloc.

I also saw that btrfsmaintenance defaults to dusage=5 then dusage=10
which is lower (but similar to!) the quiescent state thresholds I have
seen in my tests (around 15-20). I may try to tune it to land around 10%
for most healthy fses, as that seems to be the safest number we know.

By the way, I think the dynamic threshold could be implemented fully in
userspace by using the limit flag of balance and recalculating the threshold
between each reclaim. Would you be more interested in experimenting with
that in btrfsmaintenance? I do think that in the long run, some kind of
"urgent unalloc protection" does belong in the kernel by default, assuming
we can really nail it down perfectly.
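
Roughly something like this in a maintenance script (a sketch; the
compute_dynamic_thresh helper is hypothetical and would recompute the
percentage from the sysfs allocation counters between iterations):

  thresh=$(compute_dynamic_thresh)
  btrfs balance start -dusage="$thresh",limit=1 "$mnt"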

Thanks for your feedback,
Boris


* Re: [RFC PATCH 0/6] btrfs: dynamic and periodic block_group reclaim
  2024-02-06 22:07   ` Boris Burkov
@ 2024-02-19 19:38     ` David Sterba
  0 siblings, 0 replies; 11+ messages in thread
From: David Sterba @ 2024-02-19 19:38 UTC (permalink / raw)
  To: Boris Burkov; +Cc: David Sterba, linux-btrfs, kernel-team

On Tue, Feb 06, 2024 at 02:07:52PM -0800, Boris Burkov wrote:
> On Tue, Feb 06, 2024 at 03:55:24PM +0100, David Sterba wrote:
> > On Fri, Feb 02, 2024 at 03:12:42PM -0800, Boris Burkov wrote:
> > A common workload on distros is regular system update (rolling distro)
> > with snapshots (snapper) and cleanup. This can create a lot of under
> > used block groups, both data and metadata. Reclaiming that preriodically
> > was one of the ground ideas for the btrfsmaintenance project.
> 
> I believe this is pretty similar to my workload 2 in spirit, except I
> haven't done much with snapshots. I would love to run this workload so
> I'll try to set it up with a VM. If you have a script for it already, or
> even tips for setting it up, I would be quite grateful :)
> 
> I think that the "lots of random deletes leave empty block groups"
> workload is the most interesting one in general for reclaim, and I
> think it's cool that it happens in the real world :)

As a simulation of that I'm using a git-based workload that randomly
checks out commits and does a snapshot. A once-working script is here:
https://github.com/kdave/testunion/blob/master/test-snapgit/startme
(I may have some updates but I'd have to find the most recent version).

The git repo used should provide large files too, so it's closer to what
e.g. rpm does.

> > The exact parameters of auto reclaim also depend on the storage type, an
> > NVMe would be probably fine with any amount of data, HDD not so much.
> 
> Good point, have only tested on NVMe. Definitely needs to be tunable to
> not abuse HDDs.

I think we'll need an automatic classification of devices; this is now
the third feature I know of that could use it (raid mirror balancing,
checksum offload and this one).

There's more to reply to, I'll continue on another day.

