linux-btrfs.vger.kernel.org archive mirror
* [PATCH 0/3] raid1 balancing methods
@ 2024-09-27  9:55 Anand Jain
  2024-09-27  9:55 ` [PATCH 1/3] btrfs: introduce RAID1 round-robin read balancing Anand Jain
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Anand Jain @ 2024-09-27  9:55 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, waxhead

RAID1 balancing methods help distribute read I/O across devices. This
series introduces three of them: rotation, latency, and devid. They are
enabled under the `CONFIG_BTRFS_DEBUG` config option and build on the
previously added `/sys/fs/btrfs/<UUID>/read_policy` interface, which
selects the desired RAID1 read balancing method.

I've tested these patches using fio and filesystem defragmentation
workloads on a two-device RAID1 setup (with both data and metadata
mirrored across identical devices). I tracked device read counts by
extracting stats from `/sys/devices/<..>/stat` for each device. Below is
a summary of the results, with each result the average of three
iterations.

A typical generic random rw workload:

$ fio --filename=/btrfs/foo --size=10Gi --direct=1 --rw=randrw --bs=4k \
  --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based \
  --group_reporting --name=iops-test-job --eta-newline=1

|         |            |            | Read I/O count  |
|         | Read       | Write      | devid1 | devid2 |
|---------|------------|------------|--------|--------|
| pid     | 29.4MiB/s  | 29.5MiB/s  | 456548 | 447975 |
| rotation| 29.3MiB/s  | 29.3MiB/s  | 450105 | 450055 |
| latency | 21.9MiB/s  | 21.9MiB/s  | 672387 | 0      |
| devid:1 | 22.0MiB/s  | 22.0MiB/s  | 674788 | 0      |

Defragmentation with compression workload:

$ xfs_io -f -d -c 'pwrite -S 0xab 0 1G' /btrfs/foo
$ sync
$ echo 3 > /proc/sys/vm/drop_caches
$ btrfs filesystem defrag -f -c /btrfs/foo

|         | Time  | Read I/O Count  |
|         | Real  | devid1 | devid2 |
|---------|-------|--------|--------|
| pid     | 21.61s| 3810   | 0      |
| rotation| 11.55s| 1905   | 1905   |
| latency | 20.99s| 0      | 3810   |
| devid:2 | 21.41s| 0      | 3810   |

. The PID-based balancing method works well for the generic random rw fio
  workload.
. The rotation method is ideal when you want to keep both devices active,
  and it boosts performance in sequential defragmentation scenarios.
. The latency-based method works well with mixed device types, or when one
  device experiences intermittent I/O failures: as that device's latency
  increases, the method automatically picks the other device for further
  read I/Os.
. The devid method is a more hands-on approach, useful for diagnosing and
  testing RAID1 mirror synchronization.
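
The defragmentation table can be read against a small userspace sketch of
the two selection rules. It is an illustration only, not the kernel code:
`total_reads` here is a plain C11 atomic standing in for the per-filesystem
counter, the PID 1234 used in the note below is made up, and the rotation
helper skips the devid sort that the patch performs.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the per-filesystem read counter. */
static atomic_uint total_reads;

/* PID policy: a given process always maps to the same mirror. */
static int pick_mirror_pid(int pid, int first, int num_stripes)
{
	assert(num_stripes > 0);
	return first + (pid % num_stripes);
}

/* Rotation policy: every read advances a shared counter. */
static int pick_mirror_rotation(int first, int num_stripes)
{
	unsigned int cnt;

	assert(num_stripes > 0);
	cnt = atomic_fetch_add(&total_reads, 1) + 1;
	return first + (int)(cnt % (unsigned int)num_stripes);
}

/* Issue nr_reads reads from one process and count reads per mirror. */
static void count_reads(int use_rotation, int pid, int nr_reads, int counts[2])
{
	counts[0] = counts[1] = 0;
	for (int i = 0; i < nr_reads; i++) {
		int m = use_rotation ? pick_mirror_rotation(0, 2)
				     : pick_mirror_pid(pid, 0, 2);
		counts[m]++;
	}
}
```

With a single reader (say PID 1234) issuing 3810 reads, the PID rule sends
every read to one mirror, while the counter splits them 1905/1905. That is
the same shape as the pid and rotation rows of the defrag table.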

Anand Jain (3):
  btrfs: introduce RAID1 round-robin read balancing
  btrfs: use the path with the lowest latency for RAID1 reads
  btrfs: add RAID1 preferred read device feature

 fs/btrfs/sysfs.c   |  94 ++++++++++++++++++++++++++++++-------
 fs/btrfs/volumes.c | 113 +++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  14 ++++++
 3 files changed, 205 insertions(+), 16 deletions(-)

-- 
2.46.0



* [PATCH 1/3] btrfs: introduce RAID1 round-robin read balancing
  2024-09-27  9:55 [PATCH 0/3] raid1 balancing methods Anand Jain
@ 2024-09-27  9:55 ` Anand Jain
  2024-09-27 10:10   ` Qu Wenruo
  2024-09-27  9:55 ` [PATCH 2/3] btrfs: use the path with the lowest latency for RAID1 reads Anand Jain
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Anand Jain @ 2024-09-27  9:55 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, waxhead

This feature balances I/O across the striped devices when reading from
RAID1 blocks.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/sysfs.c   |  4 ++++
 fs/btrfs/volumes.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  7 ++++++
 3 files changed, 64 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 03926ad467c9..18fb35a887c6 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1305,7 +1305,11 @@ static ssize_t btrfs_temp_fsid_show(struct kobject *kobj,
 }
 BTRFS_ATTR(, temp_fsid, btrfs_temp_fsid_show);
 
+#ifdef CONFIG_BTRFS_DEBUG
+static const char * const btrfs_read_policy_name[] = { "pid", "rotation" };
+#else
 static const char * const btrfs_read_policy_name[] = { "pid" };
+#endif
 
 static ssize_t btrfs_read_policy_show(struct kobject *kobj,
 				      struct kobj_attribute *a, char *buf)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 995b0647f538..c130a27386a7 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5859,6 +5859,54 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	return ret;
 }
 
+#ifdef CONFIG_BTRFS_DEBUG
+struct stripe_mirror {
+	u64 devid;
+	int map;
+};
+
+static int btrfs_cmp_devid(const void *a, const void *b)
+{
+	struct stripe_mirror *s1 = (struct stripe_mirror *)a;
+	struct stripe_mirror *s2 = (struct stripe_mirror *)b;
+
+	if (s1->devid < s2->devid)
+		return -1;
+	if (s1->devid > s2->devid)
+		return 1;
+	return 0;
+}
+
+static int btrfs_read_rotation(struct btrfs_chunk_map *map, int first,
+			       int num_stripe)
+{
+	struct stripe_mirror stripes[4] = {0}; //4: for testing, works for now.
+	struct btrfs_fs_devices *fs_devices;
+	u64 devid;
+	int index, j, cnt;
+	int next_stripe;
+
+	index = 0;
+	for (j = first; j < first + num_stripe; j++) {
+		devid = map->stripes[j].dev->devid;
+
+		stripes[index].devid = devid;
+		stripes[index].map = j;
+
+		index++;
+	}
+
+	sort(stripes, num_stripe, sizeof(struct stripe_mirror),
+	     btrfs_cmp_devid, NULL);
+
+	fs_devices = map->stripes[first].dev->fs_devices;
+	cnt = atomic_inc_return(&fs_devices->total_reads);
+	next_stripe = stripes[cnt % num_stripe].map;
+
+	return next_stripe;
+}
+#endif
+
 static int find_live_mirror(struct btrfs_fs_info *fs_info,
 			    struct btrfs_chunk_map *map, int first,
 			    int dev_replace_is_ongoing)
@@ -5888,6 +5936,11 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	case BTRFS_READ_POLICY_PID:
 		preferred_mirror = first + (current->pid % num_stripes);
 		break;
+#ifdef CONFIG_BTRFS_DEBUG
+	case BTRFS_READ_POLICY_ROTATION:
+		preferred_mirror = btrfs_read_rotation(map, first, num_stripes);
+		break;
+#endif
 	}
 
 	if (dev_replace_is_ongoing &&
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 4481575dd70f..81701217dbb9 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -303,6 +303,10 @@ enum btrfs_chunk_allocation_policy {
 enum btrfs_read_policy {
 	/* Use process PID to choose the stripe */
 	BTRFS_READ_POLICY_PID,
+#ifdef CONFIG_BTRFS_DEBUG
+	/* Balancing raid1 reads across all striped devices */
+	BTRFS_READ_POLICY_ROTATION,
+#endif
 	BTRFS_NR_READ_POLICY,
 };
 
@@ -431,6 +435,9 @@ struct btrfs_fs_devices {
 	enum btrfs_read_policy read_policy;
 
 #ifdef CONFIG_BTRFS_DEBUG
+	/* read counter for the filesystem */ 
+	atomic_t total_reads;
+
 	/* Checksum mode - offload it or do it synchronously. */
 	enum btrfs_offload_csum_mode offload_csum_mode;
 #endif
-- 
2.46.0



* [PATCH 2/3] btrfs: use the path with the lowest latency for RAID1 reads
  2024-09-27  9:55 [PATCH 0/3] raid1 balancing methods Anand Jain
  2024-09-27  9:55 ` [PATCH 1/3] btrfs: introduce RAID1 round-robin read balancing Anand Jain
@ 2024-09-27  9:55 ` Anand Jain
  2024-09-27 10:25   ` Qu Wenruo
  2024-09-27  9:55 ` [PATCH 3/3] btrfs: add RAID1 preferred read device feature Anand Jain
  2024-10-04 10:44 ` [PATCH 0/3] raid1 balancing methods Yuwei Han
  3 siblings, 1 reply; 10+ messages in thread
From: Anand Jain @ 2024-09-27  9:55 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, waxhead

This feature aims to direct the read I/O to the device with the lowest
known latency for reading RAID1 blocks.

echo "latency" > /sys/fs/btrfs/<UUID>/read_policy

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/sysfs.c   |  2 +-
 fs/btrfs/volumes.c | 40 ++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  2 ++
 3 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 18fb35a887c6..15abf931726c 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1306,7 +1306,7 @@ static ssize_t btrfs_temp_fsid_show(struct kobject *kobj,
 BTRFS_ATTR(, temp_fsid, btrfs_temp_fsid_show);
 
 #ifdef CONFIG_BTRFS_DEBUG
-static const char * const btrfs_read_policy_name[] = { "pid", "rotation" };
+static const char * const btrfs_read_policy_name[] = { "pid", "rotation", "latency" };
 #else
 static const char * const btrfs_read_policy_name[] = { "pid" };
 #endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c130a27386a7..20bc62d85b3b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -12,6 +12,9 @@
 #include <linux/uuid.h>
 #include <linux/list_sort.h>
 #include <linux/namei.h>
+#ifdef CONFIG_BTRFS_DEBUG
+#include <linux/part_stat.h>
+#endif
 #include "misc.h"
 #include "ctree.h"
 #include "disk-io.h"
@@ -5860,6 +5863,39 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 }
 
 #ifdef CONFIG_BTRFS_DEBUG
+static int btrfs_best_stripe(struct btrfs_fs_info *fs_info,
+			     struct btrfs_chunk_map *map, int first,
+			     int num_stripe)
+{
+	u64 est_wait = 0;
+	int best_stripe = 0;
+	int index;
+
+	for (index = first; index < first + num_stripe; index++) {
+		u64 read_wait;
+		u64 avg_wait = 0;
+		unsigned long read_ios;
+		struct btrfs_device *device = map->stripes[index].dev;
+
+		read_wait = part_stat_read(device->bdev, nsecs[READ]);
+		read_ios = part_stat_read(device->bdev, ios[READ]);
+
+		if (read_wait && read_ios && read_wait >= read_ios)
+			avg_wait = div_u64(read_wait, read_ios);
+		else
+			btrfs_debug_rl(fs_info,
+			"devid: %llu avg_wait ZERO read_wait %llu read_ios %lu",
+				       device->devid, read_wait, read_ios);
+
+		if (est_wait == 0 || est_wait > avg_wait) {
+			est_wait = avg_wait;
+			best_stripe = index;
+		}
+	}
+
+	return best_stripe;
+}
+
 struct stripe_mirror {
 	u64 devid;
 	int map;
@@ -5940,6 +5976,10 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	case BTRFS_READ_POLICY_ROTATION:
 		preferred_mirror = btrfs_read_rotation(map, first, num_stripes);
 		break;
+	case BTRFS_READ_POLICY_LATENCY:
+		preferred_mirror = btrfs_best_stripe(fs_info, map, first,
+								num_stripes);
+		break;
 #endif
 	}
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 81701217dbb9..09920ef76a9b 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -306,6 +306,8 @@ enum btrfs_read_policy {
 #ifdef CONFIG_BTRFS_DEBUG
 	/* Balancing raid1 reads across all striped devices */
 	BTRFS_READ_POLICY_ROTATION,
+	/* Use the lowest-latency device dynamically */
+	BTRFS_READ_POLICY_LATENCY,
 #endif
 	BTRFS_NR_READ_POLICY,
 };
-- 
2.46.0



* [PATCH 3/3] btrfs: add RAID1 preferred read device feature
  2024-09-27  9:55 [PATCH 0/3] raid1 balancing methods Anand Jain
  2024-09-27  9:55 ` [PATCH 1/3] btrfs: introduce RAID1 round-robin read balancing Anand Jain
  2024-09-27  9:55 ` [PATCH 2/3] btrfs: use the path with the lowest latency for RAID1 reads Anand Jain
@ 2024-09-27  9:55 ` Anand Jain
  2024-10-04 10:44 ` [PATCH 0/3] raid1 balancing methods Yuwei Han
  3 siblings, 0 replies; 10+ messages in thread
From: Anand Jain @ 2024-09-27  9:55 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, waxhead

When there's stale data on a mirrored device, this feature lets you choose
which device to read from. Mainly used for testing.

echo "devid:2" > /sys/fs/btrfs/<UUID>/read_policy
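
The `name:value` input format can be sketched in userspace as follows. This
is an assumption-laden illustration, not the patch's code:
`parse_read_policy()` is a hypothetical helper, and `strtoull()` stands in
for the kernel's `kstrtou64()`. Unlike the store handler in this patch, the
sketch copies the policy name instead of writing a terminator into the
input buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split "name" or "name:value" into its parts.
 * Returns 0 on success and -EINVAL on malformed input.
 */
static int parse_read_policy(const char *input, char *name, size_t name_len,
			     uint64_t *devid, int *has_devid)
{
	const char *colon = strchr(input, ':');
	size_t len = colon ? (size_t)(colon - input) : strcspn(input, "\n");
	char *end;

	if (len == 0 || len >= name_len)
		return -EINVAL;
	memcpy(name, input, len);
	name[len] = '\0';

	*has_devid = 0;
	*devid = 0;
	if (!colon)
		return 0;

	/* Require a digit right after the colon. */
	if (colon[1] < '0' || colon[1] > '9')
		return -EINVAL;
	errno = 0;
	*devid = strtoull(colon + 1, &end, 10);
	/* Allow one trailing newline, as written by echo. */
	if (*end == '\n')
		end++;
	if (errno || *end != '\0')
		return -EINVAL;
	*has_devid = 1;
	return 0;
}
```

For the input "devid:2\n" this yields the name "devid" and the value 2. A
real handler would still have to check that the device exists, as the patch
does with btrfs_find_device().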

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/sysfs.c   | 92 +++++++++++++++++++++++++++++++++++++---------
 fs/btrfs/volumes.c | 20 ++++++++++
 fs/btrfs/volumes.h |  5 +++
 3 files changed, 100 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 15abf931726c..e32999ea761d 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1306,7 +1306,7 @@ static ssize_t btrfs_temp_fsid_show(struct kobject *kobj,
 BTRFS_ATTR(, temp_fsid, btrfs_temp_fsid_show);
 
 #ifdef CONFIG_BTRFS_DEBUG
-static const char * const btrfs_read_policy_name[] = { "pid", "rotation", "latency" };
+static const char * const btrfs_read_policy_name[] = { "pid", "rotation", "latency", "devid" };
 #else
 static const char * const btrfs_read_policy_name[] = { "pid" };
 #endif
@@ -1320,14 +1320,22 @@ static ssize_t btrfs_read_policy_show(struct kobject *kobj,
 	int i;
 
 	for (i = 0; i < BTRFS_NR_READ_POLICY; i++) {
-		if (policy == i)
-			ret += sysfs_emit_at(buf, ret, "%s[%s]",
-					 (ret == 0 ? "" : " "),
-					 btrfs_read_policy_name[i]);
-		else
-			ret += sysfs_emit_at(buf, ret, "%s%s",
-					 (ret == 0 ? "" : " "),
-					 btrfs_read_policy_name[i]);
+		if (ret != 0)
+			ret += sysfs_emit_at(buf, ret, " ");
+
+		if (i == policy)
+			ret += sysfs_emit_at(buf, ret, "[");
+
+		ret += sysfs_emit_at(buf, ret, "%s", btrfs_read_policy_name[i]);
+
+#ifdef CONFIG_BTRFS_DEBUG
+		if (i == BTRFS_READ_POLICY_DEVID)
+			ret += sysfs_emit_at(buf, ret, ":%llu",
+							fs_devices->read_devid);
+#endif
+
+		if (i == policy)
+			ret += sysfs_emit_at(buf, ret, "]");
 	}
 
 	ret += sysfs_emit_at(buf, ret, "\n");
@@ -1340,21 +1348,71 @@ static ssize_t btrfs_read_policy_store(struct kobject *kobj,
 				       const char *buf, size_t len)
 {
 	struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj);
+	char *value;
+#ifdef CONFIG_BTRFS_DEBUG
+	u64 devid = 0;
+#endif
+	int index = -1;
 	int i;
+	bool changed = false;
+
+	value = strchr(buf, ':');
+	if (value) {
+		*value = '\0';
+		value = value + 1;
+	}
 
 	for (i = 0; i < BTRFS_NR_READ_POLICY; i++) {
 		if (sysfs_streq(buf, btrfs_read_policy_name[i])) {
-			if (i != READ_ONCE(fs_devices->read_policy)) {
-				WRITE_ONCE(fs_devices->read_policy, i);
-				btrfs_info(fs_devices->fs_info,
-					   "read policy set to '%s'",
-					   btrfs_read_policy_name[i]);
-			}
-			return len;
+			index = i;
+			break;
+		}
+	}
+
+	if (index == -1)
+		return -EINVAL;
+
+#ifdef CONFIG_BTRFS_DEBUG
+	/* Extract values from input in devid:value format */
+	if (index == BTRFS_READ_POLICY_DEVID) {
+		BTRFS_DEV_LOOKUP_ARGS(args);
+
+		if (value == NULL || kstrtou64(value, 10, &devid))
+			return -EINVAL;
+
+		args.devid = devid;
+		if (btrfs_find_device(fs_devices, &args) == NULL)
+			return -EINVAL;
+
+		if (READ_ONCE(fs_devices->read_devid) != devid) {
+			WRITE_ONCE(fs_devices->read_devid, devid);
+			changed = true;
 		}
 	}
+#endif
+
+	if (index != READ_ONCE(fs_devices->read_policy)) {
+		WRITE_ONCE(fs_devices->read_policy, index);
+		changed = true;
+	}
+
+	if (changed) {
+#ifdef CONFIG_BTRFS_DEBUG
+		if (devid)
+			btrfs_info(fs_devices->fs_info,
+				   "read policy set to '%s:%llu'",
+				   btrfs_read_policy_name[index], devid);
+		else
+			btrfs_info(fs_devices->fs_info,
+				   "read policy set to '%s'",
+				   btrfs_read_policy_name[index]);
+#else
+		btrfs_info(fs_devices->fs_info, "read policy set to '%s'",
+			   btrfs_read_policy_name[index]);
+#endif
+	}
 
-	return -EINVAL;
+	return len;
 }
 BTRFS_ATTR_RW(, read_policy, btrfs_read_policy_show, btrfs_read_policy_store);
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 20bc62d85b3b..c49ca48e7b2e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5863,6 +5863,23 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 }
 
 #ifdef CONFIG_BTRFS_DEBUG
+static int btrfs_read_preferred(struct btrfs_chunk_map *map, int first,
+				int num_stripe)
+{
+	int last = first + num_stripe;
+	int stripe_index;
+
+	for (stripe_index = first; stripe_index < last; stripe_index++) {
+		struct btrfs_device *device = map->stripes[stripe_index].dev;
+
+		if (device->devid == READ_ONCE(device->fs_devices->read_devid))
+			return stripe_index;
+	}
+
+	/* If no read-preferred device, use first stripe */
+	return first;
+}
+
 static int btrfs_best_stripe(struct btrfs_fs_info *fs_info,
 			     struct btrfs_chunk_map *map, int first,
 			     int num_stripe)
@@ -5980,6 +5997,9 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 		preferred_mirror = btrfs_best_stripe(fs_info, map, first,
 								num_stripes);
 		break;
+	case BTRFS_READ_POLICY_DEVID:
+		preferred_mirror = btrfs_read_preferred(map, first, num_stripes);
+		break;
 #endif
 	}
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 09920ef76a9b..9850edaafe8c 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -308,6 +308,8 @@ enum btrfs_read_policy {
 	BTRFS_READ_POLICY_ROTATION,
 	/* Use the lowest-latency device dynamically */
 	BTRFS_READ_POLICY_LATENCY,
+	/* Read from the specific device */
+	BTRFS_READ_POLICY_DEVID,
 #endif
 	BTRFS_NR_READ_POLICY,
 };
@@ -440,6 +442,9 @@ struct btrfs_fs_devices {
 	/* read counter for the filesystem */ 
 	atomic_t total_reads;
 
+	/* Device to be used for reading in case of RAID1 */
+	u64 read_devid;
+
 	/* Checksum mode - offload it or do it synchronously. */
 	enum btrfs_offload_csum_mode offload_csum_mode;
 #endif
-- 
2.46.0



* Re: [PATCH 1/3] btrfs: introduce RAID1 round-robin read balancing
  2024-09-27  9:55 ` [PATCH 1/3] btrfs: introduce RAID1 round-robin read balancing Anand Jain
@ 2024-09-27 10:10   ` Qu Wenruo
  2024-10-11  1:21     ` Anand Jain
  0 siblings, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2024-09-27 10:10 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs; +Cc: dsterba, waxhead



On 2024/9/27 19:25, Anand Jain wrote:
> This feature balances I/O across the striped devices when reading from
> RAID1 blocks.
>
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
>   fs/btrfs/sysfs.c   |  4 ++++
>   fs/btrfs/volumes.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/volumes.h |  7 ++++++
>   3 files changed, 64 insertions(+)
>
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 03926ad467c9..18fb35a887c6 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -1305,7 +1305,11 @@ static ssize_t btrfs_temp_fsid_show(struct kobject *kobj,
>   }
>   BTRFS_ATTR(, temp_fsid, btrfs_temp_fsid_show);
>
> +#ifdef CONFIG_BTRFS_DEBUG
> +static const char * const btrfs_read_policy_name[] = { "pid", "rotation" };
> +#else
>   static const char * const btrfs_read_policy_name[] = { "pid" };
> +#endif
>
>   static ssize_t btrfs_read_policy_show(struct kobject *kobj,
>   				      struct kobj_attribute *a, char *buf)
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 995b0647f538..c130a27386a7 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5859,6 +5859,54 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
>   	return ret;
>   }
>
> +#ifdef CONFIG_BTRFS_DEBUG

It would be much better to utilize CONFIG_BTRFS_EXPERIMENTAL.
CONFIG_BTRFS_DEBUG is now for pure debug purposes.
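
A minimal userspace sketch of such gating, assuming the macro is simply
defined at build time (here it is defined by hand). The `read_policy`
enum, `read_policy_name` table and `lookup_read_policy()` are illustrative
names, not the kernel's. Designated initializers plus a static assertion
keep the enum and the name table in sync whichever way the option is set.

```c
#include <assert.h>
#include <string.h>

/* Assumption: in a real build Kconfig would define this, not the source. */
#define CONFIG_BTRFS_EXPERIMENTAL 1

enum read_policy {
	READ_POLICY_PID,
#ifdef CONFIG_BTRFS_EXPERIMENTAL
	READ_POLICY_ROTATION,
#endif
	NR_READ_POLICY,
};

static const char * const read_policy_name[] = {
	[READ_POLICY_PID] = "pid",
#ifdef CONFIG_BTRFS_EXPERIMENTAL
	[READ_POLICY_ROTATION] = "rotation",
#endif
};

/* Fails to compile if a policy is added without a matching name. */
static_assert(sizeof(read_policy_name) / sizeof(read_policy_name[0]) ==
	      NR_READ_POLICY, "policy names out of sync with enum");

/* Return the policy index for a name, or -1 if it is not compiled in. */
static int lookup_read_policy(const char *name)
{
	for (int i = 0; i < NR_READ_POLICY; i++) {
		if (strcmp(name, read_policy_name[i]) == 0)
			return i;
	}
	return -1;
}
```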

Thanks,
Qu




* Re: [PATCH 2/3] btrfs: use the path with the lowest latency for RAID1 reads
  2024-09-27  9:55 ` [PATCH 2/3] btrfs: use the path with the lowest latency for RAID1 reads Anand Jain
@ 2024-09-27 10:25   ` Qu Wenruo
  2024-10-11  1:21     ` Anand Jain
  0 siblings, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2024-09-27 10:25 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs; +Cc: dsterba, waxhead



On 2024/9/27 19:25, Anand Jain wrote:
> This feature aims to direct the read I/O to the device with the lowest
> known latency for reading RAID1 blocks.
>
> echo "latency" > /sys/fs/btrfs/<UUID>/read_policy
>
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
>   fs/btrfs/sysfs.c   |  2 +-
>   fs/btrfs/volumes.c | 40 ++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/volumes.h |  2 ++
>   3 files changed, 43 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 18fb35a887c6..15abf931726c 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -1306,7 +1306,7 @@ static ssize_t btrfs_temp_fsid_show(struct kobject *kobj,
>   BTRFS_ATTR(, temp_fsid, btrfs_temp_fsid_show);
>
>   #ifdef CONFIG_BTRFS_DEBUG
> -static const char * const btrfs_read_policy_name[] = { "pid", "rotation" };
> +static const char * const btrfs_read_policy_name[] = { "pid", "rotation", "latency" };
>   #else
>   static const char * const btrfs_read_policy_name[] = { "pid" };
>   #endif
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index c130a27386a7..20bc62d85b3b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -12,6 +12,9 @@
>   #include <linux/uuid.h>
>   #include <linux/list_sort.h>
>   #include <linux/namei.h>
> +#ifdef CONFIG_BTRFS_DEBUG
> +#include <linux/part_stat.h>
> +#endif
>   #include "misc.h"
>   #include "ctree.h"
>   #include "disk-io.h"
> @@ -5860,6 +5863,39 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
>   }
>
>   #ifdef CONFIG_BTRFS_DEBUG
> +static int btrfs_best_stripe(struct btrfs_fs_info *fs_info,
> +			     struct btrfs_chunk_map *map, int first,
> +			     int num_stripe)
> +{
> +	u64 est_wait = 0;

Is this a typo of best_wait? Or do you mean estimated wait?

> +	int best_stripe = 0;
> +	int index;
> +
> +	for (index = first; index < first + num_stripe; index++) {
> +		u64 read_wait;
> +		u64 avg_wait = 0;
> +		unsigned long read_ios;
> +		struct btrfs_device *device = map->stripes[index].dev;
> +
> +		read_wait = part_stat_read(device->bdev, nsecs[READ]);
> +		read_ios = part_stat_read(device->bdev, ios[READ]);
> +
> +		if (read_wait && read_ios && read_wait >= read_ios)
> +			avg_wait = div_u64(read_wait, read_ios);
> +		else
> +			btrfs_debug_rl(fs_info,
> +			"devid: %llu avg_wait ZERO read_wait %llu read_ios %lu",
> +				       device->devid, read_wait, read_ios);

I do not think we need this debug message.

The device may simply have had no reads so far.

> +
> +		if (est_wait == 0 || est_wait > avg_wait) {

You can give @est_wait a default value of U64_MAX, so that you do not
need to check for 0, just est_wait > avg_wait is enough.
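
A minimal userspace sketch of that suggestion, under stated assumptions:
`struct dev_stats` is a hypothetical stand-in for the counters read via
part_stat_read(), `UINT64_MAX` plays the role of U64_MAX, and the debug
message is dropped.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-device counters. */
struct dev_stats {
	uint64_t read_nsecs;	/* total time spent on reads */
	uint64_t read_ios;	/* number of completed reads */
};

static int best_stripe(const struct dev_stats *devs, int first, int num_stripe)
{
	uint64_t best_wait = UINT64_MAX;
	int best = first;

	assert(num_stripe > 0);
	for (int i = first; i < first + num_stripe; i++) {
		/* A device with no reads yet gets an average of 0. */
		uint64_t avg_wait = devs[i].read_ios ?
				    devs[i].read_nsecs / devs[i].read_ios : 0;

		if (avg_wait < best_wait) {
			best_wait = avg_wait;
			best = i;
		}
	}
	return best;
}
```

One behavioural difference from the posted loop is worth noting: with this
form a device that has never been read from (average 0) wins the
comparison, and ties go to the lower stripe index.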

Thanks,
Qu

> +			est_wait = avg_wait;
> +			best_stripe = index;
> +		}
> +	}
> +
> +	return best_stripe;
> +}



* Re: [PATCH 0/3] raid1 balancing methods
  2024-09-27  9:55 [PATCH 0/3] raid1 balancing methods Anand Jain
                   ` (2 preceding siblings ...)
  2024-09-27  9:55 ` [PATCH 3/3] btrfs: add RAID1 preferred read device feature Anand Jain
@ 2024-10-04 10:44 ` Yuwei Han
  2024-10-11  1:20   ` Anand Jain
  3 siblings, 1 reply; 10+ messages in thread
From: Yuwei Han @ 2024-10-04 10:44 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs; +Cc: dsterba, waxhead


On 2024/9/27 17:55, Anand Jain wrote:
> The RAID1-balancing methods helps distribute read I/O across devices, and
> this patch introduces three balancing methods: rotation, latency, and
> devid. These methods are enabled under the `CONFIG_BTRFS_DEBUG` config
> option and are on top of the previously added
> `/sys/fs/btrfs/<UUID>/read_policy` interface to configure the desired
> RAID1 read balancing method.
I am currently testing this on 6.12-rc1 with the rotation policy; it seems
good so far.
It would be better if the policy could also be set via mount options.

HAN Yuwei



* Re: [PATCH 0/3] raid1 balancing methods
  2024-10-04 10:44 ` [PATCH 0/3] raid1 balancing methods Yuwei Han
@ 2024-10-11  1:20   ` Anand Jain
  0 siblings, 0 replies; 10+ messages in thread
From: Anand Jain @ 2024-10-11  1:20 UTC (permalink / raw)
  To: Yuwei Han, linux-btrfs; +Cc: dsterba, waxhead

On 4/10/24 4:14 pm, Yuwei Han wrote:
> 
> On 2024/9/27 17:55, Anand Jain wrote:
>> The RAID1-balancing methods helps distribute read I/O across devices, and
>> this patch introduces three balancing methods: rotation, latency, and
>> devid. These methods are enabled under the `CONFIG_BTRFS_DEBUG` config
>> option and are on top of the previously added
>> `/sys/fs/btrfs/<UUID>/read_policy` interface to configure the desired
>> RAID1 read balancing method.
> I am currently testing this on 6.12-rc1 with policy rotation, seems good 
> for now.

Thanks for testing and reviewing.

> Would be better if policy can be set in mount options.

I think it is a good idea. However, we should also consolidate our
sysfs knobs, mount options, and btrfs properties where applicable,
and make them easy to use.

V2 is on the mailing list.

Thanks,
-Anand

> 
> HAN Yuwei
>>
>> I've tested these patches using fio and filesystem defragmentation
>> workloads on a two-device RAID1 setup (with both data and metadata
>> mirrored across identical devices). I tracked device read counts by
>> extracting stats from `/sys/devices/<..>/stat` for each device. Below is
>> a summary of the results, with each result the average of three
>> iterations.
>>
>> A typical generic random rw workload:
>>
>> $ fio --filename=/btrfs/foo --size=10Gi --direct=1 --rw=randrw --bs=4k \
>>    --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 -- 
>> time_based \
>>    --group_reporting --name=iops-test-job --eta-newline=1
>>
>> |         |            |            | Read I/O count  |
>> |         | Read       | Write      | devid1 | devid2 |
>> |---------|------------|------------|--------|--------|
>> | pid     | 29.4MiB/s  | 29.5MiB/s  | 456548 | 447975 |
>> | rotation| 29.3MiB/s  | 29.3MiB/s  | 450105 | 450055 |
>> | latency | 21.9MiB/s  | 21.9MiB/s  | 672387 | 0      |
>> | devid:1 | 22.0MiB/s  | 22.0MiB/s  | 674788 | 0      |
>>
>> Defragmentation with compression workload:
>>
>> $ xfs_io -f -d -c 'pwrite -S 0xab 0 1G' /btrfs/foo
>> $ sync
>> $ echo 3 > /proc/sys/vm/drop_caches
>> $ btrfs filesystem defrag -f -c /btrfs/foo
>>
>> |         | Time  | Read I/O Count  |
>> |         | Real  | devid1 | devid2 |
>> |---------|-------|--------|--------|
>> | pid     | 21.61s| 3810   | 0      |
>> | rotation| 11.55s| 1905   | 1905   |
>> | latency | 20.99s| 0      | 3810   |
>> | devid:2 | 21.41s| 0      | 3810   |
>>
>> . The PID-based balancing method works well for the generic random rw fio
>>    workload.
>> . The rotation method is ideal when you want to keep both devices active,
>>    and it boosts performance in sequential defragmentation scenarios.
>> . The latency-based method works well with mixed device types, or when
>>    one device experiences intermittent I/O failures: its latency
>>    increases, and the method automatically picks the other device for
>>    further read I/O.
>> . The devid method is a more hands-on approach, useful for diagnosing and
>>    testing RAID1 mirror synchronizations.
>>
>> Anand Jain (3):
>>    btrfs: introduce RAID1 round-robin read balancing
>>    btrfs: use the path with the lowest latency for RAID1 reads
>>    btrfs: add RAID1 preferred read device feature
>>
>>   fs/btrfs/sysfs.c   |  94 ++++++++++++++++++++++++++++++-------
>>   fs/btrfs/volumes.c | 113 +++++++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/volumes.h |  14 ++++++
>>   3 files changed, 205 insertions(+), 16 deletions(-)
>>
> 
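
For anyone reproducing the per-device read counts quoted above: a minimal
sketch of how the read counters in a block device `stat` file can be parsed
and differenced around a test run. It assumes the standard layout from the
kernel's block stat documentation, where the first four fields are read
I/Os, read merges, read sectors and read ticks (ms). The struct and
function names are hypothetical, and the sysfs path is left to the caller
because it is elided in the cover letter.

```c
#include <stdio.h>

/* Hypothetical container for the first four fields of a block "stat" line. */
struct disk_read_stat {
	unsigned long long read_ios;      /* reads completed */
	unsigned long long read_merges;   /* reads merged in the queue */
	unsigned long long read_sectors;  /* 512-byte sectors read */
	unsigned long long read_ticks_ms; /* total time spent on reads, in ms */
};

/* Parse one stat line. Returns 0 on success, -1 if the line is malformed. */
static int parse_read_stat(const char *line, struct disk_read_stat *out)
{
	int n = sscanf(line, "%llu %llu %llu %llu",
		       &out->read_ios, &out->read_merges,
		       &out->read_sectors, &out->read_ticks_ms);

	return n == 4 ? 0 : -1;
}

/*
 * Number of reads completed between two snapshots of the same stat file,
 * for example one taken before and one taken after a benchmark run.
 */
static int reads_between(const char *before, const char *after,
			 unsigned long long *delta)
{
	struct disk_read_stat b, a;

	if (parse_read_stat(before, &b) || parse_read_stat(after, &a))
		return -1;

	*delta = a.read_ios - b.read_ios;
	return 0;
}
```

Snapshot each mirror's stat file before and after the workload and pass the
two lines to reads_between(); the result is the read I/O count that device
served during the run.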



* Re: [PATCH 2/3] btrfs: use the path with the lowest latency for RAID1 reads
  2024-09-27 10:25   ` Qu Wenruo
@ 2024-10-11  1:21     ` Anand Jain
  0 siblings, 0 replies; 10+ messages in thread
From: Anand Jain @ 2024-10-11  1:21 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: dsterba, waxhead



>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index c130a27386a7..20bc62d85b3b 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -12,6 +12,9 @@
>>   #include <linux/uuid.h>
>>   #include <linux/list_sort.h>
>>   #include <linux/namei.h>
>> +#ifdef CONFIG_BTRFS_DEBUG
>> +#include <linux/part_stat.h>
>> +#endif
>>   #include "misc.h"
>>   #include "ctree.h"
>>   #include "disk-io.h"
>> @@ -5860,6 +5863,39 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
>>   }
>>
>>   #ifdef CONFIG_BTRFS_DEBUG
>> +static int btrfs_best_stripe(struct btrfs_fs_info *fs_info,
>> +                 struct btrfs_chunk_map *map, int first,
>> +                 int num_stripe)
>> +{
>> +    u64 est_wait = 0;
> 
> Is this a typo of best_wait? Or do you mean estimated wait?
> 

It is best_wait. Fixed in v2.


>> +    int best_stripe = 0;
>> +    int index;
>> +
>> +    for (index = first; index < first + num_stripe; index++) {
>> +        u64 read_wait;
>> +        u64 avg_wait = 0;
>> +        unsigned long read_ios;
>> +        struct btrfs_device *device = map->stripes[index].dev;
>> +
>> +        read_wait = part_stat_read(device->bdev, nsecs[READ]);
>> +        read_ios = part_stat_read(device->bdev, ios[READ]);
>> +
>> +        if (read_wait && read_ios && read_wait >= read_ios)
>> +            avg_wait = div_u64(read_wait, read_ios);
>> +        else
>> +            btrfs_debug_rl(fs_info,
>> +            "devid: %llu avg_wait ZERO read_wait %llu read_ios %lu",
>> +                       device->devid, read_wait, read_ios);
> 
> I do not think we need this debug message.
> 
> The device can have no read so far.
> 

Um. Yeah, we can remove it.

>> +
>> +        if (est_wait == 0 || est_wait > avg_wait) {
> 
> You can give @est_wait a default value of U64_MAX, so that you do not
> need to check for 0, just est_wait > avg_wait is enough.
> 

Fixed in v2.
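
To illustrate the two review points together (best_wait naming and the
U64_MAX default), here is a minimal userspace sketch of the selection loop.
It is not the v2 code: struct dev_read_stats and both function names are
hypothetical stand-ins for the counters the patch reads with
part_stat_read(), and the part of the loop that records the best index is
assumed, since the quoted hunk stops before it.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-device counters used by the patch. */
struct dev_read_stats {
	uint64_t read_wait_ns; /* cumulative read time, like nsecs[READ] */
	uint64_t read_ios;     /* completed reads, like ios[READ] */
};

/* Average wait per read; 0 when the device has served no reads yet. */
static uint64_t avg_read_wait(const struct dev_read_stats *s)
{
	if (s->read_ios == 0)
		return 0;
	return s->read_wait_ns / s->read_ios;
}

/*
 * Return the index in [first, first + num_stripes) with the lowest average
 * read wait. Starting best_wait at UINT64_MAX removes the special case for
 * 0, so a single comparison is enough. Ties keep the earlier index.
 */
static int pick_lowest_latency(const struct dev_read_stats *stats,
			       int first, int num_stripes)
{
	uint64_t best_wait = UINT64_MAX;
	int best_stripe = first;

	for (int index = first; index < first + num_stripes; index++) {
		uint64_t avg_wait = avg_read_wait(&stats[index]);

		if (best_wait > avg_wait) {
			best_wait = avg_wait;
			best_stripe = index;
		}
	}
	return best_stripe;
}
```

In this sketch a device with no reads so far has an average of 0, so it is
picked first and begins to accumulate statistics. Whether that is the
desired behaviour is a policy choice and not something settled in this
thread.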

Thanks for the review.

- Anand


> Thanks,
> Qu


* Re: [PATCH 1/3] btrfs: introduce RAID1 round-robin read balancing
  2024-09-27 10:10   ` Qu Wenruo
@ 2024-10-11  1:21     ` Anand Jain
  0 siblings, 0 replies; 10+ messages in thread
From: Anand Jain @ 2024-10-11  1:21 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: dsterba, waxhead


>> +#ifdef CONFIG_BTRFS_DEBUG
> 
> It would be much better to utilize CONFIG_BTRFS_EXPERIMENTAL.
> CONFIG_BTRFS_DEBUG is now for pure debug purposes.

Yes, I noticed the recent patch that changed that.
Fixed in v2.

Thanks, Anand



end of thread, other threads:[~2024-10-11  1:21 UTC | newest]

Thread overview: 10+ messages
2024-09-27  9:55 [PATCH 0/3] raid1 balancing methods Anand Jain
2024-09-27  9:55 ` [PATCH 1/3] btrfs: introduce RAID1 round-robin read balancing Anand Jain
2024-09-27 10:10   ` Qu Wenruo
2024-10-11  1:21     ` Anand Jain
2024-09-27  9:55 ` [PATCH 2/3] btrfs: use the path with the lowest latency for RAID1 reads Anand Jain
2024-09-27 10:25   ` Qu Wenruo
2024-10-11  1:21     ` Anand Jain
2024-09-27  9:55 ` [PATCH 3/3] btrfs: add RAID1 preferred read device feature Anand Jain
2024-10-04 10:44 ` [PATCH 0/3] raid1 balancing methods Yuwei Han
2024-10-11  1:20   ` Anand Jain
