linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/3] raid1 balancing methods
@ 2024-10-11  2:49 Anand Jain
  2024-10-11  2:49 ` [PATCH v2 1/3] btrfs: introduce RAID1 round-robin read balancing Anand Jain
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Anand Jain @ 2024-10-11  2:49 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, wqu, hrx, waxhead

v2:
1. Move new features to CONFIG_BTRFS_EXPERIMENTAL instead of CONFIG_BTRFS_DEBUG.
2. Correct the typo from %est_wait to %best_wait.
3. Initialize %best_wait to U64_MAX and remove the check for 0.
4. Implement rotation with a minimum contiguous read threshold before
   switching to the next stripe. Configure this using:

        echo rotation:[min_contiguous_read] > /sys/fs/btrfs/<uuid>/read_policy

   The default value is the sector size, and the min_contiguous_read
   value must be a multiple of the sector size.

5. Tested FIO random read/write and defrag compression workloads with
   min_contiguous_read set to sector size, 192k, and 256k.

   The rotation RAID1 balancing method is better for multi-process
   workloads such as fio as well as for single-process workloads such as
   defragmentation.

     $ fio --filename=/btrfs/foo --size=5Gi --direct=1 --rw=randrw --bs=4k \
        --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 \
        --time_based --group_reporting --name=iops-test-job --eta-newline=1


|         |            |            | Read I/O count  |
|         | Read       | Write      | devid1 | devid2 |
|---------|------------|------------|--------|--------|
| pid     | 20.3MiB/s  | 20.5MiB/s  | 313895 | 313895 |
| rotation|            |            |        |        |
|     4096| 20.4MiB/s  | 20.5MiB/s  | 313895 | 313895 |
|   196608| 20.2MiB/s  | 20.2MiB/s  | 310152 | 310175 |
|   262144| 20.3MiB/s  | 20.4MiB/s  | 312180 | 312191 |
|  latency| 18.4MiB/s  | 18.4MiB/s  | 272980 | 291683 |
| devid:1 | 14.8MiB/s  | 14.9MiB/s  | 456376 | 0      |

   The rotation RAID1 balancing method performs more than 2x better for
   single-process defrag.

      $ time -p btrfs filesystem defrag -r -f -c /btrfs


|         | Time  | Read I/O Count  |
|         | Real  | devid1 | devid2 |
|---------|-------|--------|--------|
| pid     | 18.00s| 3800   | 0      |
| rotation|       |        |        |
|     4096|  8.95s| 1900   | 1901   |
|   196608|  8.50s| 1881   | 1919   |
|   262144|  8.80s| 1881   | 1919   |
| latency | 17.18s| 3800   | 0      |
| devid:2 | 17.48s| 0      | 3800   |

Rotation keeps all devices active, and for now, the Rotation RAID1
balancing method should be the default. More workload testing is needed
while the code is EXPERIMENTAL.

Latency copes better when the block layer transport is failing or unstable.

Both techniques need independent testing across workloads, with the goal of
eventually merging them into a unified heuristic in the long term.

Devid is a hands-on approach that provides manual or user-space script control.

These RAID1 balancing methods are tunable via the sysfs knob.
The mount -o option and btrfs properties are under consideration.

Thx.

--------- original v1 ------------

The RAID1 balancing methods help distribute read I/O across devices, and
this patch set introduces three balancing methods: rotation, latency, and
devid. These methods are enabled under the `CONFIG_BTRFS_DEBUG` config
option and are on top of the previously added
`/sys/fs/btrfs/<UUID>/read_policy` interface to configure the desired
RAID1 read balancing method.

I've tested these patches using fio and filesystem defragmentation
workloads on a two-device RAID1 setup (with both data and metadata
mirrored across identical devices). I tracked device read counts by
extracting stats from `/sys/devices/<..>/stat` for each device. Below is
a summary of the results, with each result the average of three
iterations.

A typical generic random rw workload:

$ fio --filename=/btrfs/foo --size=10Gi --direct=1 --rw=randrw --bs=4k \
  --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based \
  --group_reporting --name=iops-test-job --eta-newline=1

|         |            |            | Read I/O count  |
|         | Read       | Write      | devid1 | devid2 |
|---------|------------|------------|--------|--------|
| pid     | 29.4MiB/s  | 29.5MiB/s  | 456548 | 447975 |
| rotation| 29.3MiB/s  | 29.3MiB/s  | 450105 | 450055 |
| latency | 21.9MiB/s  | 21.9MiB/s  | 672387 | 0      |
| devid:1 | 22.0MiB/s  | 22.0MiB/s  | 674788 | 0      |

Defragmentation with compression workload:

$ xfs_io -f -d -c 'pwrite -S 0xab 0 1G' /btrfs/foo
$ sync
$ echo 3 > /proc/sys/vm/drop_caches
$ btrfs filesystem defrag -f -c /btrfs/foo

|         | Time  | Read I/O Count  |
|         | Real  | devid1 | devid2 |
|---------|-------|--------|--------|
| pid     | 21.61s| 3810   | 0      |
| rotation| 11.55s| 1905   | 1905   |
| latency | 20.99s| 0      | 3810   |
| devid:2 | 21.41s| 0      | 3810   |

. The PID-based balancing method works well for the generic random rw fio
  workload.
. The rotation method is ideal when you want to keep both devices active,
  and it boosts performance in sequential defragmentation scenarios.
. The latency-based method works well with mixed device types, or when
  one device experiences intermittent I/O failures: its latency
  increases, and the method automatically picks the other device for
  further read I/Os.
. The devid method is a more hands-on approach, useful for diagnosing and
  testing RAID1 mirror synchronizations.

Anand Jain (3):
  btrfs: introduce RAID1 round-robin read balancing
  btrfs: use the path with the lowest latency for RAID1 reads
  btrfs: add RAID1 preferred read device

 fs/btrfs/disk-io.c |   4 ++
 fs/btrfs/sysfs.c   | 116 +++++++++++++++++++++++++++++++++++++++------
 fs/btrfs/volumes.c | 109 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  16 +++++++
 4 files changed, 230 insertions(+), 15 deletions(-)

-- 
2.46.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 1/3] btrfs: introduce RAID1 round-robin read balancing
  2024-10-11  2:49 [PATCH v2 0/3] raid1 balancing methods Anand Jain
@ 2024-10-11  2:49 ` Anand Jain
  2024-10-11  2:49 ` [PATCH v2 2/3] btrfs: use the path with the lowest latency for RAID1 reads Anand Jain
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2024-10-11  2:49 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, wqu, hrx, waxhead

This feature balances I/O across the striped devices when reading from
RAID1 blocks.

   echo rotation:[min_contiguous_read] > /sys/fs/btrfs/<uuid>/read_policy

The default value of min_contiguous_read is equal to the sectorsize.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/disk-io.c |  3 ++
 fs/btrfs/sysfs.c   | 88 ++++++++++++++++++++++++++++++++++++++--------
 fs/btrfs/volumes.c | 53 ++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  9 +++++
 4 files changed, 138 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4ad5db619b00..5b157f407e0a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3320,6 +3320,9 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 
 	fs_info->nodesize = nodesize;
 	fs_info->sectorsize = sectorsize;
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	fs_info->fs_devices->min_contiguous_read = sectorsize;
+#endif
 	fs_info->sectorsize_bits = ilog2(sectorsize);
 	fs_info->sectors_per_page = (PAGE_SIZE >> fs_info->sectorsize_bits);
 	fs_info->csums_per_leaf = BTRFS_MAX_ITEM_SIZE(fs_info) / fs_info->csum_size;
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index b843308e2bc6..bacb2871109b 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1305,7 +1305,11 @@ static ssize_t btrfs_temp_fsid_show(struct kobject *kobj,
 }
 BTRFS_ATTR(, temp_fsid, btrfs_temp_fsid_show);
 
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+static const char * const btrfs_read_policy_name[] = { "pid", "rotation" };
+#else
 static const char * const btrfs_read_policy_name[] = { "pid" };
+#endif
 
 static ssize_t btrfs_read_policy_show(struct kobject *kobj,
 				      struct kobj_attribute *a, char *buf)
@@ -1316,14 +1320,22 @@ static ssize_t btrfs_read_policy_show(struct kobject *kobj,
 	int i;
 
 	for (i = 0; i < BTRFS_NR_READ_POLICY; i++) {
-		if (policy == i)
-			ret += sysfs_emit_at(buf, ret, "%s[%s]",
-					 (ret == 0 ? "" : " "),
-					 btrfs_read_policy_name[i]);
-		else
-			ret += sysfs_emit_at(buf, ret, "%s%s",
-					 (ret == 0 ? "" : " "),
-					 btrfs_read_policy_name[i]);
+		if (ret != 0)
+			ret += sysfs_emit_at(buf, ret, " ");
+
+		if (i == policy)
+			ret += sysfs_emit_at(buf, ret, "[");
+
+		ret += sysfs_emit_at(buf, ret, "%s", btrfs_read_policy_name[i]);
+
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+		if (i == BTRFS_READ_POLICY_ROTATION)
+			ret += sysfs_emit_at(buf, ret, ":%d",
+					     fs_devices->min_contiguous_read);
+#endif
+
+		if (i == policy)
+			ret += sysfs_emit_at(buf, ret, "]");
 	}
 
 	ret += sysfs_emit_at(buf, ret, "\n");
@@ -1336,21 +1348,67 @@ static ssize_t btrfs_read_policy_store(struct kobject *kobj,
 				       const char *buf, size_t len)
 {
 	struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj);
+	int index = -1;
 	int i;
 
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	char *value = strchr(buf, ':');
+
+	/* Separate value from input in policy:value format. */
+	if (value) {
+		*value = '\0';
+		value++;
+	}
+#endif
+
 	for (i = 0; i < BTRFS_NR_READ_POLICY; i++) {
 		if (sysfs_streq(buf, btrfs_read_policy_name[i])) {
-			if (i != READ_ONCE(fs_devices->read_policy)) {
-				WRITE_ONCE(fs_devices->read_policy, i);
-				btrfs_info(fs_devices->fs_info,
-					   "read policy set to '%s'",
-					   btrfs_read_policy_name[i]);
+			index = i;
+			break;
+		}
+	}
+
+	if (index == -1)
+		return -EINVAL;
+
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	if (index == BTRFS_READ_POLICY_ROTATION) {
+		int value_rota = fs_devices->fs_info->sectorsize;
+
+		if (value) {
+			if (kstrtoint(value, 10, &value_rota))
+				return -EINVAL;
+
+			if (value_rota % fs_devices->fs_info->sectorsize != 0) {
+				btrfs_err(fs_devices->fs_info,
+"read_policy: min_contiguous_read %d should be multiples of the sectorsize %u",
+					  value_rota,
+					  fs_devices->fs_info->sectorsize);
+				return -EINVAL;
 			}
-			return len;
 		}
+
+		if (index != READ_ONCE(fs_devices->read_policy) ||
+		    value_rota != READ_ONCE(fs_devices->min_contiguous_read)) {
+			WRITE_ONCE(fs_devices->read_policy, index);
+			WRITE_ONCE(fs_devices->min_contiguous_read, value_rota);
+			atomic_set(&fs_devices->total_reads, 0);
+
+			btrfs_info(fs_devices->fs_info, "read policy set to '%s:%d'",
+				   btrfs_read_policy_name[index], value_rota);
+
+		}
+
+		return len;
+	}
+#endif
+	if (index != READ_ONCE(fs_devices->read_policy)) {
+		WRITE_ONCE(fs_devices->read_policy, index);
+		btrfs_info(fs_devices->fs_info, "read policy set to '%s'",
+			   btrfs_read_policy_name[index]);
 	}
 
-	return -EINVAL;
+	return len;
 }
 BTRFS_ATTR_RW(, read_policy, btrfs_read_policy_show, btrfs_read_policy_store);
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dc9f54849f39..ec5dbe69ba2c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5962,6 +5962,54 @@ unsigned long btrfs_full_stripe_len(struct btrfs_fs_info *fs_info,
 	return len;
 }
 
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+struct stripe_mirror {
+	u64 devid;
+	int num;
+};
+
+static int btrfs_cmp_devid(const void *a, const void *b)
+{
+	struct stripe_mirror *s1 = (struct stripe_mirror *)a;
+	struct stripe_mirror *s2 = (struct stripe_mirror *)b;
+
+	if (s1->devid < s2->devid)
+		return -1;
+	if (s1->devid > s2->devid)
+		return 1;
+	return 0;
+}
+
+static int btrfs_read_rotation(struct btrfs_chunk_map *map, int first,
+			       int num_stripe)
+{
+	struct stripe_mirror stripes[4] = {0}; //4: max possible mirrors
+	struct btrfs_fs_devices *fs_devices = map->stripes[first].dev->fs_devices;
+	int j;
+	int slot;
+	int index;
+	int ret_stripe;
+	int total_reads;
+	int reads_per_dev = fs_devices->min_contiguous_read/
+						fs_devices->fs_info->sectorsize;
+
+	index = 0;
+	for (j = first; j < first + num_stripe; j++) {
+		stripes[index].devid = map->stripes[j].dev->devid;
+		stripes[index].num = j;
+		index++;
+	}
+	sort(stripes, num_stripe, sizeof(struct stripe_mirror),
+	     btrfs_cmp_devid, NULL);
+
+	total_reads = atomic_inc_return(&fs_devices->total_reads);
+	slot = total_reads/reads_per_dev;
+	ret_stripe = stripes[slot % num_stripe].num;
+
+	return ret_stripe;
+}
+#endif
+
 static int find_live_mirror(struct btrfs_fs_info *fs_info,
 			    struct btrfs_chunk_map *map, int first,
 			    int dev_replace_is_ongoing)
@@ -5991,6 +6039,11 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	case BTRFS_READ_POLICY_PID:
 		preferred_mirror = first + (current->pid % num_stripes);
 		break;
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	case BTRFS_READ_POLICY_ROTATION:
+		preferred_mirror = btrfs_read_rotation(map, first, num_stripes);
+		break;
+#endif
 	}
 
 	if (dev_replace_is_ongoing &&
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 3a416b1bc24c..0db754a4b13d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -303,6 +303,10 @@ enum btrfs_chunk_allocation_policy {
 enum btrfs_read_policy {
 	/* Use process PID to choose the stripe */
 	BTRFS_READ_POLICY_PID,
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	/* Balancing raid1 reads across all striped devices */
+	BTRFS_READ_POLICY_ROTATION,
+#endif
 	BTRFS_NR_READ_POLICY,
 };
 
@@ -431,6 +435,11 @@ struct btrfs_fs_devices {
 	enum btrfs_read_policy read_policy;
 
 #ifdef CONFIG_BTRFS_EXPERIMENTAL
+	/* IO stat, read counter. */
+	atomic_t total_reads;
+	/* Min contiguous reads before switching to next device. */
+	int min_contiguous_read;
+
 	/* Checksum mode - offload it or do it synchronously. */
 	enum btrfs_offload_csum_mode offload_csum_mode;
 #endif
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 2/3] btrfs: use the path with the lowest latency for RAID1 reads
  2024-10-11  2:49 [PATCH v2 0/3] raid1 balancing methods Anand Jain
  2024-10-11  2:49 ` [PATCH v2 1/3] btrfs: introduce RAID1 round-robin read balancing Anand Jain
@ 2024-10-11  2:49 ` Anand Jain
  2024-10-11  2:49 ` [PATCH v2 3/3] btrfs: add RAID1 preferred read device Anand Jain
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2024-10-11  2:49 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, wqu, hrx, waxhead

This feature aims to direct the read I/O to the device with the lowest
known latency for reading RAID1 blocks.

echo "latency" > /sys/fs/btrfs/<UUID>/read_policy

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/sysfs.c   |  2 +-
 fs/btrfs/volumes.c | 36 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  2 ++
 3 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index bacb2871109b..9f506d46a94c 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1306,7 +1306,7 @@ static ssize_t btrfs_temp_fsid_show(struct kobject *kobj,
 BTRFS_ATTR(, temp_fsid, btrfs_temp_fsid_show);
 
 #ifdef CONFIG_BTRFS_EXPERIMENTAL
-static const char * const btrfs_read_policy_name[] = { "pid", "rotation" };
+static const char * const btrfs_read_policy_name[] = { "pid", "rotation", "latency" };
 #else
 static const char * const btrfs_read_policy_name[] = { "pid" };
 #endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ec5dbe69ba2c..8912ee1d8b54 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -12,6 +12,9 @@
 #include <linux/uuid.h>
 #include <linux/list_sort.h>
 #include <linux/namei.h>
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+#include <linux/part_stat.h>
+#endif
 #include "misc.h"
 #include "ctree.h"
 #include "disk-io.h"
@@ -5963,6 +5966,35 @@ unsigned long btrfs_full_stripe_len(struct btrfs_fs_info *fs_info,
 }
 
 #ifdef CONFIG_BTRFS_EXPERIMENTAL
+static int btrfs_best_stripe(struct btrfs_fs_info *fs_info,
+			     struct btrfs_chunk_map *map, int first,
+			     int num_stripe)
+{
+	u64 best_wait = U64_MAX;
+	int best_stripe = 0;
+	int index;
+
+	for (index = first; index < first + num_stripe; index++) {
+		u64 read_wait;
+		u64 avg_wait = 0;
+		unsigned long read_ios;
+		struct btrfs_device *device = map->stripes[index].dev;
+
+		read_wait = part_stat_read(device->bdev, nsecs[READ]);
+		read_ios = part_stat_read(device->bdev, ios[READ]);
+
+		if (read_wait && read_ios && read_wait >= read_ios)
+			avg_wait = div_u64(read_wait, read_ios);
+
+		if (best_wait > avg_wait) {
+			best_wait = avg_wait;
+			best_stripe = index;
+		}
+	}
+
+	return best_stripe;
+}
+
 struct stripe_mirror {
 	u64 devid;
 	int num;
@@ -6043,6 +6075,10 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	case BTRFS_READ_POLICY_ROTATION:
 		preferred_mirror = btrfs_read_rotation(map, first, num_stripes);
 		break;
+	case BTRFS_READ_POLICY_LATENCY:
+		preferred_mirror = btrfs_best_stripe(fs_info, map, first,
+								num_stripes);
+		break;
 #endif
 	}
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 0db754a4b13d..f9c744b87b61 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -306,6 +306,8 @@ enum btrfs_read_policy {
 #ifdef CONFIG_BTRFS_EXPERIMENTAL
 	/* Balancing raid1 reads across all striped devices */
 	BTRFS_READ_POLICY_ROTATION,
+	/* Use the lowest-latency device dynamically */
+	BTRFS_READ_POLICY_LATENCY,
 #endif
 	BTRFS_NR_READ_POLICY,
 };
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 3/3] btrfs: add RAID1 preferred read device
  2024-10-11  2:49 [PATCH v2 0/3] raid1 balancing methods Anand Jain
  2024-10-11  2:49 ` [PATCH v2 1/3] btrfs: introduce RAID1 round-robin read balancing Anand Jain
  2024-10-11  2:49 ` [PATCH v2 2/3] btrfs: use the path with the lowest latency for RAID1 reads Anand Jain
@ 2024-10-11  2:49 ` Anand Jain
  2024-10-11  3:35 ` [PATCH v2 0/3] raid1 balancing methods Anand Jain
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2024-10-11  2:49 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, wqu, hrx, waxhead

When there's stale data on a mirrored device, this feature lets you choose
which device to read from. Mainly used for testing.

echo "devid:2" > /sys/fs/btrfs/<UUID>/read_policy

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/disk-io.c |  1 +
 fs/btrfs/sysfs.c   | 32 ++++++++++++++++++++++++++++++--
 fs/btrfs/volumes.c | 20 ++++++++++++++++++++
 fs/btrfs/volumes.h |  5 +++++
 4 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5b157f407e0a..0e7b29282136 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3322,6 +3322,7 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	fs_info->sectorsize = sectorsize;
 #ifdef CONFIG_BTRFS_EXPERIMENTAL
 	fs_info->fs_devices->min_contiguous_read = sectorsize;
+	fs_info->fs_devices->read_devid = fs_info->fs_devices->latest_dev->devid;
 #endif
 	fs_info->sectorsize_bits = ilog2(sectorsize);
 	fs_info->sectors_per_page = (PAGE_SIZE >> fs_info->sectorsize_bits);
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 9f506d46a94c..aa4c9cbaa61f 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1306,7 +1306,7 @@ static ssize_t btrfs_temp_fsid_show(struct kobject *kobj,
 BTRFS_ATTR(, temp_fsid, btrfs_temp_fsid_show);
 
 #ifdef CONFIG_BTRFS_EXPERIMENTAL
-static const char * const btrfs_read_policy_name[] = { "pid", "rotation", "latency" };
+static const char * const btrfs_read_policy_name[] = { "pid", "rotation", "latency", "devid" };
 #else
 static const char * const btrfs_read_policy_name[] = { "pid" };
 #endif
@@ -1332,8 +1332,11 @@ static ssize_t btrfs_read_policy_show(struct kobject *kobj,
 		if (i == BTRFS_READ_POLICY_ROTATION)
 			ret += sysfs_emit_at(buf, ret, ":%d",
 					     fs_devices->min_contiguous_read);
-#endif
 
+		if (i == BTRFS_READ_POLICY_DEVID)
+			ret += sysfs_emit_at(buf, ret, ":%llu",
+							fs_devices->read_devid);
+#endif
 		if (i == policy)
 			ret += sysfs_emit_at(buf, ret, "]");
 	}
@@ -1401,7 +1404,32 @@ static ssize_t btrfs_read_policy_store(struct kobject *kobj,
 
 		return len;
 	}
+
+	if (index == BTRFS_READ_POLICY_DEVID) {
+		u64 value_devid;
+		BTRFS_DEV_LOOKUP_ARGS(args);
+
+		if (value == NULL || kstrtou64(value, 10, &value_devid))
+			return -EINVAL;
+
+		args.devid = value_devid;
+		if (btrfs_find_device(fs_devices, &args) == NULL)
+			return -EINVAL;
+
+		if (index != READ_ONCE(fs_devices->read_policy) ||
+		    (value_devid != READ_ONCE(fs_devices->read_devid))) {
+			WRITE_ONCE(fs_devices->read_policy, index);
+			WRITE_ONCE(fs_devices->read_devid, value_devid);
+
+			btrfs_info(fs_devices->fs_info, "read policy set to '%s:%llu'",
+				   btrfs_read_policy_name[index], value_devid);
+
+		}
+
+		return len;
+	}
 #endif
+
 	if (index != READ_ONCE(fs_devices->read_policy)) {
 		WRITE_ONCE(fs_devices->read_policy, index);
 		btrfs_info(fs_devices->fs_info, "read policy set to '%s'",
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8912ee1d8b54..87a072fa9be4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5966,6 +5966,23 @@ unsigned long btrfs_full_stripe_len(struct btrfs_fs_info *fs_info,
 }
 
 #ifdef CONFIG_BTRFS_EXPERIMENTAL
+static int btrfs_read_preferred(struct btrfs_chunk_map *map, int first,
+				int num_stripe)
+{
+	int last = first + num_stripe;
+	int stripe_index;
+
+	for (stripe_index = first; stripe_index < last; stripe_index++) {
+		struct btrfs_device *device = map->stripes[stripe_index].dev;
+
+		if (device->devid == READ_ONCE(device->fs_devices->read_devid))
+			return stripe_index;
+	}
+
+	/* If no read-preferred device, use first stripe */
+	return first;
+}
+
 static int btrfs_best_stripe(struct btrfs_fs_info *fs_info,
 			     struct btrfs_chunk_map *map, int first,
 			     int num_stripe)
@@ -6079,6 +6096,9 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 		preferred_mirror = btrfs_best_stripe(fs_info, map, first,
 								num_stripes);
 		break;
+	case BTRFS_READ_POLICY_DEVID:
+		preferred_mirror = btrfs_read_preferred(map, first, num_stripes);
+		break;
 #endif
 	}
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index f9c744b87b61..b5ade9d41fe7 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -308,6 +308,8 @@ enum btrfs_read_policy {
 	BTRFS_READ_POLICY_ROTATION,
 	/* Use the lowest-latency device dynamically */
 	BTRFS_READ_POLICY_LATENCY,
+	/* Read from the specific device */
+	BTRFS_READ_POLICY_DEVID,
 #endif
 	BTRFS_NR_READ_POLICY,
 };
@@ -442,6 +444,9 @@ struct btrfs_fs_devices {
 	/* Min contiguous reads before switching to next device. */
 	int min_contiguous_read;
 
+	/* Device to be used for reading in case of RAID1. */
+	u64 read_devid;
+
 	/* Checksum mode - offload it or do it synchronously. */
 	enum btrfs_offload_csum_mode offload_csum_mode;
 #endif
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/3] raid1 balancing methods
  2024-10-11  2:49 [PATCH v2 0/3] raid1 balancing methods Anand Jain
                   ` (2 preceding siblings ...)
  2024-10-11  2:49 ` [PATCH v2 3/3] btrfs: add RAID1 preferred read device Anand Jain
@ 2024-10-11  3:35 ` Anand Jain
  2024-10-11  4:59 ` Qu Wenruo
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2024-10-11  3:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, wqu, hrx, waxhead



On 11/10/24 8:19 am, Anand Jain wrote:
> v2:
> 1. Move new features to CONFIG_BTRFS_EXPERIMENTAL instead of CONFIG_BTRFS_DEBUG.
> 2. Correct the typo from %est_wait to %best_wait.
> 3. Initialize %best_wait to U64_MAX and remove the check for 0.
> 4. Implement rotation with a minimum contiguous read threshold before
>     switching to the next stripe. Configure this, using:
> 
>          echo rotation:[min_contiguous_read] > /sys/fs/btrfs/<uuid>/read_policy
> 
>     The default value is the sector size, and the min_contiguous_read
>     value must be a multiple of the sector size.
> 
> 5. Tested FIO random read/write and defrag compression workloads with
>     min_contiguous_read set to sector size, 192k, and 256k.
> 
>     RAID1 balancing method rotation is better for multi-process workloads
>     such as fio and also single-process workload such as defragmentation.
> 
>       $ fio --filename=/btrfs/foo --size=5Gi --direct=1 --rw=randrw --bs=4k \
>          --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 \
>          --time_based --group_reporting --name=iops-test-job --eta-newline=1
> 
> 
> |         |            |            | Read I/O count  |
> |         | Read       | Write      | devid1 | devid2 |
> |---------|------------|------------|--------|--------|
> | pid     | 20.3MiB/s  | 20.5MiB/s  | 313895 | 313895 |
> | rotation|            |            |        |        |
> |     4096| 20.4MiB/s  | 20.5MiB/s  | 313895 | 313895 |
> |   196608| 20.2MiB/s  | 20.2MiB/s  | 310152 | 310175 |
> |   262144| 20.3MiB/s  | 20.4MiB/s  | 312180 | 312191 |
> |  latency| 18.4MiB/s  | 18.4MiB/s  | 272980 | 291683 |
> | devid:1 | 14.8MiB/s  | 14.9MiB/s  | 456376 | 0      |
> 
>     rotation RAID1 balancing technique performs more than 2x better for
>     single-process defrag.
> 
>        $ time -p btrfs filesystem defrag -r -f -c /btrfs
> 
> 
> |         | Time  | Read I/O Count  |
> |         | Real  | devid1 | devid2 |
> |---------|-------|--------|--------|
> | pid     | 18.00s| 3800   | 0      |
> | rotation|       |        |        |
> |     4096|  8.95s| 1900   | 1901   |
> |   196608|  8.50s| 1881   | 1919   |
> |   262144|  8.80s| 1881   | 1919   |
> | latency | 17.18s| 3800   | 0      |
> | devid:2 | 17.48s| 0      | 3800   |
> 


Copy and paste error. Please ignore the below paragraph. Thx.
---vvv--- ignore ---vvv----
> Rotation keeps all devices active, and for now, the Rotation RAID1
> balancing method is preferable as default. More workload testing is
> needed while the code is EXPERIMENTAL.
> While Latency is better during the failing/unstable block layer transport.
> As of now these two techniques, are needed to be further independently
> tested with different worloads, and in the long term we should be merge
> these technique to a unified heuristic.
---^^^------------^^^------

> Rotation keeps all devices active, and for now, the Rotation RAID1
> balancing method should be the default. More workload testing is needed
> while the code is EXPERIMENTAL.
> 
> Latency is smarter with unstable block layer transport.
> 
> Both techniques need independent testing across workloads, with the goal of
> eventually merging them into a unified approach? for the long term.
> 
> Devid is a hands-on approach, provides manual or user-space script control.
> 
> These RAID1 balancing methods are tunable via the sysfs knob.
> The mount -o option and btrfs properties are under consideration.
> 
> Thx.
> 
> --------- original v1 ------------
> 
> The RAID1-balancing methods helps distribute read I/O across devices, and
> this patch introduces three balancing methods: rotation, latency, and
> devid. These methods are enabled under the `CONFIG_BTRFS_DEBUG` config
> option and are on top of the previously added
> `/sys/fs/btrfs/<UUID>/read_policy` interface to configure the desired
> RAID1 read balancing method.
> 
> I've tested these patches using fio and filesystem defragmentation
> workloads on a two-device RAID1 setup (with both data and metadata
> mirrored across identical devices). I tracked device read counts by
> extracting stats from `/sys/devices/<..>/stat` for each device. Below is
> a summary of the results, with each result the average of three
> iterations.
> 
> A typical generic random rw workload:
> 
> $ fio --filename=/btrfs/foo --size=10Gi --direct=1 --rw=randrw --bs=4k \
>    --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based \
>    --group_reporting --name=iops-test-job --eta-newline=1
> 
> |         |            |            | Read I/O count  |
> |         | Read       | Write      | devid1 | devid2 |
> |---------|------------|------------|--------|--------|
> | pid     | 29.4MiB/s  | 29.5MiB/s  | 456548 | 447975 |
> | rotation| 29.3MiB/s  | 29.3MiB/s  | 450105 | 450055 |
> | latency | 21.9MiB/s  | 21.9MiB/s  | 672387 | 0      |
> | devid:1 | 22.0MiB/s  | 22.0MiB/s  | 674788 | 0      |
> 
> Defragmentation with compression workload:
> 
> $ xfs_io -f -d -c 'pwrite -S 0xab 0 1G' /btrfs/foo
> $ sync
> $ echo 3 > /proc/sys/vm/drop_caches
> $ btrfs filesystem defrag -f -c /btrfs/foo
> 
> |         | Time  | Read I/O Count  |
> |         | Real  | devid1 | devid2 |
> |---------|-------|--------|--------|
> | pid     | 21.61s| 3810   | 0      |
> | rotation| 11.55s| 1905   | 1905   |
> | latency | 20.99s| 0      | 3810   |
> | devid:2 | 21.41s| 0      | 3810   |
> 
> . The PID-based balancing method works well for the generic random rw fio
>    workload.
> . The rotation method is ideal when you want to keep both devices active,
>    and it boosts performance in sequential defragmentation scenarios.
> . The latency-based method work well when we have mixed device types or
>    when one device experiences intermittent I/O failures the latency
>    increases and it automatically picks the other device for further Read
>    IOs.
> . The devid method is a more hands-on approach, useful for diagnosing and
>    testing RAID1 mirror synchronizations.
> 
> Anand Jain (3):
>    btrfs: introduce RAID1 round-robin read balancing
>    btrfs: use the path with the lowest latency for RAID1 reads
>    btrfs: add RAID1 preferred read device
> 
>   fs/btrfs/disk-io.c |   4 ++
>   fs/btrfs/sysfs.c   | 116 +++++++++++++++++++++++++++++++++++++++------
>   fs/btrfs/volumes.c | 109 ++++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/volumes.h |  16 +++++++
>   4 files changed, 230 insertions(+), 15 deletions(-)
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/3] raid1 balancing methods
  2024-10-11  2:49 [PATCH v2 0/3] raid1 balancing methods Anand Jain
                   ` (3 preceding siblings ...)
  2024-10-11  3:35 ` [PATCH v2 0/3] raid1 balancing methods Anand Jain
@ 2024-10-11  4:59 ` Qu Wenruo
  2024-10-11  6:04   ` Anand Jain
  2024-10-21 14:05 ` David Sterba
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2024-10-11  4:59 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs; +Cc: dsterba, wqu, hrx, waxhead



On 2024/10/11 13:19, Anand Jain wrote:
> v2:
> 1. Move new features to CONFIG_BTRFS_EXPERIMENTAL instead of CONFIG_BTRFS_DEBUG.
> 2. Correct the typo from %est_wait to %best_wait.
> 3. Initialize %best_wait to U64_MAX and remove the check for 0.
> 4. Implement rotation with a minimum contiguous read threshold before
>     switching to the next stripe. Configure this, using:
>
>          echo rotation:[min_contiguous_read] > /sys/fs/btrfs/<uuid>/read_policy
>
>     The default value is the sector size, and the min_contiguous_read
>     value must be a multiple of the sector size.

Overall, I'm fine with the latency and preferred device policies.

Meanwhile I'd prefer the previous version of round-robin, without the
min_contiguous_read.

That looks a little like overkill, and I think we should keep the policy
as simple as possible for now.

Mind sharing why min_contiguous_read is introduced in this update?

In the future, we should go the same way as sched_ext, by pushing the
complex policies to eBPF programs.


Another future improvement is the interface; I'm fine with the sysfs
knob for an experimental feature.

But from my last drop_subtree_threshold experience, sysfs is not going
to be a user-friendly interface. It really relies on some user-space
daemon to set it.

I'd prefer something more persistent, like an XATTR stored inside the
root tree, accessed through the prop interfaces.
But that can all be done in the future.

Thanks,
Qu
>
> 5. Tested FIO random read/write and defrag compression workloads with
>     min_contiguous_read set to sector size, 192k, and 256k.
>
>     RAID1 balancing method rotation is better for multi-process workloads
>     such as fio and also single-process workload such as defragmentation.
>
>       $ fio --filename=/btrfs/foo --size=5Gi --direct=1 --rw=randrw --bs=4k \
>          --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 \
>          --time_based --group_reporting --name=iops-test-job --eta-newline=1
>
>
> |         |            |            | Read I/O count  |
> |         | Read       | Write      | devid1 | devid2 |
> |---------|------------|------------|--------|--------|
> | pid     | 20.3MiB/s  | 20.5MiB/s  | 313895 | 313895 |
> | rotation|            |            |        |        |
> |     4096| 20.4MiB/s  | 20.5MiB/s  | 313895 | 313895 |
> |   196608| 20.2MiB/s  | 20.2MiB/s  | 310152 | 310175 |
> |   262144| 20.3MiB/s  | 20.4MiB/s  | 312180 | 312191 |
> |  latency| 18.4MiB/s  | 18.4MiB/s  | 272980 | 291683 |
> | devid:1 | 14.8MiB/s  | 14.9MiB/s  | 456376 | 0      |
>
>     rotation RAID1 balancing technique performs more than 2x better for
>     single-process defrag.
>
>        $ time -p btrfs filesystem defrag -r -f -c /btrfs
>
>
> |         | Time  | Read I/O Count  |
> |         | Real  | devid1 | devid2 |
> |---------|-------|--------|--------|
> | pid     | 18.00s| 3800   | 0      |
> | rotation|       |        |        |
> |     4096|  8.95s| 1900   | 1901   |
> |   196608|  8.50s| 1881   | 1919   |
> |   262144|  8.80s| 1881   | 1919   |
> | latency | 17.18s| 3800   | 0      |
> | devid:2 | 17.48s| 0      | 3800   |
>
> Rotation keeps all devices active, and for now, the Rotation RAID1
> balancing method should be the default. More workload testing is needed
> while the code is EXPERIMENTAL.
>
> Latency is better when the block layer transport is failing or unstable.
>
> Both techniques need further independent testing across workloads, with
> the goal of eventually merging them into a unified heuristic in the
> long term.
>
> Devid is a hands-on approach that provides manual or user-space script control.
>
> These RAID1 balancing methods are tunable via the sysfs knob.
> The mount -o option and btrfs properties are under consideration.
>
> Thx.
>
> --------- original v1 ------------
>
> The RAID1 balancing methods help distribute read I/O across devices, and
> this patch introduces three balancing methods: rotation, latency, and
> devid. These methods are enabled under the `CONFIG_BTRFS_DEBUG` config
> option and are on top of the previously added
> `/sys/fs/btrfs/<UUID>/read_policy` interface to configure the desired
> RAID1 read balancing method.
>
> I've tested these patches using fio and filesystem defragmentation
> workloads on a two-device RAID1 setup (with both data and metadata
> mirrored across identical devices). I tracked device read counts by
> extracting stats from `/sys/devices/<..>/stat` for each device. Below is
> a summary of the results, with each result the average of three
> iterations.
>
> A typical generic random rw workload:
>
> $ fio --filename=/btrfs/foo --size=10Gi --direct=1 --rw=randrw --bs=4k \
>    --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based \
>    --group_reporting --name=iops-test-job --eta-newline=1
>
> |         |            |            | Read I/O count  |
> |         | Read       | Write      | devid1 | devid2 |
> |---------|------------|------------|--------|--------|
> | pid     | 29.4MiB/s  | 29.5MiB/s  | 456548 | 447975 |
> | rotation| 29.3MiB/s  | 29.3MiB/s  | 450105 | 450055 |
> | latency | 21.9MiB/s  | 21.9MiB/s  | 672387 | 0      |
> | devid:1 | 22.0MiB/s  | 22.0MiB/s  | 674788 | 0      |
>
> Defragmentation with compression workload:
>
> $ xfs_io -f -d -c 'pwrite -S 0xab 0 1G' /btrfs/foo
> $ sync
> $ echo 3 > /proc/sys/vm/drop_caches
> $ btrfs filesystem defrag -f -c /btrfs/foo
>
> |         | Time  | Read I/O Count  |
> |         | Real  | devid1 | devid2 |
> |---------|-------|--------|--------|
> | pid     | 21.61s| 3810   | 0      |
> | rotation| 11.55s| 1905   | 1905   |
> | latency | 20.99s| 0      | 3810   |
> | devid:2 | 21.41s| 0      | 3810   |
>
> . The PID-based balancing method works well for the generic random rw fio
>    workload.
> . The rotation method is ideal when you want to keep both devices active,
>    and it boosts performance in sequential defragmentation scenarios.
> . The latency-based method works well when we have mixed device types, or
>    when one device experiences intermittent I/O failures: the latency
>    increases and it automatically picks the other device for further read
>    I/Os.
> . The devid method is a more hands-on approach, useful for diagnosing and
>    testing RAID1 mirror synchronizations.
>
> Anand Jain (3):
>    btrfs: introduce RAID1 round-robin read balancing
>    btrfs: use the path with the lowest latency for RAID1 reads
>    btrfs: add RAID1 preferred read device
>
>   fs/btrfs/disk-io.c |   4 ++
>   fs/btrfs/sysfs.c   | 116 +++++++++++++++++++++++++++++++++++++++------
>   fs/btrfs/volumes.c | 109 ++++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/volumes.h |  16 +++++++
>   4 files changed, 230 insertions(+), 15 deletions(-)
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/3] raid1 balancing methods
  2024-10-11  4:59 ` Qu Wenruo
@ 2024-10-11  6:04   ` Anand Jain
  0 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2024-10-11  6:04 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: dsterba, wqu, hrx, waxhead



On 11/10/24 10:29 am, Qu Wenruo wrote:
> 
> 
> On 2024/10/11 13:19, Anand Jain wrote:
>> v2:
>> 1. Move new features to CONFIG_BTRFS_EXPERIMENTAL instead of 
>> CONFIG_BTRFS_DEBUG.
>> 2. Correct the typo from %est_wait to %best_wait.
>> 3. Initialize %best_wait to U64_MAX and remove the check for 0.
>> 4. Implement rotation with a minimum contiguous read threshold before
>>     switching to the next stripe. Configure this, using:
>>
>>          echo rotation:[min_contiguous_read] > /sys/fs/btrfs/<uuid>/read_policy
>>
>>     The default value is the sector size, and the min_contiguous_read
>>     value must be a multiple of the sector size.
> 
> Overall, I'm fine with the latency and preferred device policies.
> 
> Meanwhile I'd prefer the previous version of round-robin, without the
> min_contiguous_read.
> 
> That looks a little overkilled, and I think we should keep the policy as
> simple as possible for now.
> 
> Mind to share why the min_contiguous_read is introduced in this update?
> 

The reason for adding min_contiguous_read:
The block layer merges bios to improve HDD performance. David mentioned
on Slack that contiguous reads of 192k to 256k can perform better;
I haven't seen this in my setup, but it may work in others.
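To make the threshold idea concrete, here is a rough user-space model of
round-robin with a minimum contiguous read size (a hypothetical sketch,
not the actual kernel code; the names rr_state and rr_pick_stripe are
illustrative):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: keep issuing reads to the current mirror until at
 * least min_contiguous_read bytes have gone to it, so the block layer
 * can merge adjacent bios, then rotate to the next mirror.
 */
struct rr_state {
	uint32_t cur_stripe;	/* mirror currently preferred */
	uint64_t issued;	/* bytes issued to cur_stripe so far */
};

static uint32_t rr_pick_stripe(struct rr_state *st, uint64_t read_len,
			       uint32_t num_stripes,
			       uint64_t min_contiguous_read)
{
	if (st->issued >= min_contiguous_read) {
		st->cur_stripe = (st->cur_stripe + 1) % num_stripes;
		st->issued = 0;
	}
	st->issued += read_len;
	return st->cur_stripe;
}
```

With a 256k threshold, a stream of 4k reads would stay on one mirror for
64 requests before rotating, which is where the merge-friendliness comes
from.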

> In the future, we should go the same method as sched_ext, by pushing the
> complex policies to eBPF programs.

External scripts for RAID1 balancing are achievable with BPF, though
that requires writable BPF, which is disabled in some cases. That said,
we should still prioritize adding support and let the use case decide.

> Another future improvement is the interface, I'm fine with the sysfs
> knob for an experimental feature.

Yes, we need to review the tunables (mount options, sysfs, and
btrfs properties) to arrive at clear guidelines and consolidation.

> But from my last drop_subtree_threshold experience, sysfs is not going
> to be a user-friendly interface. It really relies on some user space
> daemon to set.
> 

Agreed. However, for Btrfs, sysfs has been the most comprehensive so far.

> I'd prefer something more persistent, like some XATTR but inside root
> tree, and go with prop interfaces.
> But that can all be done in the future.
>

Absolutely. I included that in earlier experiments, but it was removed
due to review comments. Now isn't the right time to reintroduce it; we
can update the on-disk format and xattrs once the in-memory tunables
graduate and address specific use cases.

Thanks, Anand

> Thanks,
> Qu
>>
>> 5. Tested FIO random read/write and defrag compression workloads with
>>     min_contiguous_read set to sector size, 192k, and 256k.
>>
>>     RAID1 balancing method rotation is better for multi-process workloads
>>     such as fio and also single-process workload such as defragmentation.
>>
>>       $ fio --filename=/btrfs/foo --size=5Gi --direct=1 --rw=randrw --bs=4k \
>>          --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 \
>>          --time_based --group_reporting --name=iops-test-job --eta-newline=1
>>
>>
>> |         |            |            | Read I/O count  |
>> |         | Read       | Write      | devid1 | devid2 |
>> |---------|------------|------------|--------|--------|
>> | pid     | 20.3MiB/s  | 20.5MiB/s  | 313895 | 313895 |
>> | rotation|            |            |        |        |
>> |     4096| 20.4MiB/s  | 20.5MiB/s  | 313895 | 313895 |
>> |   196608| 20.2MiB/s  | 20.2MiB/s  | 310152 | 310175 |
>> |   262144| 20.3MiB/s  | 20.4MiB/s  | 312180 | 312191 |
>> |  latency| 18.4MiB/s  | 18.4MiB/s  | 272980 | 291683 |
>> | devid:1 | 14.8MiB/s  | 14.9MiB/s  | 456376 | 0      |
>>
>>     rotation RAID1 balancing technique performs more than 2x better for
>>     single-process defrag.
>>
>>        $ time -p btrfs filesystem defrag -r -f -c /btrfs
>>
>>
>> |         | Time  | Read I/O Count  |
>> |         | Real  | devid1 | devid2 |
>> |---------|-------|--------|--------|
>> | pid     | 18.00s| 3800   | 0      |
>> | rotation|       |        |        |
>> |     4096|  8.95s| 1900   | 1901   |
>> |   196608|  8.50s| 1881   | 1919   |
>> |   262144|  8.80s| 1881   | 1919   |
>> | latency | 17.18s| 3800   | 0      |
>> | devid:2 | 17.48s| 0      | 3800   |
>>
>> Rotation keeps all devices active, and for now, the Rotation RAID1
>> balancing method should be the default. More workload testing is needed
>> while the code is EXPERIMENTAL.
>>
>> Latency is better when the block layer transport is failing or unstable.
>>
>> Both techniques need further independent testing across workloads, with
>> the goal of eventually merging them into a unified heuristic in the
>> long term.
>>
>> Devid is a hands-on approach that provides manual or user-space script control.
>>
>> These RAID1 balancing methods are tunable via the sysfs knob.
>> The mount -o option and btrfs properties are under consideration.
>>
>> Thx.
>>
>> --------- original v1 ------------
>>
>> The RAID1 balancing methods help distribute read I/O across devices, and
>> this patch introduces three balancing methods: rotation, latency, and
>> devid. These methods are enabled under the `CONFIG_BTRFS_DEBUG` config
>> option and are on top of the previously added
>> `/sys/fs/btrfs/<UUID>/read_policy` interface to configure the desired
>> RAID1 read balancing method.
>>
>> I've tested these patches using fio and filesystem defragmentation
>> workloads on a two-device RAID1 setup (with both data and metadata
>> mirrored across identical devices). I tracked device read counts by
>> extracting stats from `/sys/devices/<..>/stat` for each device. Below is
>> a summary of the results, with each result the average of three
>> iterations.
>>
>> A typical generic random rw workload:
>>
>> $ fio --filename=/btrfs/foo --size=10Gi --direct=1 --rw=randrw --bs=4k \
>>    --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based \
>>    --group_reporting --name=iops-test-job --eta-newline=1
>>
>> |         |            |            | Read I/O count  |
>> |         | Read       | Write      | devid1 | devid2 |
>> |---------|------------|------------|--------|--------|
>> | pid     | 29.4MiB/s  | 29.5MiB/s  | 456548 | 447975 |
>> | rotation| 29.3MiB/s  | 29.3MiB/s  | 450105 | 450055 |
>> | latency | 21.9MiB/s  | 21.9MiB/s  | 672387 | 0      |
>> | devid:1 | 22.0MiB/s  | 22.0MiB/s  | 674788 | 0      |
>>
>> Defragmentation with compression workload:
>>
>> $ xfs_io -f -d -c 'pwrite -S 0xab 0 1G' /btrfs/foo
>> $ sync
>> $ echo 3 > /proc/sys/vm/drop_caches
>> $ btrfs filesystem defrag -f -c /btrfs/foo
>>
>> |         | Time  | Read I/O Count  |
>> |         | Real  | devid1 | devid2 |
>> |---------|-------|--------|--------|
>> | pid     | 21.61s| 3810   | 0      |
>> | rotation| 11.55s| 1905   | 1905   |
>> | latency | 20.99s| 0      | 3810   |
>> | devid:2 | 21.41s| 0      | 3810   |
>>
>> . The PID-based balancing method works well for the generic random rw fio
>>    workload.
>> . The rotation method is ideal when you want to keep both devices active,
>>    and it boosts performance in sequential defragmentation scenarios.
>> . The latency-based method works well when we have mixed device types, or
>>    when one device experiences intermittent I/O failures: the latency
>>    increases and it automatically picks the other device for further read
>>    I/Os.
>> . The devid method is a more hands-on approach, useful for diagnosing and
>>    testing RAID1 mirror synchronizations.
>>
>> Anand Jain (3):
>>    btrfs: introduce RAID1 round-robin read balancing
>>    btrfs: use the path with the lowest latency for RAID1 reads
>>    btrfs: add RAID1 preferred read device
>>
>>   fs/btrfs/disk-io.c |   4 ++
>>   fs/btrfs/sysfs.c   | 116 +++++++++++++++++++++++++++++++++++++++------
>>   fs/btrfs/volumes.c | 109 ++++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/volumes.h |  16 +++++++
>>   4 files changed, 230 insertions(+), 15 deletions(-)
>>
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/3] raid1 balancing methods
  2024-10-11  2:49 [PATCH v2 0/3] raid1 balancing methods Anand Jain
                   ` (4 preceding siblings ...)
  2024-10-11  4:59 ` Qu Wenruo
@ 2024-10-21 14:05 ` David Sterba
  2024-10-21 15:36   ` Anand Jain
  2024-10-21 14:32 ` waxhead
  2024-10-24  4:39 ` Qu Wenruo
  7 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2024-10-21 14:05 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs, dsterba, wqu, hrx, waxhead

On Fri, Oct 11, 2024 at 10:49:15AM +0800, Anand Jain wrote:
> v2:
> 1. Move new features to CONFIG_BTRFS_EXPERIMENTAL instead of CONFIG_BTRFS_DEBUG.
> 2. Correct the typo from %est_wait to %best_wait.
> 3. Initialize %best_wait to U64_MAX and remove the check for 0.
> 4. Implement rotation with a minimum contiguous read threshold before
>    switching to the next stripe. Configure this, using:
> 
>         echo rotation:[min_contiguous_read] > /sys/fs/btrfs/<uuid>/read_policy
> 
>    The default value is the sector size, and the min_contiguous_read
>    value must be a multiple of the sector size.

I think it's safe to start with the round-robin policy, but the
syntax is strange; why are the [ ] mandatory? Also please call it
round-robin, or 'rr' for short.

The default of sector size is IMHO a wrong value; switching devices so
often will drop performance just because of the I/O request overhead.
From what I remember, values around 200k were reasonable, so either 192k
or 256k should be the default. We may also drop the configurable value
altogether and provide a few hard-coded sizes like rr-256k, rr-512k,
rr-1m, if only to drop parsing of user strings.

> 5. Tested FIO random read/write and defrag compression workloads with
>    min_contiguous_read set to sector size, 192k, and 256k.
> 
>    RAID1 balancing method rotation is better for multi-process workloads
>    such as fio and also single-process workload such as defragmentation.
> 
>      $ fio --filename=/btrfs/foo --size=5Gi --direct=1 --rw=randrw --bs=4k \
>         --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 \
>         --time_based --group_reporting --name=iops-test-job --eta-newline=1
> 
> 
> |         |            |            | Read I/O count  |
> |         | Read       | Write      | devid1 | devid2 |
> |---------|------------|------------|--------|--------|
> | pid     | 20.3MiB/s  | 20.5MiB/s  | 313895 | 313895 |
> | rotation|            |            |        |        |
> |     4096| 20.4MiB/s  | 20.5MiB/s  | 313895 | 313895 |
> |   196608| 20.2MiB/s  | 20.2MiB/s  | 310152 | 310175 |
> |   262144| 20.3MiB/s  | 20.4MiB/s  | 312180 | 312191 |
> |  latency| 18.4MiB/s  | 18.4MiB/s  | 272980 | 291683 |
> | devid:1 | 14.8MiB/s  | 14.9MiB/s  | 456376 | 0      |
> 
>    rotation RAID1 balancing technique performs more than 2x better for
>    single-process defrag.
> 
>       $ time -p btrfs filesystem defrag -r -f -c /btrfs
> 
> 
> |         | Time  | Read I/O Count  |
> |         | Real  | devid1 | devid2 |
> |---------|-------|--------|--------|
> | pid     | 18.00s| 3800   | 0      |
> | rotation|       |        |        |
> |     4096|  8.95s| 1900   | 1901   |
> |   196608|  8.50s| 1881   | 1919   |
> |   262144|  8.80s| 1881   | 1919   |
> | latency | 17.18s| 3800   | 0      |
> | devid:2 | 17.48s| 0      | 3800   |
> 
> Rotation keeps all devices active, and for now, the Rotation RAID1
> balancing method is preferable as the default. More workload testing is
> needed while the code is EXPERIMENTAL.

Yeah, round-robin will be a good default; we only need to verify the
chunk size and then do the switch in the next release.

> While Latency is better during failing/unstable block layer transport,
> these two techniques need to be further independently tested with
> different workloads, and in the long term we should merge them into a
> unified heuristic.

This sounds like the latency is good for a specific case and maybe a
fallback if the device becomes faulty, but once the layer below becomes
unstable we may need to skip reading from the device. This is also a
different mode of operation than balancing reads.

> Rotation keeps all devices active, and for now, the Rotation RAID1
> balancing method should be the default. More workload testing is needed
> while the code is EXPERIMENTAL.
> 
> Latency is smarter with unstable block layer transport.
> 
> Both techniques need independent testing across workloads, with the goal of
> eventually merging them into a unified approach for the long term.
> 
> Devid is a hands-on approach that provides manual or user-space script control.
> 
> These RAID1 balancing methods are tunable via the sysfs knob.
> The mount -o option and btrfs properties are under consideration.

To move forward with the feature I think the round robin and preferred
device id can be merged. I'm not sure about the latency but if it's
under experimental we can take it as is and tune later.

I'll check my notes from the last time Michal attempted to implement the
policies if we haven't missed something.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/3] raid1 balancing methods
  2024-10-11  2:49 [PATCH v2 0/3] raid1 balancing methods Anand Jain
                   ` (5 preceding siblings ...)
  2024-10-21 14:05 ` David Sterba
@ 2024-10-21 14:32 ` waxhead
  2024-10-21 15:44   ` Anand Jain
  2024-10-22  7:07   ` Johannes Thumshirn
  2024-10-24  4:39 ` Qu Wenruo
  7 siblings, 2 replies; 15+ messages in thread
From: waxhead @ 2024-10-21 14:32 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs; +Cc: dsterba, wqu, hrx

Anand Jain wrote:
> v2:
> 1. Move new features to CONFIG_BTRFS_EXPERIMENTAL instead of CONFIG_BTRFS_DEBUG.
> 2. Correct the typo from %est_wait to %best_wait.
> 3. Initialize %best_wait to U64_MAX and remove the check for 0.
> 4. Implement rotation with a minimum contiguous read threshold before
>     switching to the next stripe. Configure this, using:
> 
>          echo rotation:[min_contiguous_read] > /sys/fs/btrfs/<uuid>/read_policy
> 
>     The default value is the sector size, and the min_contiguous_read
>     value must be a multiple of the sector size.
> 
> 5. Tested FIO random read/write and defrag compression workloads with
>     min_contiguous_read set to sector size, 192k, and 256k.
> 
>     RAID1 balancing method rotation is better for multi-process workloads
>     such as fio and also single-process workload such as defragmentation.

With this functionality added, would it not also make sense to add a
RAID0/10 profile that limits the stripe width, so a stripe does not
span more than n disks (for example n=4)?

On systems with for example 24 disks in RAID10, a read may activate 12
disks at the same time, which could easily saturate the bus.

Therefore, if a storage profile existed that limits the number of
devices a stripe occupies, it seems like there might be possibilities
for RAID0/10 as well.

Note that as of writing this I believe that RAID0/10/5/6 make the stripe 
as wide as the number of storage devices available for the filesystem. 
If I am wrong about this please ignore my jabbering and move on.






^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/3] raid1 balancing methods
  2024-10-21 14:05 ` David Sterba
@ 2024-10-21 15:36   ` Anand Jain
  2024-10-21 18:42     ` David Sterba
  0 siblings, 1 reply; 15+ messages in thread
From: Anand Jain @ 2024-10-21 15:36 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs, dsterba, wqu, hrx, waxhead


Thanks for commenting.

On 21/10/24 22:05, David Sterba wrote:
> On Fri, Oct 11, 2024 at 10:49:15AM +0800, Anand Jain wrote:
>> v2:
>> 1. Move new features to CONFIG_BTRFS_EXPERIMENTAL instead of CONFIG_BTRFS_DEBUG.
>> 2. Correct the typo from %est_wait to %best_wait.
>> 3. Initialize %best_wait to U64_MAX and remove the check for 0.
>> 4. Implement rotation with a minimum contiguous read threshold before
>>     switching to the next stripe. Configure this, using:
>>
>>          echo rotation:[min_contiguous_read] > /sys/fs/btrfs/<uuid>/read_policy
>>
>>     The default value is the sector size, and the min_contiguous_read
>>     value must be a multiple of the sector size.
> 
> I think it's safe to start with the round-round robin policy, but the
> syntax is strange, why the [ ] are mandatory? Also please call it
> round-robin, or 'rr' for short.
> 

I'm fine with round-robin. The [ ] part is not mandatory; if the
min_contiguous_read value is not specified, it will default to a
predefined value.
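As a rough user-space illustration of that syntax (hypothetical code,
not the kernel's actual sysfs store handler; parse_rotation and the 4k
sector-size constant are assumptions for the example), the value part is
simply optional:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SECTORSIZE 4096UL	/* assumed sector size for the example */

/*
 * Hypothetical parser for a "rotation[:min_contiguous_read]" policy
 * string: returns the threshold in bytes, the sector-size default when
 * the value is omitted, or 0 on invalid input.
 */
static unsigned long parse_rotation(const char *buf)
{
	size_t plen = strlen("rotation");
	unsigned long val;
	char *end;

	if (strncmp(buf, "rotation", plen) != 0)
		return 0;
	if (buf[plen] == '\0')
		return SECTORSIZE;	/* value omitted: use the default */
	if (buf[plen] != ':')
		return 0;
	val = strtoul(buf + plen + 1, &end, 10);
	if (*end != '\0' || val == 0 || val % SECTORSIZE != 0)
		return 0;		/* must be a sector-size multiple */
	return val;
}
```

A fixed-name scheme like rr-256k, as David suggests, would make this
parsing step trivial: each name maps directly to one constant.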

> The default of sector size is IMHO a wrong value, switching devices so
> often will drop the performance just because of the io request overhead.

>  From what I rememer values around 200k were reasonable, so either 192k
> or 256k should be the default. We may also drop the configurable value
> at all and provide a few hard coded sizes like rr-256k, rr-512k, rr-1m,
> if not only to drop parsing of user strings.

I'm okay with a default value of 256k. For the experimental feature,
we can keep it configurable, allowing the opportunity to experiment
with other values as well.

> 
>> 5. Tested FIO random read/write and defrag compression workloads with
>>     min_contiguous_read set to sector size, 192k, and 256k.
>>
>>     RAID1 balancing method rotation is better for multi-process workloads
>>     such as fio and also single-process workload such as defragmentation.
>>
>>       $ fio --filename=/btrfs/foo --size=5Gi --direct=1 --rw=randrw --bs=4k \
>>          --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 \
>>          --time_based --group_reporting --name=iops-test-job --eta-newline=1
>>
>>
>> |         |            |            | Read I/O count  |
>> |         | Read       | Write      | devid1 | devid2 |
>> |---------|------------|------------|--------|--------|
>> | pid     | 20.3MiB/s  | 20.5MiB/s  | 313895 | 313895 |
>> | rotation|            |            |        |        |
>> |     4096| 20.4MiB/s  | 20.5MiB/s  | 313895 | 313895 |
>> |   196608| 20.2MiB/s  | 20.2MiB/s  | 310152 | 310175 |
>> |   262144| 20.3MiB/s  | 20.4MiB/s  | 312180 | 312191 |
>> |  latency| 18.4MiB/s  | 18.4MiB/s  | 272980 | 291683 |
>> | devid:1 | 14.8MiB/s  | 14.9MiB/s  | 456376 | 0      |
>>
>>     rotation RAID1 balancing technique performs more than 2x better for
>>     single-process defrag.
>>
>>        $ time -p btrfs filesystem defrag -r -f -c /btrfs
>>
>>
>> |         | Time  | Read I/O Count  |
>> |         | Real  | devid1 | devid2 |
>> |---------|-------|--------|--------|
>> | pid     | 18.00s| 3800   | 0      |
>> | rotation|       |        |        |
>> |     4096|  8.95s| 1900   | 1901   |
>> |   196608|  8.50s| 1881   | 1919   |
>> |   262144|  8.80s| 1881   | 1919   |
>> | latency | 17.18s| 3800   | 0      |
>> | devid:2 | 17.48s| 0      | 3800   |
>>
>> Rotation keeps all devices active, and for now, the Rotation RAID1
>> balancing method is preferable as default. More workload testing is
>> needed while the code is EXPERIMENTAL.
> 
> Yeah round-robin will be a good defalt, we only need to verify the chunk
> size and then do the switch in the next release.
> 

Yes..

>> While Latency is better during failing/unstable block layer transport,
>> these two techniques need to be further independently tested with
>> different workloads, and in the long term we should merge them into a
>> unified heuristic.
> 
> This sounds like the latency is good for a specific case and maybe a
> fallback if the device becomes faulty, but once the layer below becomes
> unstable we may need to skip reading from the device. This is also a
> different mode of operation than balancing reads.
> 

If the path is faulty, its latency becomes so high that the policy
shouldn't pick that path at all, so it works. However, round-robin
balancing is unaware of dynamic faults on the device path. IMO, a
round-robin method that is latency aware (with ~20% variance) would be
better.
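A minimal sketch of the latency side of this, assuming hypothetical
per-mirror stats (the mirror_stat fields and the queue-depth estimate
are illustrative, not the patchset's actual code):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative per-mirror stats for estimating read wait time. */
struct mirror_stat {
	uint64_t inflight;	/* bios currently queued to the device */
	uint64_t avg_wait_ns;	/* smoothed per-bio completion latency */
};

/*
 * Pick the mirror with the lowest estimated wait. best_wait starts at
 * the maximum value so the first candidate always wins, echoing the v2
 * change of initializing %best_wait to U64_MAX instead of
 * special-casing 0.
 */
static int pick_lowest_latency(const struct mirror_stat *m, int num)
{
	uint64_t best_wait = UINT64_MAX;
	int best = 0;

	for (int i = 0; i < num; i++) {
		uint64_t est = (m[i].inflight + 1) * m[i].avg_wait_ns;

		if (est < best_wait) {
			best_wait = est;
			best = i;
		}
	}
	return best;
}
```

A latency-aware round-robin along the lines suggested above could, for
example, rotate only among mirrors whose estimated wait is within ~20%
of this minimum.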

>> Rotation keeps all devices active, and for now, the Rotation RAID1
>> balancing method should be the default. More workload testing is needed
>> while the code is EXPERIMENTAL.
>>
>> Latency is smarter with unstable block layer transport.
>>
>> Both techniques need independent testing across workloads, with the goal of
>> eventually merging them into a unified approach for the long term.
>>
>> Devid is a hands-on approach that provides manual or user-space script control.
>>
>> These RAID1 balancing methods are tunable via the sysfs knob.
>> The mount -o option and btrfs properties are under consideration.
> 
> To move forward with the feature I think the round robin and preferred
> device id can be merged. I'm not sure about the latency but if it's
> under experimental we can take it as is and tune later.

I hope the experimental feature also means we can change the name of the
balancing method at any time. Once we have tested a fair combination of
block device types, we'll definitely need a method that can
automatically tune based on the device type, which will require adding
or dropping balancing methods accordingly.

> I'll check my notes from the last time Michal attempted to implement the
> policies if we haven't missed something.

Thanks, Anand


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/3] raid1 balancing methods
  2024-10-21 14:32 ` waxhead
@ 2024-10-21 15:44   ` Anand Jain
  2024-10-22  7:07   ` Johannes Thumshirn
  1 sibling, 0 replies; 15+ messages in thread
From: Anand Jain @ 2024-10-21 15:44 UTC (permalink / raw)
  To: waxhead, linux-btrfs; +Cc: dsterba, wqu, hrx



On 21/10/24 22:32, waxhead wrote:
> Anand Jain wrote:
>> v2:
>> 1. Move new features to CONFIG_BTRFS_EXPERIMENTAL instead of 
>> CONFIG_BTRFS_DEBUG.
>> 2. Correct the typo from %est_wait to %best_wait.
>> 3. Initialize %best_wait to U64_MAX and remove the check for 0.
>> 4. Implement rotation with a minimum contiguous read threshold before
>>     switching to the next stripe. Configure this, using:
>>
>>          echo rotation:[min_contiguous_read] > /sys/fs/btrfs/<uuid>/read_policy
>>
>>     The default value is the sector size, and the min_contiguous_read
>>     value must be a multiple of the sector size.
>>
>> 5. Tested FIO random read/write and defrag compression workloads with
>>     min_contiguous_read set to sector size, 192k, and 256k.
>>
>>     RAID1 balancing method rotation is better for multi-process workloads
>>     such as fio and also single-process workload such as defragmentation.
> 
> With this functionality added, would it not also make sense to add a 
> RAID0/10 profile that limits the stripe width, so a stripe does not 
> spawn more than n disk (for example n=4).
> 
> On systems with for example 24 disks in RAID10, a read may activate
> 12 disks at the same time, which could easily saturate the bus.
> 
> Therefore, if a storage profile existed that limits the number of
> devices a stripe occupies, it seems like there might be possibilities
> for RAID0/10 as well.
> 
> Note that as of writing this I believe that RAID0/10/5/6 make the stripe 
> as wide as the number of storage devices available for the filesystem. 
> If I am wrong about this please ignore my jabbering and move on.

That's correct. I previously attempted to come up with a fix using
the device grouping method. If there's a convincing and more generic
way to specify how the devices should be grouped, we could consider
that.

Thanks, Anand

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/3] raid1 balancing methods
  2024-10-21 15:36   ` Anand Jain
@ 2024-10-21 18:42     ` David Sterba
  2024-10-22  0:31       ` Anand Jain
  0 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2024-10-21 18:42 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs, dsterba, wqu, hrx, waxhead

On Mon, Oct 21, 2024 at 11:36:10PM +0800, Anand Jain wrote:
> > I think it's safe to start with the round-round robin policy, but the
> > syntax is strange, why the [ ] are mandatory? Also please call it
> > round-robin, or 'rr' for short.
> 
> I'm fine with round-robin. The [ ] part is not mandatory; if the
> min_contiguous_read value is not specified, it will default to a
> predefined value.
> 
> > The default of sector size is IMHO a wrong value, switching devices so
> > often will drop the performance just because of the io request overhead.
> 
> >  From what I rememer values around 200k were reasonable, so either 192k
> > or 256k should be the default. We may also drop the configurable value
> > at all and provide a few hard coded sizes like rr-256k, rr-512k, rr-1m,
> > if not only to drop parsing of user strings.
> 
> I'm okay with a default value of 256k. For the experimental feature,
> we can keep it configurable, allowing the opportunity to experiment
> with other values as well

Yeah, for experimenting it makes sense to make it flexible, no need to
patch and reboot the kernel. For final we should settle on some
reasonable values.
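As a rough illustration of the string parsing being debated (a userspace model with assumed names and the 256k default suggested above, not the kernel code), the `round-robin[:min_contiguous_read]` syntax could look like:

```python
SECTOR_SIZE = 4096               # assumed sector size
DEFAULT_MIN_CONTIG = 256 * 1024  # the 256k default suggested above

def parse_read_policy(value: str) -> int:
    """Parse 'round-robin[:min_contiguous_read]'; return the threshold."""
    name, sep, arg = value.strip().partition(":")
    if name != "round-robin":
        raise ValueError(f"unknown policy: {name}")
    if not sep:                  # no ':' given -> fall back to the default
        return DEFAULT_MIN_CONTIG
    size = int(arg)
    if size <= 0 or size % SECTOR_SIZE:
        raise ValueError("min_contiguous_read must be a positive "
                         "multiple of the sector size")
    return size
```

With this model `echo round-robin > .../read_policy` and `echo round-robin:262144 > .../read_policy` behave identically, which is why the brackets need not be mandatory.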

> >> 5. Tested FIO random read/write and defrag compression workloads with
> >>     min_contiguous_read set to sector size, 192k, and 256k.
> >>
> >>     RAID1 balancing method rotation is better for multi-process workloads
> >>     such as fio and also single-process workload such as defragmentation.
> >>
> >>       $ fio --filename=/btrfs/foo --size=5Gi --direct=1 --rw=randrw --bs=4k \
> >>          --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 \
> >>          --time_based --group_reporting --name=iops-test-job --eta-newline=1
> >>
> >>
> >> |         |            |            | Read I/O count  |
> >> |         | Read       | Write      | devid1 | devid2 |
> >> |---------|------------|------------|--------|--------|
> >> | pid     | 20.3MiB/s  | 20.5MiB/s  | 313895 | 313895 |
> >> | rotation|            |            |        |        |
> >> |     4096| 20.4MiB/s  | 20.5MiB/s  | 313895 | 313895 |
> >> |   196608| 20.2MiB/s  | 20.2MiB/s  | 310152 | 310175 |
> >> |   262144| 20.3MiB/s  | 20.4MiB/s  | 312180 | 312191 |
> >> |  latency| 18.4MiB/s  | 18.4MiB/s  | 272980 | 291683 |
> >> | devid:1 | 14.8MiB/s  | 14.9MiB/s  | 456376 | 0      |
> >>
> >>     rotation RAID1 balancing technique performs more than 2x better for
> >>     single-process defrag.
> >>
> >>        $ time -p btrfs filesystem defrag -r -f -c /btrfs
> >>
> >>
> >> |         | Time  | Read I/O Count  |
> >> |         | Real  | devid1 | devid2 |
> >> |---------|-------|--------|--------|
> >> | pid     | 18.00s| 3800   | 0      |
> >> | rotation|       |        |        |
> >> |     4096|  8.95s| 1900   | 1901   |
> >> |   196608|  8.50s| 1881   | 1919   |
> >> |   262144|  8.80s| 1881   | 1919   |
> >> | latency | 17.18s| 3800   | 0      |
> >> | devid:2 | 17.48s| 0      | 3800   |
> >>
> >> Rotation keeps all devices active, and for now, the Rotation RAID1
> >> balancing method is preferable as default. More workload testing is
> >> needed while the code is EXPERIMENTAL.
> > 
> > Yeah round-robin will be a good default, we only need to verify the chunk
> > size and then do the switch in the next release.
> > 
> 
> Yes..
> 
> >> While latency is better with a failing/unstable block layer transport.
> >> As of now these two techniques need to be independently tested with
> >> different workloads, and in the long term we should merge them into a
> >> unified heuristic.
> > 
> > This sounds like the latency is good for a specific case and maybe a
> > fallback if the device becomes faulty, but once the layer below becomes
> > unstable we may need to skip reading from the device. This is also a
> > different mode of operation than balancing reads.
> > 
> 
> If the latency on the faulty path is high enough, that path won't be
> picked at all, so it works. However, the round-robin balancing is unaware
> of dynamic faults on the device path. IMO, a round-robin method that is
> latency aware (with ~20% variance) would be better.

We should not mix the faulty device handling mode into the read balancing,
at least for now. A back-off algorithm that checks the number of failed I/O
requests should precede the balancing.
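For reference, the rotation scheme under discussion can be sketched as a counter that only moves to the next mirror after min_contiguous_read worth of sector-sized reads has been issued — a toy userspace model of the idea, not the actual patch:

```python
class RoundRobinPolicy:
    """Toy model of round-robin RAID1 read balancing with a minimum
    contiguous read threshold (illustrative, not the kernel code)."""

    def __init__(self, num_mirrors: int,
                 min_contiguous_read: int = 256 * 1024,
                 sector_size: int = 4096):
        self.num_mirrors = num_mirrors
        # How many sector-sized reads one mirror serves before switching.
        self.reads_per_switch = min_contiguous_read // sector_size
        self.count = 0  # models a per-filesystem atomic read counter

    def next_mirror(self) -> int:
        mirror = (self.count // self.reads_per_switch) % self.num_mirrors
        self.count += 1
        return mirror
```

With min_contiguous_read equal to the sector size the mirror flips on every read, which is exactly the request-overhead concern raised above; a 192k/256k threshold keeps runs of consecutive reads on one device.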

> >> Rotation keeps all devices active, and for now, the Rotation RAID1
> >> balancing method should be the default. More workload testing is needed
> >> while the code is EXPERIMENTAL.
> >>
> >> Latency is smarter with unstable block layer transport.
> >>
> >> Both techniques need independent testing across workloads, with the goal of
> >> eventually merging them into a unified approach for the long term.
> >>
> >> Devid is a hands-on approach, providing manual or user-space script control.
> >>
> >> These RAID1 balancing methods are tunable via the sysfs knob.
> >> The mount -o option and btrfs properties are under consideration.
> > 
> > To move forward with the feature I think the round robin and preferred
> > device id can be merged. I'm not sure about the latency but if it's
> > under experimental we can take it as is and tune later.
> 
> I hope the experimental feature also means we can change the name of the
> balancing method at any time. Once we have tested a fair combination of
> block device types, we'll definitely need a method that can
> automatically tune based on the device type, which will require adding
> or dropping balancing methods accordingly.

Yes, we can change the names. The automatic tuning would need some
feedback that measures the load and tries to improve the throughput;
this is where we got stuck last time. So for now let's do some
straightforward policy that on average works better than the current pid
policy. I hope that round-robin-256k can be a good default, but of
course we need more data for that.
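The latency policy discussed in this thread (with %best_wait initialized to U64_MAX per the v2 changelog) amounts to picking the mirror with the lowest estimated wait. A hedged userspace sketch, where the wait estimate and the per-device inputs are assumptions for illustration:

```python
U64_MAX = 2**64 - 1

def lowest_latency_mirror(mirrors: dict[int, tuple[int, int]]) -> int:
    """mirrors maps devid -> (inflight_ios, avg_io_latency_ns).
    Illustrative model: estimated wait = in-flight I/Os * average
    latency; best_wait starts at U64_MAX so any device beats it."""
    best_wait = U64_MAX
    best_devid = None
    for devid, (inflight, avg_latency) in mirrors.items():
        est_wait = inflight * avg_latency
        if est_wait < best_wait:
            best_wait, best_devid = est_wait, devid
    return best_devid
```

In this model a faulty path whose latency balloons simply stops being selected, which matches the behavior described above — but it stays oblivious to load distribution, the mirror image of round-robin's blindness to faults.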


* Re: [PATCH v2 0/3] raid1 balancing methods
  2024-10-21 18:42     ` David Sterba
@ 2024-10-22  0:31       ` Anand Jain
  0 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2024-10-22  0:31 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs, dsterba, wqu, hrx, waxhead



On 22/10/24 02:42, David Sterba wrote:
> On Mon, Oct 21, 2024 at 11:36:10PM +0800, Anand Jain wrote:
>>> I think it's safe to start with the round-robin policy, but the
>>> syntax is strange, why the [ ] are mandatory? Also please call it
>>> round-robin, or 'rr' for short.
>>
>> I'm fine with round-robin. The [ ] part is not mandatory; if the
>> min_contiguous_read value is not specified, it will default to a
>> predefined value.
>>
>>> The default of sector size is IMHO a wrong value, switching devices so
>>> often will drop the performance just because of the io request overhead.
>>
>>>   From what I remember values around 200k were reasonable, so either 192k
>>> or 256k should be the default. We may also drop the configurable value
>>> at all and provide a few hard coded sizes like rr-256k, rr-512k, rr-1m,
>>> if not only to drop parsing of user strings.
>>
>> I'm okay with a default value of 256k. For the experimental feature,
>> we can keep it configurable, allowing the opportunity to experiment
>> with other values as well
> 
> Yeah, for experimenting it makes sense to make it flexible, no need to
> patch and reboot the kernel. For final we should settle on some
> reasonable values.
> 
>>>> 5. Tested FIO random read/write and defrag compression workloads with
>>>>      min_contiguous_read set to sector size, 192k, and 256k.
>>>>
>>>>      RAID1 balancing method rotation is better for multi-process workloads
>>>>      such as fio and also single-process workload such as defragmentation.
>>>>
>>>>        $ fio --filename=/btrfs/foo --size=5Gi --direct=1 --rw=randrw --bs=4k \
>>>>           --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 \
>>>>           --time_based --group_reporting --name=iops-test-job --eta-newline=1
>>>>
>>>>
>>>> |         |            |            | Read I/O count  |
>>>> |         | Read       | Write      | devid1 | devid2 |
>>>> |---------|------------|------------|--------|--------|
>>>> | pid     | 20.3MiB/s  | 20.5MiB/s  | 313895 | 313895 |
>>>> | rotation|            |            |        |        |
>>>> |     4096| 20.4MiB/s  | 20.5MiB/s  | 313895 | 313895 |
>>>> |   196608| 20.2MiB/s  | 20.2MiB/s  | 310152 | 310175 |
>>>> |   262144| 20.3MiB/s  | 20.4MiB/s  | 312180 | 312191 |
>>>> |  latency| 18.4MiB/s  | 18.4MiB/s  | 272980 | 291683 |
>>>> | devid:1 | 14.8MiB/s  | 14.9MiB/s  | 456376 | 0      |
>>>>
>>>>      rotation RAID1 balancing technique performs more than 2x better for
>>>>      single-process defrag.
>>>>
>>>>         $ time -p btrfs filesystem defrag -r -f -c /btrfs
>>>>
>>>>
>>>> |         | Time  | Read I/O Count  |
>>>> |         | Real  | devid1 | devid2 |
>>>> |---------|-------|--------|--------|
>>>> | pid     | 18.00s| 3800   | 0      |
>>>> | rotation|       |        |        |
>>>> |     4096|  8.95s| 1900   | 1901   |
>>>> |   196608|  8.50s| 1881   | 1919   |
>>>> |   262144|  8.80s| 1881   | 1919   |
>>>> | latency | 17.18s| 3800   | 0      |
>>>> | devid:2 | 17.48s| 0      | 3800   |
>>>>
>>>> Rotation keeps all devices active, and for now, the Rotation RAID1
>>>> balancing method is preferable as default. More workload testing is
>>>> needed while the code is EXPERIMENTAL.
>>>
>>> Yeah round-robin will be a good default, we only need to verify the chunk
>>> size and then do the switch in the next release.
>>>
>>
>> Yes..
>>
>>>> While latency is better with a failing/unstable block layer transport.
>>>> As of now these two techniques need to be independently tested with
>>>> different workloads, and in the long term we should merge them into a
>>>> unified heuristic.
>>>
>>> This sounds like the latency is good for a specific case and maybe a
>>> fallback if the device becomes faulty, but once the layer below becomes
>>> unstable we may need to skip reading from the device. This is also a
>>> different mode of operation than balancing reads.
>>>
>>
>> If the latency on the faulty path is high enough, that path won't be
>> picked at all, so it works. However, the round-robin balancing is unaware
>> of dynamic faults on the device path. IMO, a round-robin method that is
>> latency aware (with ~20% variance) would be better.
> 
> We should not mix the faulty device handling mode into the read balancing,
> at least for now. A back-off algorithm that checks the number of failed I/O
> requests should precede the balancing.
> 
>>>> Rotation keeps all devices active, and for now, the Rotation RAID1
>>>> balancing method should be the default. More workload testing is needed
>>>> while the code is EXPERIMENTAL.
>>>>
>>>> Latency is smarter with unstable block layer transport.
>>>>
>>>> Both techniques need independent testing across workloads, with the goal of
>>>> eventually merging them into a unified approach for the long term.
>>>>
>>>> Devid is a hands-on approach, providing manual or user-space script control.
>>>>
>>>> These RAID1 balancing methods are tunable via the sysfs knob.
>>>> The mount -o option and btrfs properties are under consideration.
>>>
>>> To move forward with the feature I think the round robin and preferred
>>> device id can be merged. I'm not sure about the latency but if it's
>>> under experimental we can take it as is and tune later.
>>
>> I hope the experimental feature also means we can change the name of the
>> balancing method at any time. Once we have tested a fair combination of
>> block device types, we'll definitely need a method that can
>> automatically tune based on the device type, which will require adding
>> or dropping balancing methods accordingly.
> 
> Yes, we can change the names. The automatic tuning would need some
> feedback that measures the load and tries to improve the throughput;
> this is where we got stuck last time. So for now let's do some
> straightforward policy that on average works better than the current pid
> policy. I hope that round-robin-256k can be a good default, but of
> course we need more data for that.

Sending v3 with rotation renamed to round-robin. Code review
appreciated; I'll wait a day.

Thanks, Anand


* Re: [PATCH v2 0/3] raid1 balancing methods
  2024-10-21 14:32 ` waxhead
  2024-10-21 15:44   ` Anand Jain
@ 2024-10-22  7:07   ` Johannes Thumshirn
  1 sibling, 0 replies; 15+ messages in thread
From: Johannes Thumshirn @ 2024-10-22  7:07 UTC (permalink / raw)
  To: waxhead@dirtcellar.net, Anand Jain, linux-btrfs@vger.kernel.org
  Cc: dsterba@suse.com, WenRuo Qu, hrx@bupt.moe

On 21.10.24 16:32, waxhead wrote:
> Note that as of writing this I believe that RAID0/10/5/6 make the stripe
> as wide as the number of storage devices available for the filesystem.
> If I am wrong about this please ignore my jabbering and move on.

Nope, you're correct and this is a huge problem for bigger (in number 
of drives) arrays.

But it's also on my list of things I want to change in how we handle 
RAID with the RAID stripe-tree. This way we can do declustered RAID and 
ease rebuild times. We can also drastically enhance write parallelism 
to an array by directing different write streams to different sets of 
stripes. This, by the way, isn't currently done even for RAID1: we pick 
a block group at a time and write to it until it's full, instead of 
creating new block groups on new drive sets for different write 
streams (i.e. different files, etc.).


* Re: [PATCH v2 0/3] raid1 balancing methods
  2024-10-11  2:49 [PATCH v2 0/3] raid1 balancing methods Anand Jain
                   ` (6 preceding siblings ...)
  2024-10-21 14:32 ` waxhead
@ 2024-10-24  4:39 ` Qu Wenruo
  7 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2024-10-24  4:39 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs; +Cc: dsterba, wqu, hrx, waxhead



On 2024/10/11 13:19, Anand Jain wrote:
> v2:
> 1. Move new features to CONFIG_BTRFS_EXPERIMENTAL instead of CONFIG_BTRFS_DEBUG.
> 2. Correct the typo from %est_wait to %best_wait.
> 3. Initialize %best_wait to U64_MAX and remove the check for 0.
> 4. Implement rotation with a minimum contiguous read threshold before
>     switching to the next stripe. Configure this, using:
>
>          echo rotation:[min_contiguous_read] > /sys/fs/btrfs/<uuid>/read_policy
>
>     The default value is the sector size, and the min_contiguous_read
>     value must be a multiple of the sector size.
>
> 5. Tested FIO random read/write and defrag compression workloads with
>     min_contiguous_read set to sector size, 192k, and 256k.
>
>     RAID1 balancing method rotation is better for multi-process workloads
>     such as fio and also single-process workload such as defragmentation.
>
>       $ fio --filename=/btrfs/foo --size=5Gi --direct=1 --rw=randrw --bs=4k \
>          --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 \
>          --time_based --group_reporting --name=iops-test-job --eta-newline=1

Reviewed-by: Qu Wenruo <wqu@suse.com>

Although not 100% happy with the min_contiguous_read setting, since it's
an optional one and still experimental, I'm fine with the series so far.


Just want to express my concern about going with a mount option.

I know sysfs is not a good way to set up a lot of features, but a mount
option is way too committal to me, even under experimental features.

But I also understand that without a mount option it can be pretty hard
to set up the read policy for fstests runs.

So I'd prefer to have some on-disk solution (XATTR or temporary items)
to save the read policy.
It's less committal compared to a mount option (i.e., much easier to
revert the change without breaking any compatibility), and can help
future features.

Thanks,
Qu
>
>
> |         |            |            | Read I/O count  |
> |         | Read       | Write      | devid1 | devid2 |
> |---------|------------|------------|--------|--------|
> | pid     | 20.3MiB/s  | 20.5MiB/s  | 313895 | 313895 |
> | rotation|            |            |        |        |
> |     4096| 20.4MiB/s  | 20.5MiB/s  | 313895 | 313895 |
> |   196608| 20.2MiB/s  | 20.2MiB/s  | 310152 | 310175 |
> |   262144| 20.3MiB/s  | 20.4MiB/s  | 312180 | 312191 |
> |  latency| 18.4MiB/s  | 18.4MiB/s  | 272980 | 291683 |
> | devid:1 | 14.8MiB/s  | 14.9MiB/s  | 456376 | 0      |
>
>     rotation RAID1 balancing technique performs more than 2x better for
>     single-process defrag.
>
>        $ time -p btrfs filesystem defrag -r -f -c /btrfs
>
>
> |         | Time  | Read I/O Count  |
> |         | Real  | devid1 | devid2 |
> |---------|-------|--------|--------|
> | pid     | 18.00s| 3800   | 0      |
> | rotation|       |        |        |
> |     4096|  8.95s| 1900   | 1901   |
> |   196608|  8.50s| 1881   | 1919   |
> |   262144|  8.80s| 1881   | 1919   |
> | latency | 17.18s| 3800   | 0      |
> | devid:2 | 17.48s| 0      | 3800   |
>
> Rotation keeps all devices active, and for now, the Rotation RAID1
> balancing method is preferable as default. More workload testing is
> needed while the code is EXPERIMENTAL.
> While latency is better with a failing/unstable block layer transport.
> As of now these two techniques need to be independently tested with
> different workloads, and in the long term we should merge them into a
> unified heuristic.
>
> Rotation keeps all devices active, and for now, the Rotation RAID1
> balancing method should be the default. More workload testing is needed
> while the code is EXPERIMENTAL.
>
> Latency is smarter with unstable block layer transport.
>
> Both techniques need independent testing across workloads, with the goal of
> eventually merging them into a unified approach for the long term.
>
> Devid is a hands-on approach, providing manual or user-space script control.
>
> These RAID1 balancing methods are tunable via the sysfs knob.
> The mount -o option and btrfs properties are under consideration.
>
> Thx.
>
> --------- original v1 ------------
>
> The RAID1-balancing methods help distribute read I/O across devices, and
> this patch introduces three balancing methods: rotation, latency, and
> devid. These methods are enabled under the `CONFIG_BTRFS_DEBUG` config
> option and are on top of the previously added
> `/sys/fs/btrfs/<UUID>/read_policy` interface to configure the desired
> RAID1 read balancing method.
>
> I've tested these patches using fio and filesystem defragmentation
> workloads on a two-device RAID1 setup (with both data and metadata
> mirrored across identical devices). I tracked device read counts by
> extracting stats from `/sys/devices/<..>/stat` for each device. Below is
> a summary of the results, with each result the average of three
> iterations.
>
> A typical generic random rw workload:
>
> $ fio --filename=/btrfs/foo --size=10Gi --direct=1 --rw=randrw --bs=4k \
>    --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based \
>    --group_reporting --name=iops-test-job --eta-newline=1
>
> |         |            |            | Read I/O count  |
> |         | Read       | Write      | devid1 | devid2 |
> |---------|------------|------------|--------|--------|
> | pid     | 29.4MiB/s  | 29.5MiB/s  | 456548 | 447975 |
> | rotation| 29.3MiB/s  | 29.3MiB/s  | 450105 | 450055 |
> | latency | 21.9MiB/s  | 21.9MiB/s  | 672387 | 0      |
> | devid:1 | 22.0MiB/s  | 22.0MiB/s  | 674788 | 0      |
>
> Defragmentation with compression workload:
>
> $ xfs_io -f -d -c 'pwrite -S 0xab 0 1G' /btrfs/foo
> $ sync
> $ echo 3 > /proc/sys/vm/drop_caches
> $ btrfs filesystem defrag -f -c /btrfs/foo
>
> |         | Time  | Read I/O Count  |
> |         | Real  | devid1 | devid2 |
> |---------|-------|--------|--------|
> | pid     | 21.61s| 3810   | 0      |
> | rotation| 11.55s| 1905   | 1905   |
> | latency | 20.99s| 0      | 3810   |
> | devid:2 | 21.41s| 0      | 3810   |
>
> . The PID-based balancing method works well for the generic random rw fio
>    workload.
> . The rotation method is ideal when you want to keep both devices active,
>    and it boosts performance in sequential defragmentation scenarios.
> . The latency-based method works well when we have mixed device types, or
>    when one device experiences intermittent I/O failures: the latency
>    increases and it automatically picks the other device for further read
>    I/Os.
> . The devid method is a more hands-on approach, useful for diagnosing and
>    testing RAID1 mirror synchronizations.
>
> Anand Jain (3):
>    btrfs: introduce RAID1 round-robin read balancing
>    btrfs: use the path with the lowest latency for RAID1 reads
>    btrfs: add RAID1 preferred read device
>
>   fs/btrfs/disk-io.c |   4 ++
>   fs/btrfs/sysfs.c   | 116 +++++++++++++++++++++++++++++++++++++++------
>   fs/btrfs/volumes.c | 109 ++++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/volumes.h |  16 +++++++
>   4 files changed, 230 insertions(+), 15 deletions(-)
>



end of thread, other threads:[~2024-10-24  4:39 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-11  2:49 [PATCH v2 0/3] raid1 balancing methods Anand Jain
2024-10-11  2:49 ` [PATCH v2 1/3] btrfs: introduce RAID1 round-robin read balancing Anand Jain
2024-10-11  2:49 ` [PATCH v2 2/3] btrfs: use the path with the lowest latency for RAID1 reads Anand Jain
2024-10-11  2:49 ` [PATCH v2 3/3] btrfs: add RAID1 preferred read device Anand Jain
2024-10-11  3:35 ` [PATCH v2 0/3] raid1 balancing methods Anand Jain
2024-10-11  4:59 ` Qu Wenruo
2024-10-11  6:04   ` Anand Jain
2024-10-21 14:05 ` David Sterba
2024-10-21 15:36   ` Anand Jain
2024-10-21 18:42     ` David Sterba
2024-10-22  0:31       ` Anand Jain
2024-10-21 14:32 ` waxhead
2024-10-21 15:44   ` Anand Jain
2024-10-22  7:07   ` Johannes Thumshirn
2024-10-24  4:39 ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).