[PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles

* [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles
@ 2025-05-12 18:07 Anand Jain
  2025-05-12 18:07 ` [PATCH 01/10] btrfs: fix thresh scope in should_alloc_chunk() Anand Jain
                   ` (16 more replies)
  0 siblings, 17 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:07 UTC (permalink / raw)
  To: linux-btrfs

Host hardware often mixes devices of different speeds. Generally, faster
devices have lower capacity while slower devices have higher capacity.
A typical configuration would expect that:

 - A filesystem's read/write performance is evenly distributed on average
 across the entire filesystem. This is not achievable with the current
 allocation method because chunks are allocated based only on device free
 space.

 - Typically, faster devices are assigned to metadata chunk allocations
 while slower devices are assigned to data chunk allocations.

Introducing Device Roles:

 Here I define 5 device roles in a specific order for metadata and in the
 reverse order for data: metadata_only, metadata, none, data, data_only.
 One or more devices may have the same role.

 The metadata and data roles indicate preference but not exclusivity for
 that role, whereas data_only and metadata_only are exclusive roles.

Introducing Role-then-Space allocation method:

 Metadata allocation can happen on devices with the roles metadata_only,
 metadata, none, and data in that order. If multiple devices share a role,
 they are arranged based on device free space.

 Similarly, data allocation can happen on devices with the roles data_only,
 data, none, and metadata in that order. If multiple devices share a role,
 they are arranged based on device free space.
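
 The ordering described above can be sketched in userspace C. This is only
 an illustration under stated assumptions, not the code from this series:
 struct dev_info, the ROLE_* names, and order_devices() are hypothetical
 stand-ins, and the role values simply mirror the enum proposed in patch
 08/10. Devices holding the opposite exclusive role sort last and are cut
 from the eligible set.

```c
#include <stdint.h>
#include <stdlib.h>

/* Role values mirror the enum proposed in patch 08/10. */
enum dev_role {
	ROLE_METADATA_ONLY = 20,
	ROLE_METADATA      = 40,
	ROLE_NONE          = 80,
	ROLE_DATA          = 100,
	ROLE_DATA_ONLY     = 120,
};

/* Hypothetical, simplified stand-in for struct btrfs_device_info. */
struct dev_info {
	int devid;
	enum dev_role role;
	uint64_t max_avail;
};

/* Larger free space first; used as the tie-break within one role. */
static int cmp_space(const struct dev_info *a, const struct dev_info *b)
{
	if (a->max_avail > b->max_avail)
		return -1;
	if (a->max_avail < b->max_avail)
		return 1;
	return 0;
}

/* Metadata order: ascending role value, then descending free space. */
static int cmp_metadata(const void *pa, const void *pb)
{
	const struct dev_info *a = pa, *b = pb;

	if (a->role != b->role)
		return a->role < b->role ? -1 : 1;
	return cmp_space(a, b);
}

/* Data order: descending role value, then descending free space. */
static int cmp_data(const void *pa, const void *pb)
{
	const struct dev_info *a = pa, *b = pb;

	if (a->role != b->role)
		return a->role > b->role ? -1 : 1;
	return cmp_space(a, b);
}

/*
 * Sort devs[] for one allocation and return how many leading entries are
 * eligible. Devices with the opposite exclusive role sort last and are cut.
 */
static size_t order_devices(struct dev_info *devs, size_t n, int for_metadata)
{
	enum dev_role excluded = for_metadata ? ROLE_DATA_ONLY
					      : ROLE_METADATA_ONLY;
	size_t eligible = n;

	qsort(devs, n, sizeof(*devs), for_metadata ? cmp_metadata : cmp_data);
	while (eligible > 0 && devs[eligible - 1].role == excluded)
		eligible--;
	return eligible;
}
```

 Calling order_devices(devs, n, 1) arranges the array for a metadata
 allocation and order_devices(devs, n, 0) for a data allocation; only the
 first `eligible` entries would then be considered for stripes.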

Finding device speed automatically:

 Using measured device read/write latency to drive allocation is not a good
 idea, because historical readings may be misleading: they could include
 iostat data from periods with issues that have since been fixed. Testing
 to determine relative latency, then arranging devices in ascending order
 for metadata and descending order for data, is possible, but is better
 handled by an external tool, which can still set device roles.

On-Disk Format changes:

 The following fields are defined in the on-disk format but are unused:

	btrfs_dev_item::
	 __le64 type; // unused
	 __le64 start_offset; // unused
	 __le32 dev_group; // unused
	 __u8 seek_speed; // unused
	 __u8 bandwidth; // unused

 Device roles use 8 bits of the dev_item::type field to store each
 device's role.
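
 As a rough sketch of what packing a role into the type field could look
 like: the helpers below are hypothetical, and placing the role in the low
 8 bits is an assumption based on the 0xff mask defined in patch 08/10. The
 little-endian conversion of the on-disk __le64 is left out.

```c
#include <stdint.h>

/* Mirrors BTRFS_DEVICE_ROLE_MASK; the low-byte placement is assumed. */
#define DEVICE_ROLE_MASK 0xffULL

/* Extract the role byte from a CPU-endian copy of dev_item::type. */
static uint8_t dev_type_get_role(uint64_t type)
{
	return (uint8_t)(type & DEVICE_ROLE_MASK);
}

/* Replace the role byte and leave the remaining 56 bits untouched. */
static uint64_t dev_type_set_role(uint64_t type, uint8_t role)
{
	return (type & ~DEVICE_ROLE_MASK) | role;
}
```

 Keeping the other bits intact leaves room for later uses of the same
 field.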

Anand Jain (10):
  btrfs: fix thresh scope in should_alloc_chunk()
  btrfs: refactor should_alloc_chunk() arg type
  btrfs: introduce btrfs_split_sysfs_arg() for argument parsing
  btrfs: introduce device allocation method
  btrfs: sysfs: show device allocation method
  btrfs: skip device sorting when only one device is present
  btrfs: refactor chunk allocation device handling to use list_head
  btrfs: introduce explicit device roles for block groups
  btrfs: introduce ROLE_THEN_SPACE device allocation method
  btrfs: pass device roles through device add ioctl

 fs/btrfs/block-group.c |  11 +-
 fs/btrfs/ioctl.c       |  12 +-
 fs/btrfs/sysfs.c       | 130 ++++++++++++++++++++--
 fs/btrfs/volumes.c     | 242 +++++++++++++++++++++++++++++++++--------
 fs/btrfs/volumes.h     |  35 +++++-
 5 files changed, 366 insertions(+), 64 deletions(-)

-- 
2.49.0

* [PATCH 01/10] btrfs: fix thresh scope in should_alloc_chunk()
  2025-05-12 18:07 [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Anand Jain
@ 2025-05-12 18:07 ` Anand Jain
  2025-05-12 18:07 ` [PATCH 02/10] btrfs: refactor should_alloc_chunk() arg type Anand Jain
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:07 UTC (permalink / raw)
  To: linux-btrfs

Move the thresh variable declaration from function scope to the local block
where it is used, and tidy the comment formatting. No functional changes.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/block-group.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 20f238dd8d96..f8317410724a 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -3876,17 +3876,17 @@ static bool should_alloc_chunk(const struct btrfs_fs_info *fs_info,
 			       const struct btrfs_space_info *sinfo, int force)
 {
 	u64 bytes_used = btrfs_space_info_used(sinfo, false);
-	u64 thresh;
 
 	if (force == CHUNK_ALLOC_FORCE)
 		return true;
 
 	/*
-	 * in limited mode, we want to have some free space up to
-	 * about 1% of the FS size.
+	 * In limited mode, we want to have some free space up to about 1% of
+	 * the FS size.
 	 */
 	if (force == CHUNK_ALLOC_LIMITED) {
-		thresh = btrfs_super_total_bytes(fs_info->super_copy);
+		u64 thresh = btrfs_super_total_bytes(fs_info->super_copy);
+
 		thresh = max_t(u64, SZ_64M, mult_perc(thresh, 1));
 
 		if (sinfo->total_bytes - bytes_used < thresh)
-- 
2.49.0


* [PATCH 02/10] btrfs: refactor should_alloc_chunk() arg type
  2025-05-12 18:07 [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Anand Jain
  2025-05-12 18:07 ` [PATCH 01/10] btrfs: fix thresh scope in should_alloc_chunk() Anand Jain
@ 2025-05-12 18:07 ` Anand Jain
  2025-05-12 18:07 ` [PATCH 03/10] btrfs: introduce btrfs_split_sysfs_arg() for argument parsing Anand Jain
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:07 UTC (permalink / raw)
  To: linux-btrfs

The force argument of should_alloc_chunk() takes values of
enum btrfs_chunk_alloc_enum, so declare it with that type instead of int.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/block-group.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index f8317410724a..c92345619f96 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -3873,7 +3873,8 @@ static void force_metadata_allocation(struct btrfs_fs_info *info)
 }
 
 static bool should_alloc_chunk(const struct btrfs_fs_info *fs_info,
-			       const struct btrfs_space_info *sinfo, int force)
+			       const struct btrfs_space_info *sinfo,
+			       enum btrfs_chunk_alloc_enum force)
 {
 	u64 bytes_used = btrfs_space_info_used(sinfo, false);
 
-- 
2.49.0


* [PATCH 03/10] btrfs: introduce btrfs_split_sysfs_arg() for argument parsing
  2025-05-12 18:07 [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Anand Jain
  2025-05-12 18:07 ` [PATCH 01/10] btrfs: fix thresh scope in should_alloc_chunk() Anand Jain
  2025-05-12 18:07 ` [PATCH 02/10] btrfs: refactor should_alloc_chunk() arg type Anand Jain
@ 2025-05-12 18:07 ` Anand Jain
  2025-05-12 18:07 ` [PATCH 04/10] btrfs: introduce device allocation method Anand Jain
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:07 UTC (permalink / raw)
  To: linux-btrfs

Refactor `btrfs_read_policy_to_enum()` in preparation for adding new sysfs
knobs, such as device roles and block-group mapping. This commit introduces
the `btrfs_split_sysfs_arg()` helper function to parse string arguments in
the "string:value" format.
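
For readers unfamiliar with the format, a minimal userspace sketch of
splitting a "string:value" argument in place follows. It is not the kernel
helper: split_arg() is a hypothetical name, and strtoll() is used here as a
stand-in for whatever number parsing the kernel side performs.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split "name:value" in place. On success param holds only the name and,
 * if a value was present, *value_ret holds the parsed number.
 */
static int split_arg(char *param, int64_t *value_ret)
{
	char *value_str = strchr(param, ':');
	char *end;
	long long value;

	if (!value_str)
		return 0;	/* No value part; the name is left as is. */

	*value_str++ = '\0';	/* Terminate the name, step to the value. */
	if (!value_ret || *value_str == '\0')
		return -EINVAL;

	errno = 0;
	value = strtoll(value_str, &end, 10);
	if (errno || *end != '\0')
		return -EINVAL;	/* Overflow or trailing garbage. */

	*value_ret = value;
	return 0;
}
```

Because the name is terminated in place, the caller can pass the same buffer
straight to a name lookup afterwards.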

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/sysfs.c | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 5d93d9dd2c12..6ba118d45a92 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1317,7 +1317,8 @@ static const char *btrfs_read_policy_name[] = {
 #ifdef CONFIG_BTRFS_EXPERIMENTAL
 
 /* Global module configuration parameters. */
-static char *read_policy;
+/* NULL read_policy will set the raid1 balancing to the default */
+static char *read_policy = NULL;
 char *btrfs_get_mod_read_policy(void)
 {
 	return read_policy;
@@ -1329,16 +1330,10 @@ MODULE_PARM_DESC(read_policy,
 "Global read policy: pid (default), round-robin[:<min_contig_read>], devid[:<devid>]");
 #endif
 
-int btrfs_read_policy_to_enum(const char *str, s64 *value_ret)
+static int btrfs_split_sysfs_arg(char *param, s64 *value_ret)
 {
-	char param[32];
 	char __maybe_unused *value_str;
 
-	if (!str || strlen(str) == 0)
-		return 0;
-
-	strscpy(param, str);
-
 #ifdef CONFIG_BTRFS_EXPERIMENTAL
 	/* Separate value from input in policy:value format. */
 	value_str = strchr(param, ':');
@@ -1358,6 +1353,23 @@ int btrfs_read_policy_to_enum(const char *str, s64 *value_ret)
 	}
 #endif
 
+	return 0;
+}
+
+int btrfs_read_policy_to_enum(const char *str, s64 *value_ret)
+{
+	int ret;
+	char param[32] = { 0 };
+
+	/* If the policy is empty, point to the default at index 0. */
+	if (!str || strlen(str) == 0)
+		return 0;
+
+	strncpy(param, str, sizeof(param) - 1);
+	ret = btrfs_split_sysfs_arg(param, value_ret);
+	if (ret < 0)
+		return ret;
+
 	return sysfs_match_string(btrfs_read_policy_name, param);
 }
 
@@ -1366,8 +1378,10 @@ int __init btrfs_read_policy_init(void)
 {
 	s64 value;
 
-	if (btrfs_read_policy_to_enum(read_policy, &value) == -EINVAL) {
-		btrfs_err(NULL, "invalid read policy or value %s", read_policy);
+	/* Verify whether the read_policy from modprobe is correct. */
+	if (btrfs_read_policy_to_enum(read_policy, &value) < 0) {
+		btrfs_err(NULL, "invalid read policy or value %s",
+			  read_policy);
 		return -EINVAL;
 	}
 
-- 
2.49.0


* [PATCH 04/10] btrfs: introduce device allocation method
  2025-05-12 18:07 [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Anand Jain
                   ` (2 preceding siblings ...)
  2025-05-12 18:07 ` [PATCH 03/10] btrfs: introduce btrfs_split_sysfs_arg() for argument parsing Anand Jain
@ 2025-05-12 18:07 ` Anand Jain
  2025-05-12 18:07 ` [PATCH 05/10] btrfs: sysfs: show " Anand Jain
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:07 UTC (permalink / raw)
  To: linux-btrfs

This commit introduces `enum btrfs_device_allocation_method` in
preparation for supporting different chunk allocation methods. The default
remains the existing free-space-based device sorting.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/volumes.c | 13 ++++++++++---
 fs/btrfs/volumes.h |  9 +++++++++
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 89835071cfea..9592c30217a2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1256,6 +1256,7 @@ static int open_fs_devices(struct btrfs_fs_devices *fs_devices,
 	fs_devices->opened = 1;
 	fs_devices->latest_dev = latest_dev;
 	fs_devices->total_rw_bytes = 0;
+	fs_devices->device_alloc_method = BTRFS_DEV_ALLOC_BY_SPACE;
 	fs_devices->chunk_alloc_policy = BTRFS_CHUNK_ALLOC_REGULAR;
 #ifdef CONFIG_BTRFS_EXPERIMENTAL
 	fs_devices->rr_min_contig_read = BTRFS_DEFAULT_RR_MIN_CONTIG_READ;
@@ -5287,10 +5288,16 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 	ctl->ndevs = ndevs;
 
 	/*
-	 * now sort the devices by hole size / available space
+	 * Now sort the devices by hole size / available space.
 	 */
-	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
-	     btrfs_cmp_device_info, NULL);
+	switch (fs_devices->device_alloc_method) {
+	default:
+		fallthrough;
+	case BTRFS_DEV_ALLOC_BY_SPACE:
+		sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+		     btrfs_cmp_device_info, NULL);
+		break;
+	}
 
 	return 0;
 }
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 137cc232f58e..0cc799629ccf 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -295,6 +295,12 @@ BTRFS_DEVICE_GETSET_FUNCS(total_bytes);
 BTRFS_DEVICE_GETSET_FUNCS(disk_total_bytes);
 BTRFS_DEVICE_GETSET_FUNCS(bytes_used);
 
+/* Btrfs on disk chunk allocation methods. */
+enum btrfs_device_allocation_method {
+	BTRFS_DEV_ALLOC_BY_SPACE,
+	BTRFS_DEV_ALLOC_NR,
+};
+
 enum btrfs_chunk_allocation_policy {
 	BTRFS_CHUNK_ALLOC_REGULAR,
 	BTRFS_CHUNK_ALLOC_ZONED,
@@ -440,6 +446,9 @@ struct btrfs_fs_devices {
 	struct kobject *devinfo_kobj;
 	struct completion kobj_unregister;
 
+	/* Method for selecting devices during chunk allocation */
+	enum btrfs_device_allocation_method device_alloc_method;
+
 	enum btrfs_chunk_allocation_policy chunk_alloc_policy;
 
 	/* Policy used to read the mirrored stripes. */
-- 
2.49.0


* [PATCH 05/10] btrfs: sysfs: show device allocation method
  2025-05-12 18:07 [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Anand Jain
                   ` (3 preceding siblings ...)
  2025-05-12 18:07 ` [PATCH 04/10] btrfs: introduce device allocation method Anand Jain
@ 2025-05-12 18:07 ` Anand Jain
  2025-05-12 18:07 ` [PATCH 06/10] btrfs: skip device sorting when only one device is present Anand Jain
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:07 UTC (permalink / raw)
  To: linux-btrfs

In preparation for adding more device allocation methods, expose the
existing free-space-based allocation method through sysfs, so that at any
time we can check how the kernel is allocating chunks with a command
such as:

	$ cat /sys/fs/btrfs/<FSID>/device_allocation
	[space]
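
The bracketed output marks the active choice among all available ones. A
small userspace sketch of that formatting, assuming a caller-supplied list
of names (format_choices() is a hypothetical helper, not part of this
patch, and uses snprintf() where the kernel uses sysfs_emit_at()):

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Write the names separated by spaces, the active one in brackets, and a
 * trailing newline. Returns the length written, or -1 if buf is too small.
 */
static int format_choices(char *buf, size_t size, const char *const *names,
			  size_t count, size_t active)
{
	size_t len = 0;

	for (size_t i = 0; i < count; i++) {
		int n = snprintf(buf + len, size - len, "%s%s%s%s",
				 i ? " " : "",
				 i == active ? "[" : "",
				 names[i],
				 i == active ? "]" : "");

		if (n < 0 || (size_t)n >= size - len)
			return -1;
		len += (size_t)n;
	}

	if (size - len < 2)
		return -1;
	buf[len++] = '\n';
	buf[len] = '\0';
	return (int)len;
}
```

With a single name "space" and active index 0, the buffer holds "[space]"
followed by a newline, matching the output shown above.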

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/sysfs.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 6ba118d45a92..d07c22e05088 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -435,6 +435,15 @@ static ssize_t temp_fsid_supported_show(struct kobject *kobj,
 }
 BTRFS_ATTR(static_feature, temp_fsid, temp_fsid_supported_show);
 
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+static ssize_t device_alloc_supported_show(struct kobject *kobj,
+					   struct kobj_attribute *a, char *buf)
+{
+	return sysfs_emit(buf, "0\n");
+}
+BTRFS_ATTR(static_feature, device_allocation, device_alloc_supported_show);
+#endif
+
 /*
  * Features which only depend on kernel version.
  *
@@ -449,6 +458,9 @@ static struct attribute *btrfs_supported_static_feature_attrs[] = {
 	BTRFS_ATTR_PTR(static_feature, supported_rescue_options),
 	BTRFS_ATTR_PTR(static_feature, supported_sectorsizes),
 	BTRFS_ATTR_PTR(static_feature, temp_fsid),
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	BTRFS_ATTR_PTR(static_feature, device_allocation),
+#endif
 	NULL
 };
 
@@ -1506,6 +1518,88 @@ static ssize_t btrfs_read_policy_store(struct kobject *kobj,
 }
 BTRFS_ATTR_RW(, read_policy, btrfs_read_policy_show, btrfs_read_policy_store);
 
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+/*
+ * We only need sys/fs/btrfs/UUID/device_allocation for testing.
+ * Promote this to be under CONFIG_BTRFS_DEBUG when appropriate.
+ */
+static const char *btrfs_dev_alloc_name[] = {
+	"space",
+};
+
+static int btrfs_dev_alloc_name_to_enum(const char *str, s64 *value_ret)
+{
+	int ret;
+	char param[32] = { 0 };
+
+	/* If the policy is empty, point to the default at index 0. */
+	if (!str || strlen(str) == 0)
+		return 0;
+
+	strncpy(param, str, sizeof(param) - 1);
+
+	ret = btrfs_split_sysfs_arg(param, value_ret);
+	if (ret < 0)
+		return ret;
+
+	return sysfs_match_string(btrfs_dev_alloc_name, param);
+}
+
+static ssize_t btrfs_device_alloc_show(struct kobject *kobj,
+				       struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj);
+	enum btrfs_device_allocation_method dev_alloc;
+	ssize_t ret = 0;
+	int i;
+
+	dev_alloc = READ_ONCE(fs_devices->device_alloc_method);
+
+	for (i = 0; i < BTRFS_DEV_ALLOC_NR; i++) {
+		if (ret != 0)
+			ret += sysfs_emit_at(buf, ret, " ");
+
+		if (i == dev_alloc)
+			ret += sysfs_emit_at(buf, ret, "[");
+
+		ret += sysfs_emit_at(buf, ret, "%s", btrfs_dev_alloc_name[i]);
+
+		if (i == dev_alloc)
+			ret += sysfs_emit_at(buf, ret, "]");
+	}
+
+	ret += sysfs_emit_at(buf, ret, "\n");
+
+	return ret;
+
+}
+
+static ssize_t btrfs_device_alloc_store(struct kobject *kobj,
+					struct kobj_attribute *a,
+					const char *buf, size_t len)
+{
+	struct btrfs_fs_devices *fs_devices = to_fs_devs(kobj);
+	int index;
+	s64 value = -1;
+
+	index = btrfs_dev_alloc_name_to_enum(buf, &value);
+	if (index < 0)
+		return -EINVAL;
+
+	if (index != READ_ONCE(fs_devices->device_alloc_method)) {
+		WRITE_ONCE(fs_devices->device_alloc_method, index);
+		btrfs_info(fs_devices->fs_info,
+			   "device allocation method set to: '%s'",
+			   btrfs_dev_alloc_name[index]);
+	}
+
+	return len;
+
+}
+BTRFS_ATTR_RW(, device_allocation, btrfs_device_alloc_show,
+	      btrfs_device_alloc_store);
+#endif
+
 static ssize_t btrfs_bg_reclaim_threshold_show(struct kobject *kobj,
 					       struct kobj_attribute *a,
 					       char *buf)
@@ -1604,6 +1698,7 @@ static const struct attribute *btrfs_attrs[] = {
 	BTRFS_ATTR_PTR(, temp_fsid),
 #ifdef CONFIG_BTRFS_EXPERIMENTAL
 	BTRFS_ATTR_PTR(, offload_csum),
+	BTRFS_ATTR_PTR(, device_allocation),
 #endif
 	NULL,
 };
-- 
2.49.0


* [PATCH 06/10] btrfs: skip device sorting when only one device is present
  2025-05-12 18:07 [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Anand Jain
                   ` (4 preceding siblings ...)
  2025-05-12 18:07 ` [PATCH 05/10] btrfs: sysfs: show " Anand Jain
@ 2025-05-12 18:07 ` Anand Jain
  2025-05-12 18:07 ` [PATCH 07/10] btrfs: refactor chunk allocation device handling to use list_head Anand Jain
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:07 UTC (permalink / raw)
  To: linux-btrfs

No need to sort devices if there is only a single device.
Return early to avoid unnecessary work.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/volumes.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9592c30217a2..704ef78999e0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5287,6 +5287,10 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 	}
 	ctl->ndevs = ndevs;
 
+	/* No sorting is required if there is only one device */
+	if (ctl->ndevs == 1)
+		return 0;
+
 	/*
 	 * Now sort the devices by hole size / available space.
 	 */
-- 
2.49.0


* [PATCH 07/10] btrfs: refactor chunk allocation device handling to use list_head
  2025-05-12 18:07 [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Anand Jain
                   ` (5 preceding siblings ...)
  2025-05-12 18:07 ` [PATCH 06/10] btrfs: skip device sorting when only one device is present Anand Jain
@ 2025-05-12 18:07 ` Anand Jain
  2025-05-12 18:07 ` [PATCH 08/10] btrfs: introduce explicit device roles for block groups Anand Jain
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:07 UTC (permalink / raw)
  To: linux-btrfs

Refactor the chunk allocation path to use list_head for managing
btrfs_device_info, replacing the previous kcalloc()'d array and sort()
approach. This improves code consistency and prepares for adding more
sort choices in the following patches.

Device info structs are now allocated individually via kzalloc(), added to
a list, and sorted using list_sort(). Associated functions are updated to
take struct list_head * as arg. Cleanup iterates the list for kfree().

Finally, calculate smallest_avail during gather_device_info() and store
it in alloc_chunk_ctl to simplify the decide_stripe_size_*() logic, avoiding
lookups in the sorted list.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/volumes.c | 101 ++++++++++++++++++++++++++++-----------------
 fs/btrfs/volumes.h |   1 +
 2 files changed, 65 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 704ef78999e0..2ae6ead3fb43 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5071,10 +5071,14 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
 /*
  * sort the devices in descending order by max_avail, total_avail
  */
-static int btrfs_cmp_device_info(const void *a, const void *b)
+static int btrfs_cmp_device_space(void *priv, const struct list_head *a,
+				  const struct list_head *b)
 {
-	const struct btrfs_device_info *di_a = a;
-	const struct btrfs_device_info *di_b = b;
+	const struct btrfs_device_info *di_a;
+	const struct btrfs_device_info *di_b;
+
+	di_a = list_entry(a, struct btrfs_device_info, list);
+	di_b = list_entry(b, struct btrfs_device_info, list);
 
 	if (di_a->max_avail > di_b->max_avail)
 		return -1;
@@ -5131,6 +5135,8 @@ struct alloc_chunk_ctl {
 	u64 dev_extent_min;
 	u64 stripe_size;
 	u64 chunk_size;
+	/* Smallest free space of all available devices */
+	u64 smallest_avail;
 	int ndevs;
 	/* Space_info the block group is going to belong. */
 	struct btrfs_space_info *space_info;
@@ -5205,6 +5211,7 @@ static void init_alloc_chunk_ctl(struct btrfs_fs_devices *fs_devices,
 	ctl->ncopies = btrfs_raid_array[index].ncopies;
 	ctl->nparity = btrfs_raid_array[index].nparity;
 	ctl->ndevs = 0;
+	ctl->smallest_avail = 0;
 
 	switch (fs_devices->chunk_alloc_policy) {
 	default:
@@ -5221,9 +5228,10 @@ static void init_alloc_chunk_ctl(struct btrfs_fs_devices *fs_devices,
 
 static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 			      struct alloc_chunk_ctl *ctl,
-			      struct btrfs_device_info *devices_info)
+			      struct list_head *devices_info)
 {
 	struct btrfs_fs_info *info = fs_devices->fs_info;
+	struct btrfs_device_info *device_info;
 	struct btrfs_device *device;
 	u64 total_avail;
 	u64 dev_extent_want = ctl->max_stripe_size * ctl->dev_stripes;
@@ -5233,7 +5241,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 	u64 dev_offset;
 
 	/*
-	 * in the first pass through the devices list, we gather information
+	 * In the first pass through the devices list, we gather information
 	 * about the available holes on each device.
 	 */
 	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
@@ -5274,16 +5282,26 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 			continue;
 		}
 
+		if (ctl->smallest_avail == 0 || ctl->smallest_avail > max_avail)
+			ctl->smallest_avail = max_avail;
+
 		if (ndevs == fs_devices->rw_devices) {
 			WARN(1, "%s: found more than %llu devices\n",
 			     __func__, fs_devices->rw_devices);
 			break;
 		}
-		devices_info[ndevs].dev_offset = dev_offset;
-		devices_info[ndevs].max_avail = max_avail;
-		devices_info[ndevs].total_avail = total_avail;
-		devices_info[ndevs].dev = device;
+
+		device_info = kzalloc(sizeof(*device_info), GFP_KERNEL);
+		if (!device_info)
+			return -ENOMEM;
+
+		list_add_tail(&device_info->list, devices_info);
 		++ndevs;
+
+		device_info->dev_offset = dev_offset;
+		device_info->max_avail = max_avail;
+		device_info->total_avail = total_avail;
+		device_info->dev = device;
 	}
 	ctl->ndevs = ndevs;
 
@@ -5298,16 +5316,14 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 	default:
 		fallthrough;
 	case BTRFS_DEV_ALLOC_BY_SPACE:
-		sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
-		     btrfs_cmp_device_info, NULL);
+		list_sort(NULL, devices_info, btrfs_cmp_device_space);
 		break;
 	}
 
 	return 0;
 }
 
-static int decide_stripe_size_regular(struct alloc_chunk_ctl *ctl,
-				      struct btrfs_device_info *devices_info)
+static int decide_stripe_size_regular(struct alloc_chunk_ctl *ctl)
 {
 	/* Number of stripes that count for block group size */
 	int data_stripes;
@@ -5319,8 +5335,7 @@ static int decide_stripe_size_regular(struct alloc_chunk_ctl *ctl,
 	 * The DUP profile stores more than one stripe per device, the
 	 * max_avail is the total size so we have to adjust.
 	 */
-	ctl->stripe_size = div_u64(devices_info[ctl->ndevs - 1].max_avail,
-				   ctl->dev_stripes);
+	ctl->stripe_size = div_u64(ctl->smallest_avail, ctl->dev_stripes);
 	ctl->num_stripes = ctl->ndevs * ctl->dev_stripes;
 
 	/* This will have to be fixed for RAID1 and RAID10 over more drives */
@@ -5354,19 +5369,23 @@ static int decide_stripe_size_regular(struct alloc_chunk_ctl *ctl,
 }
 
 static int decide_stripe_size_zoned(struct alloc_chunk_ctl *ctl,
-				    struct btrfs_device_info *devices_info)
+				    struct list_head *devices_info)
 {
-	u64 zone_size = devices_info[0].dev->zone_info->zone_size;
+	struct btrfs_device_info *device_info;
+	u64 zone_size;
 	/* Number of stripes that count for block group size */
 	int data_stripes;
 
+	device_info = list_first_entry(devices_info,
+				       struct btrfs_device_info, list);
+	zone_size = device_info->dev->zone_info->zone_size;
 	/*
 	 * It should hold because:
 	 *    dev_extent_min == dev_extent_want == zone_size * dev_stripes
 	 */
-	ASSERT(devices_info[ctl->ndevs - 1].max_avail == ctl->dev_extent_min,
+	ASSERT(ctl->smallest_avail == ctl->dev_extent_min,
 	       "ndevs=%d max_avail=%llu dev_extent_min=%llu", ctl->ndevs,
-	       devices_info[ctl->ndevs - 1].max_avail, ctl->dev_extent_min);
+	       ctl->smallest_avail, ctl->dev_extent_min);
 
 	ctl->stripe_size = zone_size;
 	ctl->num_stripes = ctl->ndevs * ctl->dev_stripes;
@@ -5391,7 +5410,7 @@ static int decide_stripe_size_zoned(struct alloc_chunk_ctl *ctl,
 
 static int decide_stripe_size(struct btrfs_fs_devices *fs_devices,
 			      struct alloc_chunk_ctl *ctl,
-			      struct btrfs_device_info *devices_info)
+			      struct list_head *devices_info)
 {
 	struct btrfs_fs_info *info = fs_devices->fs_info;
 
@@ -5418,7 +5437,7 @@ static int decide_stripe_size(struct btrfs_fs_devices *fs_devices,
 		btrfs_warn_unknown_chunk_allocation(fs_devices->chunk_alloc_policy);
 		fallthrough;
 	case BTRFS_CHUNK_ALLOC_REGULAR:
-		return decide_stripe_size_regular(ctl, devices_info);
+		return decide_stripe_size_regular(ctl);
 	case BTRFS_CHUNK_ALLOC_ZONED:
 		return decide_stripe_size_zoned(ctl, devices_info);
 	}
@@ -5511,15 +5530,17 @@ struct btrfs_chunk_map *btrfs_alloc_chunk_map(int num_stripes, gfp_t gfp)
 }
 
 static struct btrfs_block_group *create_chunk(struct btrfs_trans_handle *trans,
-			struct alloc_chunk_ctl *ctl,
-			struct btrfs_device_info *devices_info)
+					      struct alloc_chunk_ctl *ctl,
+					      struct list_head *devices_info)
 {
 	struct btrfs_fs_info *info = trans->fs_info;
 	struct btrfs_chunk_map *map;
 	struct btrfs_block_group *block_group;
+	struct btrfs_device_info *device_info;
 	u64 start = ctl->start;
 	u64 type = ctl->type;
 	int ret;
+	int dev_cnt = 0;
 
 	map = btrfs_alloc_chunk_map(ctl->num_stripes, GFP_NOFS);
 	if (!map)
@@ -5534,13 +5555,17 @@ static struct btrfs_block_group *create_chunk(struct btrfs_trans_handle *trans,
 	map->sub_stripes = ctl->sub_stripes;
 	map->num_stripes = ctl->num_stripes;
 
-	for (int i = 0; i < ctl->ndevs; i++) {
+	list_for_each_entry(device_info, devices_info, list) {
+		if (dev_cnt >= ctl->ndevs)
+			break;
 		for (int j = 0; j < ctl->dev_stripes; j++) {
-			int s = i * ctl->dev_stripes + j;
-			map->stripes[s].dev = devices_info[i].dev;
-			map->stripes[s].physical = devices_info[i].dev_offset +
+			int s = dev_cnt * ctl->dev_stripes + j;
+
+			map->stripes[s].dev = device_info->dev;
+			map->stripes[s].physical = device_info->dev_offset +
 						   j * ctl->stripe_size;
 		}
+		dev_cnt++;
 	}
 
 	trace_btrfs_chunk_alloc(info, map, start, ctl->chunk_size);
@@ -5583,7 +5608,7 @@ struct btrfs_block_group *btrfs_create_chunk(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_fs_info *info = trans->fs_info;
 	struct btrfs_fs_devices *fs_devices = info->fs_devices;
-	struct btrfs_device_info *devices_info = NULL;
+	LIST_HEAD(devices_info);
 	struct alloc_chunk_ctl ctl;
 	struct btrfs_block_group *block_group;
 	int ret;
@@ -5612,27 +5637,29 @@ struct btrfs_block_group *btrfs_create_chunk(struct btrfs_trans_handle *trans,
 	ctl.space_info = space_info;
 	init_alloc_chunk_ctl(fs_devices, &ctl);
 
-	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
-			       GFP_NOFS);
-	if (!devices_info)
-		return ERR_PTR(-ENOMEM);
-
-	ret = gather_device_info(fs_devices, &ctl, devices_info);
+	ret = gather_device_info(fs_devices, &ctl, &devices_info);
 	if (ret < 0) {
 		block_group = ERR_PTR(ret);
 		goto out;
 	}
 
-	ret = decide_stripe_size(fs_devices, &ctl, devices_info);
+	ret = decide_stripe_size(fs_devices, &ctl, &devices_info);
 	if (ret < 0) {
 		block_group = ERR_PTR(ret);
 		goto out;
 	}
 
-	block_group = create_chunk(trans, &ctl, devices_info);
+	block_group = create_chunk(trans, &ctl, &devices_info);
 
 out:
-	kfree(devices_info);
+	while (!list_empty(&devices_info)) {
+		struct btrfs_device_info *device_info;
+
+		device_info = list_first_entry(&devices_info,
+					       struct btrfs_device_info, list);
+		list_del(&device_info->list);
+		kfree(device_info);
+	}
 	return block_group;
 }
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 0cc799629ccf..dea6265a2dc8 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -594,6 +594,7 @@ struct btrfs_io_context {
 };
 
 struct btrfs_device_info {
+	struct list_head list;
 	struct btrfs_device *dev;
 	u64 dev_offset;
 	u64 max_avail;
-- 
2.49.0


* [PATCH 08/10] btrfs: introduce explicit device roles for block groups
  2025-05-12 18:07 [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Anand Jain
                   ` (6 preceding siblings ...)
  2025-05-12 18:07 ` [PATCH 07/10] btrfs: refactor chunk allocation device handling to use list_head Anand Jain
@ 2025-05-12 18:07 ` Anand Jain
  2025-05-12 18:07 ` [PATCH 09/10] btrfs: introduce ROLE_THEN_SPACE device allocation method Anand Jain
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:07 UTC (permalink / raw)
  To: linux-btrfs

This commit introduces device roles to manage the intended usage of
devices for metadata and data block groups. Five distinct roles are
defined in the `enum btrfs_device_roles`:

  enum btrfs_device_roles {
	BTRFS_DEVICE_ROLE_METADATA_ONLY = 20,
	BTRFS_DEVICE_ROLE_METADATA      = 40,
	BTRFS_DEVICE_ROLE_NONE          = 80,
	BTRFS_DEVICE_ROLE_DATA          = 100,
	BTRFS_DEVICE_ROLE_DATA_ONLY     = 120,
  };

Devices marked with the `_ONLY` suffix are dedicated exclusively to the
specified block group type (metadata or data). In contrast, devices
without the `_ONLY` suffix indicate a preference for the designated
block group. While these devices will primarily be used for their
preferred type, they can also accommodate other block group types if a
critical out-of-space condition arises, allowing for allocation to
succeed.

The `parse_device_role()` function converts a string ("metadata",
"data", "metadata-only", "data-only", or one of the abbreviations "m", "d",
"monly", "donly") into the corresponding `enum btrfs_device_roles` value.
This will be used to configure device roles.
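
For comparison, here is a table-driven sketch of the same string-to-role
mapping using exact matches. It is a userspace illustration under stated
assumptions, not the function in this patch: parse_role_exact(), enum
device_role, and role_names[] are hypothetical, and only the accepted
strings and role values are taken from the description above.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Role values mirror enum btrfs_device_roles from this patch. */
enum device_role {
	DEVICE_ROLE_METADATA_ONLY = 20,
	DEVICE_ROLE_METADATA      = 40,
	DEVICE_ROLE_NONE          = 80,
	DEVICE_ROLE_DATA          = 100,
	DEVICE_ROLE_DATA_ONLY     = 120,
};

/* Full name, abbreviation, and the role both map to. */
static const struct {
	const char *name;
	const char *abbrev;
	enum device_role role;
} role_names[] = {
	{ "metadata",      "m",     DEVICE_ROLE_METADATA },
	{ "data",          "d",     DEVICE_ROLE_DATA },
	{ "metadata-only", "monly", DEVICE_ROLE_METADATA_ONLY },
	{ "data-only",     "donly", DEVICE_ROLE_DATA_ONLY },
};

/* Return 0 and set *role on an exact match, -EINVAL otherwise. */
static int parse_role_exact(const char *str, enum device_role *role)
{
	for (size_t i = 0; i < sizeof(role_names) / sizeof(role_names[0]); i++) {
		if (strcmp(str, role_names[i].name) == 0 ||
		    strcmp(str, role_names[i].abbrev) == 0) {
			*role = role_names[i].role;
			return 0;
		}
	}
	return -EINVAL;
}
```

The design choice here is strcmp() rather than strncmp() bounded by
strlen(str): a length-bounded comparison also accepts any prefix of a name,
including the empty string, whereas an exact match rejects inputs such as
"" or "me".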

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/volumes.c | 21 +++++++++++++++++++++
 fs/btrfs/volumes.h | 20 ++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2ae6ead3fb43..a44749103410 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2687,6 +2687,27 @@ static int btrfs_finish_sprout(struct btrfs_trans_handle *trans)
 	return ret;
 }
 
+int parse_device_role(char *str, enum btrfs_device_roles *role)
+{
+	if (strncmp(str, "m", strlen(str)) == 0 ||
+	    strncmp(str, "metadata", strlen(str)) == 0) {
+		*role = BTRFS_DEVICE_ROLE_METADATA;
+	} else if (strncmp(str, "d", strlen(str)) == 0 ||
+	    strncmp(str, "data", strlen(str)) == 0) {
+		*role = BTRFS_DEVICE_ROLE_DATA;
+	} else if (strncmp(str, "monly", strlen(str)) == 0 ||
+	    strncmp(str, "metadata-only", strlen(str)) == 0) {
+		*role = BTRFS_DEVICE_ROLE_METADATA_ONLY;
+	} else if (strncmp(str, "donly", strlen(str)) == 0 ||
+	    strncmp(str, "data-only", strlen(str)) == 0) {
+		*role = BTRFS_DEVICE_ROLE_DATA_ONLY;
+	} else {
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path)
 {
 	struct btrfs_root *root = fs_info->dev_root;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index dea6265a2dc8..7dbdfe502481 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -82,6 +82,25 @@ enum btrfs_raid_types {
 	BTRFS_NR_RAID_TYPES
 };
 
+#define BTRFS_DEVICE_ROLE_MASK	0xff
+/*
+ * device_role value and how it will be used.
+ * 	      0: Unused
+ *	   1-20: Metadata only
+ *	  21-40: Metadata preferred
+ *	  41-80: Anything|None
+ *	 81-100: Data preferred
+ *	101-128: Data only
+ * Declare some predefined easy to use device_bg_type values
+ */
+enum btrfs_device_roles {
+	BTRFS_DEVICE_ROLE_METADATA_ONLY = 20,
+	BTRFS_DEVICE_ROLE_METADATA      = 40,
+	BTRFS_DEVICE_ROLE_NONE          = 80,
+	BTRFS_DEVICE_ROLE_DATA          = 100,
+	BTRFS_DEVICE_ROLE_DATA_ONLY     = 120,
+};
+
 /*
  * Use sequence counter to get consistent device stat data on
  * 32-bit processors.
@@ -756,6 +775,7 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 struct btrfs_device *btrfs_find_device(const struct btrfs_fs_devices *fs_devices,
 				       const struct btrfs_dev_lookup_args *args);
 int btrfs_shrink_device(struct btrfs_device *device, u64 new_size);
+int parse_device_role(char *str, enum btrfs_device_roles *role);
 int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *path);
 int btrfs_balance(struct btrfs_fs_info *fs_info,
 		  struct btrfs_balance_control *bctl,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 09/10] btrfs: introduce ROLE_THEN_SPACE device allocation method
  2025-05-12 18:07 [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Anand Jain
                   ` (7 preceding siblings ...)
  2025-05-12 18:07 ` [PATCH 08/10] btrfs: introduce explicit device roles for block groups Anand Jain
@ 2025-05-12 18:07 ` Anand Jain
  2025-05-12 18:07 ` [PATCH 10/10] btrfs: pass device roles through device add ioctl Anand Jain
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:07 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new device allocation method, BTRFS_DEV_ALLOC_BY_ROLE_THEN_SPACE,
allowing chunk allocation to prioritize devices based on role suitability
before considering free space.

This patch adds:
- Filtering in gather_device_info() to skip devices whose role
  (e.g., DATA_ONLY, METADATA_ONLY) is incompatible with the current
  allocation type (DATA, METADATA, SYSTEM).
- A new comparator, btrfs_cmp_device_role(), to sort devices based on
  role suitability for the given allocation type.
- Logic in gather_device_info() for the new policy: first uses
  list_sort() with btrfs_cmp_device_role(), then sorts again within
  each role group using list_sort() with btrfs_cmp_device_space().

This allows for more granular control over chunk placement based on
defined device roles.
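
The intended ordering can be illustrated outside the kernel. The sketch
below is a userspace approximation under stated assumptions, not the code in
this patch: it filters out devices with an incompatible exclusive role and
then sorts with a single comparator that compares roles first and free space
second, whereas the patch runs list_sort() once by role and then once per
role by space. `struct dev_info`, `role_allows()`, `cmp_role_then_space()`
and `order_devices()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Values copied from enum btrfs_device_roles; the names are shortened. */
enum device_role {
	ROLE_METADATA_ONLY = 20,
	ROLE_METADATA      = 40,
	ROLE_NONE          = 80,
	ROLE_DATA          = 100,
	ROLE_DATA_ONLY     = 120,
};

struct dev_info {
	enum device_role role;
	uint64_t max_avail;
	uint64_t total_avail;
};

/* Exclusive roles reject the other allocation type. */
static bool role_allows(enum device_role role, bool for_data)
{
	return for_data ? role != ROLE_METADATA_ONLY : role != ROLE_DATA_ONLY;
}

/* qsort() takes no context argument, so the direction is a file-scope flag. */
static bool sort_for_data;

static int cmp_role_then_space(const void *pa, const void *pb)
{
	const struct dev_info *a = pa;
	const struct dev_info *b = pb;

	if (a->role != b->role) {
		/* Data: highest role value first. Metadata: lowest first. */
		if (sort_for_data)
			return a->role > b->role ? -1 : 1;
		return a->role < b->role ? -1 : 1;
	}
	/* Same role: more free space first. */
	if (a->max_avail != b->max_avail)
		return a->max_avail > b->max_avail ? -1 : 1;
	if (a->total_avail != b->total_avail)
		return a->total_avail > b->total_avail ? -1 : 1;
	return 0;
}

/* Drop ineligible devices in place, sort the rest, return how many remain. */
static size_t order_devices(struct dev_info *devs, size_t n, bool for_data)
{
	size_t kept = 0;

	assert(devs != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (role_allows(devs[i].role, for_data))
			devs[kept++] = devs[i];

	sort_for_data = for_data;
	qsort(devs, kept, sizeof(*devs), cmp_role_then_space);
	return kept;
}
```

For a metadata allocation this yields metadata_only, metadata, none, data,
with ties inside a role broken by free space; for a data allocation the role
order is reversed and metadata_only devices are skipped.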

To check whether 'role-then-space' is active during testing, this patch
adds the new method to the previously added sysfs interface:

    $ cat /sys/fs/btrfs/UUID/device_allocation
	[space] role-then-space

Compatibility:
 - In older on-disk formats, `dev_item::type:4` is zero. This implies
   `BTRFS_DEVICE_ROLE_NONE` in newer kernels, which is acceptable.
   Although `BTRFS_DEVICE_ROLE_NONE` is not equal to 0, the kernel always
   writes it as 0, thus maintaining backward compatibility (if needed).
   Therefore, older on-disk formats will seamlessly work on newer kernels
   and vice versa if device roles are not activated.

 - However, if the on-disk format is newer and device roles are active
   (i.e., `dev_item::type:4 != 0`), mounting on an older kernel also works
   because the kernel or btrfs checks never enforced that `dev_item::type`
   must be zero. Nevertheless, when mounted on an older kernel, chunk
   allocation will likely be based solely on available space.
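
The zero-means-none rule can be sketched as a small decoding helper. This is
an illustration only, mirroring the `device->type & BTRFS_DEVICE_ROLE_MASK`
check in gather_device_info() below; the 0xff mask is taken from
BTRFS_DEVICE_ROLE_MASK in patch 08, and `role_from_dev_type()` is a
hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define DEVICE_ROLE_MASK 0xffULL	/* same value as BTRFS_DEVICE_ROLE_MASK */

/* Values copied from enum btrfs_device_roles; the names are shortened. */
enum device_role {
	ROLE_METADATA_ONLY = 20,
	ROLE_METADATA      = 40,
	ROLE_NONE          = 80,
	ROLE_DATA          = 100,
	ROLE_DATA_ONLY     = 120,
};

/* The "none" role is not numerically zero, so zero needs an explicit mapping. */
static_assert(ROLE_NONE != 0, "zero on disk must be translated to ROLE_NONE");

/* Hypothetical helper: decode the role bits of an on-disk device type. */
static enum device_role role_from_dev_type(uint64_t type)
{
	uint64_t raw = type & DEVICE_ROLE_MASK;

	/* Filesystems without roles store 0; treat that as "no role". */
	return raw ? (enum device_role)raw : ROLE_NONE;
}
```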

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/sysfs.c   |  1 +
 fs/btrfs/volumes.c | 90 +++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.h |  2 ++
 3 files changed, 85 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index d07c22e05088..db008a421450 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1525,6 +1525,7 @@ BTRFS_ATTR_RW(, read_policy, btrfs_read_policy_show, btrfs_read_policy_store);
  */
 static const char *btrfs_dev_alloc_name[] = {
 	"space",
+	"role-then-space"
 };
 
 static int btrfs_dev_alloc_name_to_enum(const char *str, s64 *value_ret)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a44749103410..fb86d5684454 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5097,21 +5097,61 @@ static int btrfs_cmp_device_space(void *priv, const struct list_head *a,
 {
 	const struct btrfs_device_info *di_a;
 	const struct btrfs_device_info *di_b;
+	enum btrfs_device_roles *role = priv;
 
 	di_a = list_entry(a, struct btrfs_device_info, list);
 	di_b = list_entry(b, struct btrfs_device_info, list);
 
-	if (di_a->max_avail > di_b->max_avail)
-		return -1;
-	if (di_a->max_avail < di_b->max_avail)
-		return 1;
-	if (di_a->total_avail > di_b->total_avail)
-		return -1;
-	if (di_a->total_avail < di_b->total_avail)
-		return 1;
+	if (!role || ((di_a->role == *role) && (di_b->role == *role))) {
+		if (di_a->max_avail > di_b->max_avail)
+			return -1;
+		if (di_a->max_avail < di_b->max_avail)
+			return 1;
+		if (di_a->total_avail > di_b->total_avail)
+			return -1;
+		if (di_a->total_avail < di_b->total_avail)
+			return 1;
+	}
+
 	return 0;
 }
 
+static int btrfs_cmp_role(enum btrfs_device_roles a, enum btrfs_device_roles b,
+			  bool assend)
+{
+	if (a == 0)
+		a = BTRFS_DEVICE_ROLE_NONE;
+
+	if (b == 0)
+		b = BTRFS_DEVICE_ROLE_NONE;
+
+	if (assend)
+		return a > b ? -1 : a < b ? 1 : 0;
+	else
+		return a < b ? -1 : a > b ? 1 : 0;
+}
+
+/* Sort the devices by their role to suit the allocation type. */
+static int btrfs_cmp_device_role(void *priv, const struct list_head *a,
+				 const struct list_head *b)
+{
+	const struct btrfs_device_info *di_a;
+	const struct btrfs_device_info *di_b;
+	u64 *type = (u64 *)priv;
+	enum btrfs_device_roles role_a;
+	enum btrfs_device_roles role_b;
+	bool assend;
+
+	di_a = list_entry(a, struct btrfs_device_info, list);
+	role_a = di_a->role;
+	di_b = list_entry(b, struct btrfs_device_info, list);
+	role_b = di_b->role;
+
+	assend = ((*type & BTRFS_BLOCK_GROUP_TYPE_MASK) == BTRFS_BLOCK_GROUP_DATA);
+
+	return btrfs_cmp_role(role_a, role_b, assend);
+}
+
 static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
 {
 	if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
@@ -5247,6 +5287,14 @@ static void init_alloc_chunk_ctl(struct btrfs_fs_devices *fs_devices,
 	}
 }
 
+static const enum btrfs_device_roles dev_roles[] = {
+	BTRFS_DEVICE_ROLE_METADATA_ONLY,
+	BTRFS_DEVICE_ROLE_METADATA,
+	BTRFS_DEVICE_ROLE_NONE,
+	BTRFS_DEVICE_ROLE_DATA,
+	BTRFS_DEVICE_ROLE_DATA_ONLY,
+};
+
 static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 			      struct alloc_chunk_ctl *ctl,
 			      struct list_head *devices_info)
@@ -5256,6 +5304,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 	struct btrfs_device *device;
 	u64 total_avail;
 	u64 dev_extent_want = ctl->max_stripe_size * ctl->dev_stripes;
+	u64 alloc_type = ctl->type & BTRFS_BLOCK_GROUP_TYPE_MASK;
 	int ret;
 	int ndevs = 0;
 	u64 max_avail;
@@ -5266,6 +5315,11 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 	 * about the available holes on each device.
 	 */
 	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
+		unsigned int dev_role = device->type & BTRFS_DEVICE_ROLE_MASK;
+
+		if (!dev_role)
+			dev_role = BTRFS_DEVICE_ROLE_NONE;
+
 		if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
 			WARN(1, KERN_ERR
 			       "BTRFS: read-only device in alloc_list\n");
@@ -5277,6 +5331,14 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
 			continue;
 
+		if (alloc_type == BTRFS_BLOCK_GROUP_DATA) {
+			if (dev_role == BTRFS_DEVICE_ROLE_METADATA_ONLY)
+				continue;
+		} else {
+			if (dev_role == BTRFS_DEVICE_ROLE_DATA_ONLY)
+				continue;
+		}
+
 		if (device->total_bytes > device->bytes_used)
 			total_avail = device->total_bytes - device->bytes_used;
 		else
@@ -5323,6 +5385,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 		device_info->max_avail = max_avail;
 		device_info->total_avail = total_avail;
 		device_info->dev = device;
+		device_info->role = dev_role;
 	}
 	ctl->ndevs = ndevs;
 
@@ -5339,6 +5402,17 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 	case BTRFS_DEV_ALLOC_BY_SPACE:
 		list_sort(NULL, devices_info, btrfs_cmp_device_space);
 		break;
+	case BTRFS_DEV_ALLOC_BY_ROLE_THEN_SPACE:
+		/* First, Sort by device roles for the given allocation type */
+		list_sort((void *)&ctl->type, devices_info, btrfs_cmp_device_role);
+
+		/* Next, for each device role, sort the devices by free space */
+		for (int i = 0; i < ARRAY_SIZE(dev_roles); i++) {
+			enum btrfs_device_roles role = dev_roles[i];
+
+			list_sort((void *)&role, devices_info, btrfs_cmp_device_space);
+		}
+		break;
 	}
 
 	return 0;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 7dbdfe502481..73e26bcb19f5 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -317,6 +317,7 @@ BTRFS_DEVICE_GETSET_FUNCS(bytes_used);
 /* Btrfs on disk chunk allocation methods. */
 enum btrfs_device_allocation_method {
 	BTRFS_DEV_ALLOC_BY_SPACE,
+	BTRFS_DEV_ALLOC_BY_ROLE_THEN_SPACE,
 	BTRFS_DEV_ALLOC_NR,
 };
 
@@ -618,6 +619,7 @@ struct btrfs_device_info {
 	u64 dev_offset;
 	u64 max_avail;
 	u64 total_avail;
+	enum btrfs_device_roles role;
 };
 
 struct btrfs_raid_attr {
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 10/10] btrfs: pass device roles through device add ioctl
  2025-05-12 18:07 [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Anand Jain
                   ` (8 preceding siblings ...)
  2025-05-12 18:07 ` [PATCH 09/10] btrfs: introduce ROLE_THEN_SPACE device allocation method Anand Jain
@ 2025-05-12 18:07 ` Anand Jain
  2025-05-12 18:09 ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation Anand Jain
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:07 UTC (permalink / raw)
  To: linux-btrfs

Extend the `btrfs device add` ioctl to allow specifying device roles.
Users can now use the syntax

	`btrfs device add /dev/sdX:<role> /btrfs`

where `<role>` is one of "m", "metadata", "d", "data", "monly",
"metadata-only", "donly", or "data-only".

The device role is stored in the on-disk field

	struct btrfs_dev_item::type:4
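
The `path:role` argument described above is split at the first colon, as the
ioctl change in this patch does with strstr(). A minimal userspace sketch of
that split follows; `split_device_role()` is a hypothetical name, and the
sketch assumes, like the patch, that the device path itself contains no
colon.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: split "path:role" in place at the first colon.
 * Returns a pointer to the role string, or NULL when no role was given.
 */
static char *split_device_role(char *arg)
{
	char *colon;

	assert(arg != NULL);

	colon = strchr(arg, ':');
	if (!colon)
		return NULL;

	*colon = '\0';		/* terminate the path part */
	return colon + 1;	/* the role follows the colon */
}
```

A path such as /dev/disk/by-path/pci-0000:00:1f.2-ata-1 would be cut at its
first colon under this rule, so a real implementation may prefer splitting
at the last colon or using a separate argument.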

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/ioctl.c   | 12 +++++++++++-
 fs/btrfs/volumes.c | 17 ++++++++++++++++-
 fs/btrfs/volumes.h |  3 ++-
 3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a498fe524c90..d61847e64733 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2588,6 +2588,8 @@ static long btrfs_ioctl_add_dev(struct btrfs_fs_info *fs_info, void __user *arg)
 	struct btrfs_ioctl_vol_args *vol_args;
 	bool restore_op = false;
 	int ret;
+	char *colon;
+	enum btrfs_device_roles role = BTRFS_DEVICE_ROLE_NONE;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
@@ -2627,8 +2629,16 @@ static long btrfs_ioctl_add_dev(struct btrfs_fs_info *fs_info, void __user *arg)
 	if (ret < 0)
 		goto out_free;
 
-	ret = btrfs_init_new_device(fs_info, vol_args->name);
+	colon = strstr(vol_args->name, ":");
+	if (colon) {
+		vol_args->name[colon - vol_args->name] = '\0';
+		colon++;
+		ret = parse_device_role(colon, &role);
+		if (ret)
+			goto out_free;
+	}
 
+	ret = btrfs_init_new_device(fs_info, vol_args->name, role);
 	if (!ret)
 		btrfs_info(fs_info, "disk added %s", vol_args->name);
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index fb86d5684454..871fff21ed85 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2708,7 +2708,9 @@ int parse_device_role(char *str, enum btrfs_device_roles *role)
 	return 0;
 }
 
-int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path)
+int btrfs_init_new_device(struct btrfs_fs_info *fs_info,
+			  const char *device_path,
+			  enum btrfs_device_roles role)
 {
 	struct btrfs_root *root = fs_info->dev_root;
 	struct btrfs_trans_handle *trans;
@@ -2784,6 +2786,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	device->io_width = fs_info->sectorsize;
 	device->io_align = fs_info->sectorsize;
 	device->sector_size = fs_info->sectorsize;
+	device->type = BTRFS_DEVICE_ROLE_MASK & role;
 	device->total_bytes =
 		round_down(bdev_nr_bytes(device->bdev), fs_info->sectorsize);
 	device->disk_total_bytes = device->total_bytes;
@@ -2841,6 +2844,18 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	 */
 	btrfs_clear_space_info_full(fs_info);
 
+	/*
+	 * Assigning a role to a device via btrfs device add automatically
+	 * activates the role-then-space allocation method if it wasn't already
+	 * active. Avoid assigning device roles if you do not intend to use the
+	 * role-then-space strategy.
+	 */
+	if (((device->type & BTRFS_DEVICE_ROLE_MASK) != BTRFS_DEVICE_ROLE_NONE ||
+	    (device->type & BTRFS_DEVICE_ROLE_MASK) != 0) &&
+	    fs_devices->device_alloc_method == BTRFS_DEV_ALLOC_BY_SPACE)
+		fs_devices->device_alloc_method =
+				BTRFS_DEV_ALLOC_BY_ROLE_THEN_SPACE;
+
 	mutex_unlock(&fs_info->chunk_mutex);
 
 	/* Add sysfs device entry */
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 73e26bcb19f5..5dc2db4ed00c 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -778,7 +778,8 @@ struct btrfs_device *btrfs_find_device(const struct btrfs_fs_devices *fs_devices
 				       const struct btrfs_dev_lookup_args *args);
 int btrfs_shrink_device(struct btrfs_device *device, u64 new_size);
 int parse_device_role(char *str, enum btrfs_device_roles *role);
-int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *path);
+int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *path,
+			  enum btrfs_device_roles role);
 int btrfs_balance(struct btrfs_fs_info *fs_info,
 		  struct btrfs_balance_control *bctl,
 		  struct btrfs_ioctl_balance_args *bargs);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation
  2025-05-12 18:07 [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Anand Jain
                   ` (9 preceding siblings ...)
  2025-05-12 18:07 ` [PATCH 10/10] btrfs: pass device roles through device add ioctl Anand Jain
@ 2025-05-12 18:09 ` Anand Jain
  2025-05-12 18:09   ` [PATCH 01/14] btrfs-progs: minor spelling correction in the list-chunk help text Anand Jain
                     ` (14 more replies)
  2025-05-12 18:11 ` [PATCH RFC 0/2] fstests: btrfs: add functional verification for device roles Anand Jain
                   ` (5 subsequent siblings)
  16 siblings, 15 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:09 UTC (permalink / raw)
  To: linux-btrfs

This series adds cleanups, fixes, and device role support to enable more
efficient kernel chunk allocation based on device performance.

Anand Jain (14):
  btrfs-progs: minor spelling correction in the list-chunk help text
  btrfs-progs: refactor devid comparison function
  btrfs-progs: rename local dev_list to devices in btrfs_alloc_chunk
  btrfs-progs: mkfs: prepare to merge duplicate if-else blocks
  btrfs-progs: mkfs: eliminate duplicate code in if-else
  btrfs-progs: mkfs: refactor test_num_disk_vs_raid - split data and
    metadata
  btrfs-progs: mkfs: device argument handling with a list
  btrfs-progs: import device role handling from the kernel
  btrfs-progs: mkfs: introduce device roles in device paths
  btrfs-progs: sort devices by role before using them
  btrfs-progs: helper for the device role within dev_item::type
  btrfs-progs: mkfs: persist device roles to dev_item::type
  btrfs-progs: update device add ioctl with device type
  btrfs-progs: disable exclusive metadata/data device roles

 cmds/device.c           |  57 +++++--
 cmds/filesystem.c       |  15 --
 cmds/inspect.c          |   2 +-
 common/device-scan.c    |   4 +-
 common/device-scan.h    |   2 +-
 common/device-utils.c   |  46 ++++++
 common/device-utils.h   |   3 +
 common/utils.c          |  30 ++--
 kernel-shared/volumes.c |  40 ++++-
 kernel-shared/volumes.h |  26 ++++
 mkfs/common.c           |   2 +-
 mkfs/common.h           |   6 +-
 mkfs/main.c             | 324 +++++++++++++++++++++++++++++++---------
 13 files changed, 430 insertions(+), 127 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 01/14] btrfs-progs: minor spelling correction in the list-chunk help text
  2025-05-12 18:09 ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation Anand Jain
@ 2025-05-12 18:09   ` Anand Jain
  2025-05-12 18:09   ` [PATCH 02/14] btrfs-progs: refactor devid comparison function Anand Jain
                     ` (13 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:09 UTC (permalink / raw)
  To: linux-btrfs

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 cmds/inspect.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds/inspect.c b/cmds/inspect.c
index d689e085703a..04c466c8afe5 100644
--- a/cmds/inspect.c
+++ b/cmds/inspect.c
@@ -692,7 +692,7 @@ static const char * const cmd_inspect_list_chunks_usage[] = {
 	"btrfs inspect-internal list-chunks [options] <path>",
 	"Enumerate chunks on all devices",
 	"Enumerate chunks on all devices. Chunks are the physical storage tied to a device,",
-	"striped profiles they appear multiple times for a ginve logical offset, on other",
+	"striped profiles they appear multiple times for a given logical offset, on other",
 	"profiles the correspondence is 1:1 or 1:N.",
 	"",
 	HELPINFO_UNITS_LONG,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 02/14] btrfs-progs: refactor devid comparison function
  2025-05-12 18:09 ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation Anand Jain
  2025-05-12 18:09   ` [PATCH 01/14] btrfs-progs: minor spelling correction in the list-chunk help text Anand Jain
@ 2025-05-12 18:09   ` Anand Jain
  2025-05-12 18:09   ` [PATCH 03/14] btrfs-progs: rename local dev_list to devices in btrfs_alloc_chunk Anand Jain
                     ` (12 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:09 UTC (permalink / raw)
  To: linux-btrfs

There were two similar but slightly different implementations of devid
comparison functions: cmp_device_id in filesystem.c and _cmp_device_by_id
in mkfs/main.c. Merge them as cmp_device_id in common/device-utils.c.
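
The difference between the two implementations matters in principle because
devid is a u64 while the comparator returns int. The sketch below contrasts
the two approaches on plain integers; it is an illustration with
hypothetical function names, and real devids are normally far too small to
trigger the problem.

```c
#include <assert.h>
#include <stdint.h>

/* The difference of two 64-bit ids does not fit in an int in general. */
static_assert(sizeof(uint64_t) > sizeof(int), "int is narrower than u64");

/*
 * Subtraction-based compare, as in the removed _cmp_device_by_id().
 * Converting the 64-bit difference to int is implementation-defined and
 * typically keeps only the low bits, so distant ids can compare as equal
 * or with the wrong sign.
 */
static int cmp_by_subtraction(uint64_t a, uint64_t b)
{
	return (int)(a - b);
}

/* Three-way compare, as in the retained cmp_device_id(). */
static int cmp_three_way(uint64_t a, uint64_t b)
{
	return a < b ? -1 : a > b ? 1 : 0;
}
```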

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 cmds/filesystem.c     | 15 ---------------
 common/device-utils.c | 12 ++++++++++++
 common/device-utils.h |  2 ++
 mkfs/main.c           |  9 +--------
 4 files changed, 15 insertions(+), 23 deletions(-)

diff --git a/cmds/filesystem.c b/cmds/filesystem.c
index 64373532e5e0..54a186f023c2 100644
--- a/cmds/filesystem.c
+++ b/cmds/filesystem.c
@@ -259,21 +259,6 @@ static int uuid_search(struct btrfs_fs_devices *fs_devices, const char *search)
 	return 0;
 }
 
-/*
- * Sort devices by devid, ascending
- */
-static int cmp_device_id(void *priv, struct list_head *a,
-		struct list_head *b)
-{
-	const struct btrfs_device *da = list_entry(a, struct btrfs_device,
-			dev_list);
-	const struct btrfs_device *db = list_entry(b, struct btrfs_device,
-			dev_list);
-
-	return da->devid < db->devid ? -1 :
-		da->devid > db->devid ? 1 : 0;
-}
-
 static void splice_device_list(struct list_head *seed_devices,
 			       struct list_head *all_devices)
 {
diff --git a/common/device-utils.c b/common/device-utils.c
index c39e6d6166ad..783d79555446 100644
--- a/common/device-utils.c
+++ b/common/device-utils.c
@@ -641,3 +641,15 @@ ssize_t btrfs_direct_pwrite(int fd, const void *buf, size_t count, off_t offset)
 	free(bounce_buf);
 	return ret;
 }
+
+/* Sort devices by devid, ascending */
+int cmp_device_id(void *priv, struct list_head *a, struct list_head *b)
+{
+	const struct btrfs_device *da = list_entry(a, struct btrfs_device,
+			dev_list);
+	const struct btrfs_device *db = list_entry(b, struct btrfs_device,
+			dev_list);
+
+	return da->devid < db->devid ? -1 :
+		da->devid > db->devid ? 1 : 0;
+}
diff --git a/common/device-utils.h b/common/device-utils.h
index 8e96154ab0a9..cef9405f3a9a 100644
--- a/common/device-utils.h
+++ b/common/device-utils.h
@@ -58,6 +58,8 @@ int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
 ssize_t btrfs_direct_pread(int fd, void *buf, size_t count, off_t offset);
 ssize_t btrfs_direct_pwrite(int fd, const void *buf, size_t count, off_t offset);
 
+int cmp_device_id(void *priv, struct list_head *a, struct list_head *b);
+
 #ifdef BTRFS_ZONED
 static inline ssize_t btrfs_pwrite(int fd, const void *buf, size_t count,
 				   off_t offset, bool direct)
diff --git a/mkfs/main.c b/mkfs/main.c
index 4c2ce98c784c..b23d0a6f092d 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -508,13 +508,6 @@ static int zero_output_file(int out_fd, u64 size)
 	return ret;
 }
 
-static int _cmp_device_by_id(void *priv, struct list_head *a,
-			     struct list_head *b)
-{
-	return list_entry(a, struct btrfs_device, dev_list)->devid -
-	       list_entry(b, struct btrfs_device, dev_list)->devid;
-}
-
 static void list_all_devices(struct btrfs_root *root, bool is_zoned)
 {
 	struct btrfs_fs_devices *fs_devices;
@@ -528,7 +521,7 @@ static void list_all_devices(struct btrfs_root *root, bool is_zoned)
 	list_for_each_entry(device, &fs_devices->devices, dev_list)
 		number_of_devices++;
 
-	list_sort(NULL, &fs_devices->devices, _cmp_device_by_id);
+	list_sort(NULL, &fs_devices->devices, cmp_device_id);
 
 	printf("Number of devices:  %d\n", number_of_devices);
 	printf("Devices:\n");
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 03/14] btrfs-progs: rename local dev_list to devices in btrfs_alloc_chunk
  2025-05-12 18:09 ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation Anand Jain
  2025-05-12 18:09   ` [PATCH 01/14] btrfs-progs: minor spelling correction in the list-chunk help text Anand Jain
  2025-05-12 18:09   ` [PATCH 02/14] btrfs-progs: refactor devid comparison function Anand Jain
@ 2025-05-12 18:09   ` Anand Jain
  2025-05-12 18:09   ` [PATCH 04/14] btrfs-progs: mkfs: prepare to merge duplicate if-else blocks Anand Jain
                     ` (11 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:09 UTC (permalink / raw)
  To: linux-btrfs

Rename the local variable dev_list to devs in btrfs_alloc_chunk() to avoid
confusion with btrfs_device::dev_list. The local variable currently points
to btrfs_fs_devices::devices.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 kernel-shared/volumes.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel-shared/volumes.c b/kernel-shared/volumes.c
index 783505480765..be01bdb4d3f6 100644
--- a/kernel-shared/volumes.c
+++ b/kernel-shared/volumes.c
@@ -1688,7 +1688,7 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_device *device = NULL;
 	struct list_head private_devs;
-	struct list_head *dev_list = &info->fs_devices->devices;
+	struct list_head *devs = &info->fs_devices->devices;
 	struct list_head *cur;
 	u64 min_free;
 	u64 avail = 0;
@@ -1698,7 +1698,7 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	int ret;
 	int index;
 
-	if (list_empty(dev_list))
+	if (list_empty(devs))
 		return -ENOSPC;
 
 	ctl.type = type;
@@ -1715,7 +1715,7 @@ again:
 		return ret;
 
 	INIT_LIST_HEAD(&private_devs);
-	cur = dev_list->next;
+	cur = devs->next;
 	index = 0;
 
 	if (type & BTRFS_BLOCK_GROUP_DUP)
@@ -1737,11 +1737,11 @@ again:
 				index++;
 		} else if (avail > max_avail)
 			max_avail = avail;
-		if (cur == dev_list)
+		if (cur == devs)
 			break;
 	}
 	if (index < ctl.num_stripes) {
-		list_splice(&private_devs, dev_list);
+		list_splice(&private_devs, devs);
 		if (index >= ctl.min_stripes) {
 			ctl.num_stripes = index;
 			if (type & (BTRFS_BLOCK_GROUP_RAID10)) {
@@ -1773,13 +1773,13 @@ again:
 	while (!list_empty(&private_devs)) {
 		device = list_entry(private_devs.next, struct btrfs_device,
 				    dev_list);
-		list_move(&device->dev_list, dev_list);
+		list_move(&device->dev_list, devs);
 	}
 	/*
 	 * All private devs moved back to @dev_list, now dev_list should not be
 	 * empty.
 	 */
-	ASSERT(!list_empty(dev_list));
+	ASSERT(!list_empty(devs));
 	*start = ctl.start;
 	*num_bytes = ctl.num_bytes;
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 04/14] btrfs-progs: mkfs: prepare to merge duplicate if-else blocks
  2025-05-12 18:09 ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation Anand Jain
                     ` (2 preceding siblings ...)
  2025-05-12 18:09   ` [PATCH 03/14] btrfs-progs: rename local dev_list to devices in btrfs_alloc_chunk Anand Jain
@ 2025-05-12 18:09   ` Anand Jain
  2025-05-12 18:09   ` [PATCH 05/14] btrfs-progs: mkfs: eliminate duplicate code in if-else Anand Jain
                     ` (10 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:09 UTC (permalink / raw)
  To: linux-btrfs

In create_metadata_block_groups(), move the line

   allocation->metadata += chunk_size;

to after the function's error return check. This aligns it with the
if-block and enables the merging of the if-else blocks.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 mkfs/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mkfs/main.c b/mkfs/main.c
index b23d0a6f092d..48aa57f23d5f 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -157,9 +157,9 @@ static int create_metadata_block_groups(struct btrfs_root *root, bool mixed,
 		ret = btrfs_make_block_group(trans, fs_info, 0,
 					     BTRFS_BLOCK_GROUP_METADATA,
 					     chunk_start, chunk_size);
-		allocation->metadata += chunk_size;
 		if (ret)
 			return ret;
+		allocation->metadata += chunk_size;
 	}
 
 	root->fs_info->system_allocs = 0;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 05/14] btrfs-progs: mkfs: eliminate duplicate code in if-else
  2025-05-12 18:09 ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation Anand Jain
                     ` (3 preceding siblings ...)
  2025-05-12 18:09   ` [PATCH 04/14] btrfs-progs: mkfs: prepare to merge duplicate if-else blocks Anand Jain
@ 2025-05-12 18:09   ` Anand Jain
  2025-05-12 18:09   ` [PATCH 06/14] btrfs-progs: mkfs: refactor test_num_disk_vs_raid - split data and metadata Anand Jain
                     ` (9 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:09 UTC (permalink / raw)
  To: linux-btrfs

The separate if and else blocks unnecessarily handled different block
group types for the mixed case. The local flag variable already accounts
for these distinctions. Merging these blocks and relying on the flag
simplifies the code.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 mkfs/main.c | 49 ++++++++++++++++---------------------------------
 1 file changed, 16 insertions(+), 33 deletions(-)

diff --git a/mkfs/main.c b/mkfs/main.c
index 48aa57f23d5f..f80b18c7ad23 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -126,41 +126,24 @@ static int create_metadata_block_groups(struct btrfs_root *root, bool mixed,
 	if (ret)
 		return ret;
 
-	if (mixed) {
-		ret = btrfs_alloc_chunk(trans, fs_info,
-					&chunk_start, &chunk_size,
-					BTRFS_BLOCK_GROUP_METADATA |
-					BTRFS_BLOCK_GROUP_DATA);
-		if (ret == -ENOSPC) {
-			error("no space to allocate data/metadata chunk");
-			goto err;
-		}
-		if (ret)
-			return ret;
-		ret = btrfs_make_block_group(trans, fs_info, 0,
-					     BTRFS_BLOCK_GROUP_METADATA |
-					     BTRFS_BLOCK_GROUP_DATA,
-					     chunk_start, chunk_size);
-		if (ret)
-			return ret;
+	ret = btrfs_alloc_chunk(trans, fs_info, &chunk_start, &chunk_size,
+				flags);
+	if (ret == -ENOSPC) {
+		error("no space to allocate data/metadata chunk");
+		goto err;
+	}
+	if (ret)
+		return ret;
+
+	ret = btrfs_make_block_group(trans, fs_info, 0, flags, chunk_start,
+				     chunk_size);
+	if (ret)
+		return ret;
+
+	if (mixed)
 		allocation->mixed += chunk_size;
-	} else {
-		ret = btrfs_alloc_chunk(trans, fs_info,
-					&chunk_start, &chunk_size,
-					BTRFS_BLOCK_GROUP_METADATA);
-		if (ret == -ENOSPC) {
-			error("no space to allocate metadata chunk");
-			goto err;
-		}
-		if (ret)
-			return ret;
-		ret = btrfs_make_block_group(trans, fs_info, 0,
-					     BTRFS_BLOCK_GROUP_METADATA,
-					     chunk_start, chunk_size);
-		if (ret)
-			return ret;
+	else
 		allocation->metadata += chunk_size;
-	}
 
 	root->fs_info->system_allocs = 0;
 	ret = btrfs_commit_transaction(trans, root);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 06/14] btrfs-progs: mkfs: refactor test_num_disk_vs_raid - split data and metadata
  2025-05-12 18:09 ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation Anand Jain
                     ` (4 preceding siblings ...)
  2025-05-12 18:09   ` [PATCH 05/14] btrfs-progs: mkfs: eliminate duplicate code in if-else Anand Jain
@ 2025-05-12 18:09   ` Anand Jain
  2025-05-12 18:09   ` [PATCH 07/14] btrfs-progs: mkfs: device argument handling with a list Anand Jain
                     ` (8 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:09 UTC (permalink / raw)
  To: linux-btrfs

This patch changes test_num_disk_vs_raid() to take a single block group
profile and calls it twice, once for metadata and once for data, so that
each call can be given its own device count. This is in preparation for
supporting device roles.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 common/utils.c | 30 ++++++++++++------------------
 mkfs/common.h  |  3 +--
 mkfs/main.c    |  7 +++++--
 3 files changed, 18 insertions(+), 22 deletions(-)

diff --git a/common/utils.c b/common/utils.c
index 9515abd47af8..ebf419224162 100644
--- a/common/utils.c
+++ b/common/utils.c
@@ -366,39 +366,33 @@ int get_fsid(const char *path, u8 *fsid, int silent)
 	return ret;
 }
 
-int test_num_disk_vs_raid(u64 metadata_profile, u64 data_profile,
-	u64 dev_cnt, int mixed, int ssd)
+int test_num_disk_vs_raid(u64 bg_profile, u64 dev_cnt, int mixed, int ssd)
 {
 	u64 allowed;
-	u64 profile = metadata_profile | data_profile;
 
 	allowed = btrfs_bg_flags_for_device_num(dev_cnt);
 
-	if (dev_cnt > 1 && profile & BTRFS_BLOCK_GROUP_DUP) {
+	if (dev_cnt > 1 && (bg_profile & BTRFS_BLOCK_GROUP_DUP)) {
 		warning("DUP is not recommended on filesystem with multiple devices");
 	}
-	if (metadata_profile & ~allowed) {
-		error("unable to create FS with metadata profile %s "
-			"(have %llu devices but %d devices are required)",
-			btrfs_group_profile_str(metadata_profile), dev_cnt,
-			btrfs_bg_type_to_devs_min(metadata_profile));
-		return 1;
-	}
-	if (data_profile & ~allowed) {
-		error("ERROR: unable to create FS with data profile %s "
+
+	if (bg_profile & ~allowed) {
+		error("unable to create FS with %s profile %s "
 			"(have %llu devices but %d devices are required)",
-			btrfs_group_profile_str(data_profile), dev_cnt,
-			btrfs_bg_type_to_devs_min(data_profile));
+			btrfs_group_type_str(bg_profile),
+			btrfs_group_profile_str(bg_profile), dev_cnt,
+			btrfs_bg_type_to_devs_min(bg_profile));
 		return 1;
 	}
 
-	if (dev_cnt == 3 && profile & BTRFS_BLOCK_GROUP_RAID6) {
+	if (dev_cnt == 3 && (bg_profile & BTRFS_BLOCK_GROUP_RAID6)) {
 		warning("RAID6 is not recommended on filesystem with 3 devices only");
 	}
-	if (dev_cnt == 2 && profile & BTRFS_BLOCK_GROUP_RAID5) {
+	if (dev_cnt == 2 && (bg_profile & BTRFS_BLOCK_GROUP_RAID5)) {
 		warning("RAID5 is not recommended on filesystem with 2 devices only");
 	}
-	warning_on(!mixed && (data_profile & BTRFS_BLOCK_GROUP_DUP) && ssd,
+	warning_on(!mixed && (bg_profile & BTRFS_BLOCK_GROUP_DUP) && ssd &&
+		   (bg_profile & BTRFS_BLOCK_GROUP_DATA),
 		   "DUP may not actually lead to 2 copies on the device, see manual page");
 
 	return 0;
diff --git a/mkfs/common.h b/mkfs/common.h
index c600c16622fa..de0e413774a4 100644
--- a/mkfs/common.h
+++ b/mkfs/common.h
@@ -107,8 +107,7 @@ u64 btrfs_min_dev_size(u32 nodesize, bool mixed, u64 zone_size, u64 meta_profile
 		       u64 data_profile);
 int test_minimum_size(const char *file, u64 min_dev_size);
 int is_vol_small(const char *file);
-int test_num_disk_vs_raid(u64 metadata_profile, u64 data_profile,
-	u64 dev_cnt, int mixed, int ssd);
+int test_num_disk_vs_raid(u64 bg_profile, u64 dev_cnt, int mixed, int ssd);
 bool test_status_for_mkfs(const char *file, bool force_overwrite);
 bool test_dev_for_mkfs(const char *file, int force_overwrite);
 
diff --git a/mkfs/main.c b/mkfs/main.c
index f80b18c7ad23..0823d378779d 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -1742,8 +1742,11 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 			goto error;
 		}
 	}
-	ret = test_num_disk_vs_raid(metadata_profile, data_profile,
-			device_count, mixed, ssd);
+	ret = test_num_disk_vs_raid(metadata_profile, device_count, mixed, ssd);
+	if (ret)
+		goto error;
+
+	ret = test_num_disk_vs_raid(data_profile, device_count, mixed, ssd);
 	if (ret)
 		goto error;
 
-- 
2.49.0



* [PATCH 07/14] btrfs-progs: mkfs: device argument handling with a list
  2025-05-12 18:09 ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation Anand Jain
                     ` (5 preceding siblings ...)
  2025-05-12 18:09   ` [PATCH 06/14] btrfs-progs: mkfs: refactor test_num_disk_vs_raid - split data and metadata Anand Jain
@ 2025-05-12 18:09   ` Anand Jain
  2025-05-12 18:09   ` [PATCH 08/14] btrfs-progs: import device role handling from the kernel Anand Jain
                     ` (7 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:09 UTC (permalink / raw)
  To: linux-btrfs

Storing the device arguments in a list makes it possible to sort the
devices by their designated roles, which is necessary to implement
distinct device roles.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 mkfs/main.c | 82 +++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 61 insertions(+), 21 deletions(-)

diff --git a/mkfs/main.c b/mkfs/main.c
index 0823d378779d..0dbc09339f24 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -999,6 +999,34 @@ static int setup_raid_stripe_tree_root(struct btrfs_fs_info *fs_info)
 	return 0;
 }
 
+struct device_arg {
+	struct list_head list;
+	char path[PATH_MAX];
+};
+
+static struct device_arg *parse_device_arg(const char *path,
+					    struct list_head *devices)
+{
+	struct device_arg *device;
+
+	device = calloc(1, sizeof(struct device_arg));
+	if (!device) {
+		error_msg(ERROR_MSG_MEMORY, NULL);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	if (arg_copy_path(device->path, path, sizeof(device->path))) {
+		error("Device path '%s' length '%ld' is too long",
+		      path, strlen(path));
+		free(device);
+		return ERR_PTR(-EINVAL);
+	}
+
+	list_add_tail(&device->list, devices);
+
+	return device;
+}
+
 /* Thread callback for device preparation */
 static void *prepare_one_device(void *ctx)
 {
@@ -1156,7 +1184,6 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	u64 min_dev_size;
 	u64 shrink_size;
 	int device_count = 0;
-	int saved_optind;
 	pthread_t *t_prepare = NULL;
 	struct prepare_device_progress *prepare_ctx = NULL;
 	struct mkfs_allocation allocation = { 0 };
@@ -1186,6 +1213,8 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	enum btrfs_compression_type compression = BTRFS_COMPRESS_NONE;
 	unsigned int compression_level = 0;
 	LIST_HEAD(subvols);
+	struct device_arg *arg_device;
+	LIST_HEAD(arg_devices);
 
 	cpu_detect_flags();
 	hash_init_accel();
@@ -1392,7 +1421,6 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 		nodesize = max_t(u32, sectorsize, BTRFS_MKFS_DEFAULT_NODE_SIZE);
 
 	stripesize = sectorsize;
-	saved_optind = optind;
 	device_count = argc - optind;
 	if (device_count == 0)
 		usage(&mkfs_cmd, 1);
@@ -1515,6 +1543,13 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	for (i = 0; i < device_count; i++) {
 		file = argv[optind++];
 
+		arg_device = parse_device_arg(file, &arg_devices);
+		if (IS_ERR(arg_device)) {
+			ret = 1;
+			goto error;
+		}
+		file = arg_device->path;
+
 		if (source_dir && path_exists(file) == 0)
 			ret = 0;
 		else if (path_is_block_device(file) == 1)
@@ -1526,10 +1561,8 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 			goto error;
 	}
 
-	optind = saved_optind;
-	device_count = argc - optind;
-
-	file = argv[optind++];
+	arg_device = list_first_entry(&arg_devices, struct device_arg, list);
+	file = arg_device->path;
 	ssd = device_get_rotational(file);
 	if (opt_zoned) {
 		if (!zone_size(file)) {
@@ -1725,10 +1758,9 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 		goto error;
 	}
 
-	for (i = saved_optind; i < saved_optind + device_count; i++) {
-		char *path;
+	list_for_each_entry(arg_device, &arg_devices, list) {
+		char *path = arg_device->path;
 
-		path = argv[i];
 		ret = test_minimum_size(path, min_dev_size);
 		if (ret < 0) {
 			error("failed to check size for %s: %m", path);
@@ -1793,17 +1825,18 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	}
 
 	opt_oflags = O_RDWR;
-	for (i = 0; i < device_count; i++) {
+	list_for_each_entry(arg_device, &arg_devices, list) {
 		if (opt_zoned &&
-		    zoned_model(argv[optind + i - 1]) == ZONED_HOST_MANAGED) {
+		    zoned_model(arg_device->path) == ZONED_HOST_MANAGED) {
 			opt_oflags |= O_DIRECT;
 			break;
 		}
 	}
 
 	/* Start threads */
-	for (i = 0; i < device_count; i++) {
-		prepare_ctx[i].file = argv[optind + i - 1];
+	i = 0;
+	list_for_each_entry(arg_device, &arg_devices, list) {
+		prepare_ctx[i].file = arg_device->path;
 		prepare_ctx[i].byte_count = byte_count;
 		prepare_ctx[i].dev_byte_count = byte_count;
 		ret = pthread_create(&t_prepare[i], NULL, prepare_one_device,
@@ -1814,6 +1847,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 					prepare_ctx[i].file);
 			goto error;
 		}
+		i++;
 	}
 
 	/* Wait for threads */
@@ -1973,11 +2007,11 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 			goto error;
 		}
 		if (bconf.verbose >= LOG_INFO) {
-			struct btrfs_device *device;
+			struct btrfs_device *tmp;
 
-			device = container_of(fs_info->fs_devices->devices.next,
-					struct btrfs_device, dev_list);
-			printf("adding device %s id %llu\n", file, device->devid);
+			tmp = container_of(fs_info->fs_devices->devices.next,
+					   struct btrfs_device, dev_list);
+			printf("adding device %s id %llu\n", file, tmp->devid);
 		}
 	}
 
@@ -2174,10 +2208,8 @@ out:
 	close_ret = close_ctree(root);
 
 	if (!close_ret) {
-		optind = saved_optind;
-		device_count = argc - optind;
-		while (device_count-- > 0) {
-			file = argv[optind++];
+		list_for_each_entry(arg_device, &arg_devices, list) {
+			file = arg_device->path;
 			if (path_is_block_device(file) == 1)
 				btrfs_register_one_device(file);
 		}
@@ -2209,6 +2241,14 @@ error:
 		free(head);
 	}
 
+	while (!list_empty(&arg_devices)) {
+		struct device_arg *head;
+
+		head = list_entry(arg_devices.next, struct device_arg, list);
+		list_del(&head->list);
+		free(head);
+	}
+
 	return !!ret;
 
 success:
-- 
2.49.0



* [PATCH 08/14] btrfs-progs: import device role handling from the kernel
  2025-05-12 18:09 ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation Anand Jain
                     ` (6 preceding siblings ...)
  2025-05-12 18:09   ` [PATCH 07/14] btrfs-progs: mkfs: device argument handling with a list Anand Jain
@ 2025-05-12 18:09   ` Anand Jain
  2025-05-12 18:09   ` [PATCH 09/14] btrfs-progs: mkfs: introduce device roles in device paths Anand Jain
                     ` (6 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:09 UTC (permalink / raw)
  To: linux-btrfs

Import the kernel's definitions and code for device roles and related
utilities to ensure consistency between btrfs-progs and the kernel.
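
For illustration only (not part of this patch): the role values fall into
bands, as described in the comment this patch adds to volumes.h. A minimal
sketch of that mapping follows; role_band() and the BAND_* names are
hypothetical and exist only for this example.

```c
/* Bands taken from the volumes.h comment; the names here are made up. */
enum role_band {
	BAND_UNUSED,
	BAND_METADATA_ONLY,
	BAND_METADATA_PREFERRED,
	BAND_NONE,
	BAND_DATA_PREFERRED,
	BAND_DATA_ONLY,
	BAND_INVALID,
};

/* Hypothetical helper: map a raw role value (0 to 128) to its band. */
static enum role_band role_band(unsigned int role)
{
	if (role == 0)
		return BAND_UNUSED;
	if (role <= 20)
		return BAND_METADATA_ONLY;
	if (role <= 40)
		return BAND_METADATA_PREFERRED;
	if (role <= 80)
		return BAND_NONE;
	if (role <= 100)
		return BAND_DATA_PREFERRED;
	if (role <= 128)
		return BAND_DATA_ONLY;
	return BAND_INVALID;
}
```

Each predefined BTRFS_DEVICE_ROLE_* value sits at or near the top of its
band, which appears to leave room for finer-grained values later.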

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 kernel-shared/volumes.c | 21 +++++++++++++++++++++
 kernel-shared/volumes.h | 24 ++++++++++++++++++++++++
 2 files changed, 45 insertions(+)

diff --git a/kernel-shared/volumes.c b/kernel-shared/volumes.c
index be01bdb4d3f6..b5b1a53a4a90 100644
--- a/kernel-shared/volumes.c
+++ b/kernel-shared/volumes.c
@@ -1258,6 +1258,27 @@ out:
 	return ret;
 }
 
+int parse_device_role(char *str, enum btrfs_device_roles *role)
+{
+	if (strncmp(str, "m", strlen(str)) == 0 ||
+	    strncmp(str, "metadata", strlen(str)) == 0) {
+		*role = BTRFS_DEVICE_ROLE_METADATA;
+	} else if (strncmp(str, "d", strlen(str)) == 0 ||
+	    strncmp(str, "data", strlen(str)) == 0) {
+		*role = BTRFS_DEVICE_ROLE_DATA;
+	} else if (strncmp(str, "monly", strlen(str)) == 0 ||
+	    strncmp(str, "metadata-only", strlen(str)) == 0) {
+		*role = BTRFS_DEVICE_ROLE_METADATA_ONLY;
+	} else if (strncmp(str, "donly", strlen(str)) == 0 ||
+	    strncmp(str, "data-only", strlen(str)) == 0) {
+		*role = BTRFS_DEVICE_ROLE_DATA_ONLY;
+	} else {
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
 			   struct btrfs_chunk *chunk, int item_size)
 {
diff --git a/kernel-shared/volumes.h b/kernel-shared/volumes.h
index 74fccd147d82..2bb299eead8c 100644
--- a/kernel-shared/volumes.h
+++ b/kernel-shared/volumes.h
@@ -33,6 +33,30 @@ struct extent_buffer;
 #define BTRFS_STRIPE_LEN	SZ_64K
 #define BTRFS_STRIPE_LEN_SHIFT	(16)
 
+#define BTRFS_DEVICE_ROLE_MASK	0xff
+/*
+ * device_role value and how it will be used.
+ * 	      0: Unused
+ *	   1-20: Metadata only
+ *	  21-40: Metadata preferred
+ *	  41-80: Anything|None
+ *	 81-100: Data preferred
+ *	101-128: Data only
+ * Declare some predefined easy to use device_bg_type values
+ */
+enum btrfs_device_roles {
+	BTRFS_DEVICE_ROLE_METADATA_ONLY = 20,
+	BTRFS_DEVICE_ROLE_METADATA      = 40,
+	BTRFS_DEVICE_ROLE_NONE          = 80,
+	BTRFS_DEVICE_ROLE_DATA          = 100,
+	BTRFS_DEVICE_ROLE_DATA_ONLY     = 120,
+};
+
+/* Device role value range (0 to 128) */
+#define BTRFS_DEVICE_ROLE_MAX 128
+
+int parse_device_role(char *str, enum btrfs_device_roles *role);
+
 struct btrfs_device {
 	struct list_head dev_list;
 	struct btrfs_root *dev_root;
-- 
2.49.0



* [PATCH 09/14] btrfs-progs: mkfs: introduce device roles in device paths
  2025-05-12 18:09 ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation Anand Jain
                     ` (7 preceding siblings ...)
  2025-05-12 18:09   ` [PATCH 08/14] btrfs-progs: import device role handling from the kernel Anand Jain
@ 2025-05-12 18:09   ` Anand Jain
  2025-05-12 18:09   ` [PATCH 10/14] btrfs-progs: sort devices by role before using them Anand Jain
                     ` (5 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:09 UTC (permalink / raw)
  To: linux-btrfs

Users can append :m, :d, :monly, :donly, or :none to device paths in
mkfs.btrfs to specify device roles for metadata and data.

   :m (:metadata): Preferred for metadata, can use other bg type if space
	is low.
   :d (:data): Preferred for data, can use other bg type if space is low.
   :monly (:metadata-only): Exclusive for metadata.
   :donly (:data-only): Exclusive for data.
   :none: No preference; used if preferred devices are full.

Examples:

   mkfs.btrfs /dev/sda:m /dev/sdb:d

	/dev/sda prefers metadata, /dev/sdb prefers data.

   mkfs.btrfs /dev/nvme0n1:monly /dev/sdb:donly

	/dev/nvme0n1 only metadata, /dev/sdb only data.

   mkfs.btrfs /dev/sdc:m /dev/sdd:d /dev/sde

	/dev/sdc prefers metadata, /dev/sdd prefers data,
	/dev/sde has no preference.
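
As a rough sketch of how the roles affect the per-type device counts that
are later used to pick default profiles and to check RAID constraints
(standalone code, not taken from the patch; count_role_devices() and
enum role are hypothetical names, the numeric values are those of
enum btrfs_device_roles):

```c
#include <stddef.h>

/* Role values as defined in the series; 0 means no role was given. */
enum role {
	ROLE_UNSET         = 0,
	ROLE_METADATA_ONLY = 20,
	ROLE_METADATA      = 40,
	ROLE_NONE          = 80,
	ROLE_DATA          = 100,
	ROLE_DATA_ONLY     = 120,
};

/*
 * Hypothetical helper: a device counts for metadata unless it is
 * data-only, and counts for data unless it is metadata-only.
 */
static void count_role_devices(const enum role *roles, size_t n,
			       int *metadata_devs, int *data_devs)
{
	*metadata_devs = 0;
	*data_devs = 0;

	for (size_t i = 0; i < n; i++) {
		if (roles[i] != ROLE_DATA_ONLY)
			(*metadata_devs)++;
		if (roles[i] != ROLE_METADATA_ONLY)
			(*data_devs)++;
	}
}
```

With the second example above (:monly plus :donly), each type sees only
one device, so the single-device default profiles would apply even though
two devices were given.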

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 mkfs/main.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 52 insertions(+), 6 deletions(-)

diff --git a/mkfs/main.c b/mkfs/main.c
index 0dbc09339f24..0bf6938f9026 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -403,9 +403,15 @@ static int create_raid_groups(struct btrfs_trans_handle *trans,
 }
 
 static const char * const mkfs_usage[] = {
-	"mkfs.btrfs [options] <dev> [<dev...>]",
+	"mkfs.btrfs [options] <dev[:profile]> [<dev[:profile]...>]",
 	"Create a BTRFS filesystem on a device or multiple devices",
 	"",
+	"Device-specific roles or profiles:",
+	OPTLINE(":m|:metadata", "Preferred for metadata block-group allocations"),
+	OPTLINE(":d|:data", "Preferred for data block-group allocations"),
+	OPTLINE(":monly|:metadata-only", "Must be used for metadata block-group allocations only"),
+	OPTLINE(":donly|:data-only", "Must be used for data block-group allocations only"),
+	"",
 	"Allocation profiles:",
 	OPTLINE("-d|--data PROFILE", "data profile, raid0, raid1, raid1c3, raid1c4, raid5, raid6, raid10, dup or single"),
 	OPTLINE("-m|--metadata PROFILE", "metadata profile, values like for data profile"),
@@ -1002,12 +1008,14 @@ static int setup_raid_stripe_tree_root(struct btrfs_fs_info *fs_info)
 struct device_arg {
 	struct list_head list;
 	char path[PATH_MAX];
+	enum btrfs_device_roles role;
 };
 
 static struct device_arg *parse_device_arg(const char *path,
 					    struct list_head *devices)
 {
 	struct device_arg *device;
+	char *colon;
 
 	device = calloc(1, sizeof(struct device_arg));
 	if (!device) {
@@ -1015,6 +1023,7 @@ static struct device_arg *parse_device_arg(const char *path,
 		return ERR_PTR(-ENOMEM);
 	}
 
+	/* Copy path and type (separated by ':'), then replace ':' with null. */
 	if (arg_copy_path(device->path, path, sizeof(device->path))) {
 		error("Device path '%s' length '%ld' is too long",
 		      path, strlen(path));
@@ -1022,6 +1031,17 @@ static struct device_arg *parse_device_arg(const char *path,
 		return ERR_PTR(-EINVAL);
 	}
 
+	colon = strstr(path, ":");
+	if (colon) {
+		device->path[colon - path] = '\0';
+		if (parse_device_role(colon + 1, &device->role)) {
+			error("Invalid device profile");
+			return ERR_PTR(-EINVAL);
+		}
+	} else {
+		device->role = 0;
+	}
+
 	list_add_tail(&device->list, devices);
 
 	return device;
@@ -1184,6 +1204,8 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	u64 min_dev_size;
 	u64 shrink_size;
 	int device_count = 0;
+	int metadata_device_count = 0;
+	int data_device_count = 0;
 	pthread_t *t_prepare = NULL;
 	struct prepare_device_progress *prepare_ctx = NULL;
 	struct mkfs_allocation allocation = { 0 };
@@ -1561,6 +1583,28 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 			goto error;
 	}
 
+	list_for_each_entry(arg_device, &arg_devices, list) {
+		enum btrfs_device_roles role = arg_device->role;
+
+		if (role == BTRFS_DEVICE_ROLE_NONE ||
+		    role == 0 ||
+		    role == BTRFS_DEVICE_ROLE_METADATA ||
+		    role == BTRFS_DEVICE_ROLE_DATA) {
+			metadata_device_count++;
+			data_device_count++;
+		} else if (role == BTRFS_DEVICE_ROLE_METADATA_ONLY) {
+			metadata_device_count++;
+		} else if (role == BTRFS_DEVICE_ROLE_DATA_ONLY) {
+			data_device_count++;
+		}
+
+		if (mixed && role != BTRFS_DEVICE_ROLE_NONE && role != 0) {
+			error("Mixed mode can't put metadata and data to separate devices");
+			ret = 1;
+			goto error;
+		}
+	}
+
 	arg_device = list_first_entry(&arg_devices, struct device_arg, list);
 	file = arg_device->path;
 	ssd = device_get_rotational(file);
@@ -1584,14 +1628,14 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 		u64 tmp;
 
 		if (!metadata_profile_set) {
-			if (device_count > 1)
+			if (metadata_device_count > 1)
 				tmp = BTRFS_MKFS_DEFAULT_META_MULTI_DEVICE;
 			else
 				tmp = BTRFS_MKFS_DEFAULT_META_ONE_DEVICE;
 			metadata_profile = tmp;
 		}
 		if (!data_profile_set) {
-			if (device_count > 1)
+			if (data_device_count > 1)
 				tmp = BTRFS_MKFS_DEFAULT_DATA_MULTI_DEVICE;
 			else
 				tmp = BTRFS_MKFS_DEFAULT_DATA_ONE_DEVICE;
@@ -1774,15 +1818,17 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 			goto error;
 		}
 	}
-	ret = test_num_disk_vs_raid(metadata_profile, device_count, mixed, ssd);
+	ret = test_num_disk_vs_raid(metadata_profile, metadata_device_count,
+				    mixed, ssd);
 	if (ret)
 		goto error;
 
-	ret = test_num_disk_vs_raid(data_profile, device_count, mixed, ssd);
+	ret = test_num_disk_vs_raid(data_profile, data_device_count, mixed,
+				    ssd);
 	if (ret)
 		goto error;
 
-	if (opt_zoned && device_count) {
+	if (opt_zoned && data_device_count) {
 		switch (data_profile & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
 		case BTRFS_BLOCK_GROUP_DUP:
 		case BTRFS_BLOCK_GROUP_RAID1:
-- 
2.49.0



* [PATCH 10/14] btrfs-progs: sort devices by role before using them
  2025-05-12 18:09 ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation Anand Jain
                     ` (8 preceding siblings ...)
  2025-05-12 18:09   ` [PATCH 09/14] btrfs-progs: mkfs: introduce device roles in device paths Anand Jain
@ 2025-05-12 18:09   ` Anand Jain
  2025-05-12 18:09   ` [PATCH 11/14] btrfs-progs: helper for the device role within dev_item::type Anand Jain
                     ` (4 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:09 UTC (permalink / raw)
  To: linux-btrfs

Sort the devices by role before allocating chunks on them. The role
values are ordered so that an ascending sort gives the preferred device
order for metadata, and a descending sort gives the preferred device
order for data.
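
To illustrate the two orderings, here is a standalone sketch that sorts a
plain array of role values with qsort() instead of list_sort(). The names
sort_roles(), cmp_role() and normalize_role() are hypothetical; the rule
that 0 is treated as BTRFS_DEVICE_ROLE_NONE (80) follows the patch.

```c
#include <stdbool.h>
#include <stdlib.h>

/* qsort() has no context argument, so the direction lives in a static. */
static bool sort_for_data;

/* A role of 0 (unset) is compared as if it were "none" (80). */
static int normalize_role(int role)
{
	return role == 0 ? 80 : role;
}

static int cmp_role(const void *pa, const void *pb)
{
	int a = normalize_role(*(const int *)pa);
	int b = normalize_role(*(const int *)pb);
	int c = (a > b) - (a < b);	/* ascending by role value */

	/* Metadata wants the lowest role first, data the highest. */
	return sort_for_data ? -c : c;
}

/* Hypothetical helper: order roles for a metadata or data allocation. */
static void sort_roles(int *roles, size_t n, bool for_data)
{
	sort_for_data = for_data;
	qsort(roles, n, sizeof(roles[0]), cmp_role);
}
```

For metadata the array ends up as metadata-only, metadata, none, data,
data-only; for data it is the exact reverse.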

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 common/device-utils.c   | 34 ++++++++++++++++++++
 common/device-utils.h   |  1 +
 kernel-shared/volumes.c |  5 +++
 kernel-shared/volumes.h |  2 ++
 mkfs/main.c             | 69 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 111 insertions(+)

diff --git a/common/device-utils.c b/common/device-utils.c
index 783d79555446..84883e551899 100644
--- a/common/device-utils.c
+++ b/common/device-utils.c
@@ -35,6 +35,7 @@
 #include "kernel-shared/disk-io.h"
 #include "kernel-shared/ctree.h"
 #include "kernel-shared/zoned.h"
+#include "kernel-shared/volumes.h"
 #include "kernel-shared/uapi/btrfs.h"
 #include "kernel-shared/uapi/btrfs_tree.h"
 #include "common/device-utils.h"
@@ -653,3 +654,36 @@ int cmp_device_id(void *priv, struct list_head *a, struct list_head *b)
 	return da->devid < db->devid ? -1 :
 		da->devid > db->devid ? 1 : 0;
 }
+
+int btrfs_cmp_role(enum btrfs_device_roles a, enum btrfs_device_roles b,
+		   bool assend)
+{
+	if (a == 0)
+		a = BTRFS_DEVICE_ROLE_NONE;
+
+	if (b == 0)
+		b = BTRFS_DEVICE_ROLE_NONE;
+
+	if (assend)
+		return a > b ? -1 : a < b ? 1 : 0;
+	else
+		return a < b ? -1 : a > b ? 1 : 0;
+}
+
+/*
+ * Sort or reverse sort device list for metadata or data.
+ */
+int cmp_device_role(void *type, struct list_head *a, struct list_head *b)
+{
+	const struct btrfs_device *da = list_entry(a, struct btrfs_device,
+						   dev_list);
+	const struct btrfs_device *db = list_entry(b, struct btrfs_device,
+						   dev_list);
+	u64 *profile = type;
+	enum btrfs_device_roles role_a = da->type;
+	enum btrfs_device_roles role_b = db->type;
+	bool assend = ((*profile & BTRFS_BLOCK_GROUP_TYPE_MASK) ==
+			BTRFS_BLOCK_GROUP_DATA);
+
+	return btrfs_cmp_role(role_a, role_b, assend);
+}
diff --git a/common/device-utils.h b/common/device-utils.h
index cef9405f3a9a..4c832047756d 100644
--- a/common/device-utils.h
+++ b/common/device-utils.h
@@ -59,6 +59,7 @@ ssize_t btrfs_direct_pread(int fd, void *buf, size_t count, off_t offset);
 ssize_t btrfs_direct_pwrite(int fd, const void *buf, size_t count, off_t offset);
 
 int cmp_device_id(void *priv, struct list_head *a, struct list_head *b);
+int cmp_device_role(void *type, struct list_head *a, struct list_head *b);
 
 #ifdef BTRFS_ZONED
 static inline ssize_t btrfs_pwrite(int fd, const void *buf, size_t count,
diff --git a/kernel-shared/volumes.c b/kernel-shared/volumes.c
index b5b1a53a4a90..e70f2bb9bc89 100644
--- a/kernel-shared/volumes.c
+++ b/kernel-shared/volumes.c
@@ -26,6 +26,7 @@
 #include <stddef.h>
 #include <string.h>
 #include "kernel-lib/raid56.h"
+#include "kernel-lib/list_sort.h"
 #include "kernel-shared/ctree.h"
 #include "kernel-shared/disk-io.h"
 #include "kernel-shared/transaction.h"
@@ -1726,6 +1727,10 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	/* start and num_bytes will be set by create_chunk() */
 	ctl.start = 0;
 	ctl.num_bytes = 0;
+
+	/* Sort devices according to device block-group type preference. */
+	list_sort(&type, devs, cmp_device_role);
+
 	init_alloc_chunk_ctl(info, &ctl);
 	if (ctl.num_stripes < ctl.min_stripes)
 		return -ENOSPC;
diff --git a/kernel-shared/volumes.h b/kernel-shared/volumes.h
index 2bb299eead8c..bea772a9681b 100644
--- a/kernel-shared/volumes.h
+++ b/kernel-shared/volumes.h
@@ -351,5 +351,7 @@ int btrfs_bg_type_to_nparity(u64 flags);
 int btrfs_bg_type_to_sub_stripes(u64 flags);
 u64 btrfs_bg_flags_for_device_num(int number);
 bool btrfs_bg_type_is_stripey(u64 flags);
+int btrfs_cmp_role(enum btrfs_device_roles a, enum btrfs_device_roles b,
+		   bool assend);
 
 #endif
diff --git a/mkfs/main.c b/mkfs/main.c
index 0bf6938f9026..2101e63c80e6 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -1047,6 +1047,62 @@ static struct device_arg *parse_device_arg(const char *path,
 	return device;
 }
 
+static int btrfs_device_update_role(struct btrfs_fs_info *fs_info,
+				    struct list_head *devices)
+{
+	struct device_arg *arg_device;
+	struct btrfs_device *device;
+
+	list_for_each_entry(device, &fs_info->fs_devices->devices,
+			    dev_list) {
+		bool found = false;
+
+		list_for_each_entry(arg_device, devices, list) {
+			if (strncmp(arg_device->path, device->name,
+				    strlen(device->name)) == 0) {
+				device->bg_type = arg_device->bg_type;
+				found = true;
+				break;
+			}
+		}
+		/*
+		 * This may fail if the device scan detects the mapper path
+		 * while the argument specifies its DM path. Use MAJ:MIN?
+		 * However, get an example first.
+		 */
+		if (!found) {
+			error("Device not found in the arg '%s'", device->name);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static int cmp_device_arg_role(void *type, struct list_head *a,
+			       struct list_head *b)
+{
+	const struct device_arg *da = list_entry(a, struct device_arg, list);
+	const struct device_arg *db = list_entry(b, struct device_arg, list);
+	u64 *profile = type;
+	enum btrfs_device_roles role_a = da->role;
+	enum btrfs_device_roles role_b = db->role;
+	bool assend;
+
+	assend = ((*profile & BTRFS_BLOCK_GROUP_TYPE_MASK) ==
+		  BTRFS_BLOCK_GROUP_DATA);
+
+	if (role_a == 0)
+		role_a = BTRFS_DEVICE_ROLE_NONE;
+
+	if (role_b == 0)
+		role_b = BTRFS_DEVICE_ROLE_NONE;
+
+	if (assend)
+		return role_a > role_b ? -1 : role_a < role_b ? 1 : 0;
+	else
+		return role_a < role_b ? -1 : role_a > role_b ? 1 : 0;
+}
+
 /* Thread callback for device preparation */
 static void *prepare_one_device(void *ctx)
 {
@@ -1237,6 +1293,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	LIST_HEAD(subvols);
 	struct device_arg *arg_device;
 	LIST_HEAD(arg_devices);
+	u64 bg_metadata;
 
 	cpu_detect_flags();
 	hash_init_accel();
@@ -1605,6 +1662,13 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 		}
 	}
 
+	/*
+	 * Make sure devices marked as 'metadata preferred' end up at the top,
+	 * so that the first one becomes our bootstrap device.
+	 */
+	bg_metadata = BTRFS_BLOCK_GROUP_METADATA;
+	list_sort(&bg_metadata, &arg_devices, cmp_device_arg_role);
+
 	arg_device = list_first_entry(&arg_devices, struct device_arg, list);
 	file = arg_device->path;
 	ssd = device_get_rotational(file);
@@ -2064,6 +2128,11 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	if (opt_zoned)
 		btrfs_get_dev_zone_info_all_devices(fs_info);
 
+	if (btrfs_device_update_role(fs_info, &arg_devices)) {
+		ret = 1;
+		goto error;
+	}
+
 raid_groups:
 	ret = create_raid_groups(trans, root, data_profile,
 			 metadata_profile, mixed, &allocation);
-- 
2.49.0



* [PATCH 11/14] btrfs-progs: helper for the device role within dev_item::type
  2025-05-12 18:09 ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation Anand Jain
                     ` (9 preceding siblings ...)
  2025-05-12 18:09   ` [PATCH 10/14] btrfs-progs: sort devices by role before using them Anand Jain
@ 2025-05-12 18:09   ` Anand Jain
  2025-05-12 18:09   ` [PATCH 12/14] btrfs-progs: mkfs: persist device roles to dev_item::type Anand Jain
                     ` (3 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:09 UTC (permalink / raw)
  To: linux-btrfs

The device role is encoded in the lower 8 bits of the dev_item::type
field (BTRFS_DEVICE_ROLE_MASK is 0xff). This commit adds helper functions
for reading and writing these bits.
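
A standalone sketch of the encoding, assuming the role occupies the bits
selected by BTRFS_DEVICE_ROLE_MASK (0xff) and ignoring endianness
conversion. The names type_get_role() and type_set_role() and the local
ROLE_* macros are hypothetical stand-ins for the helpers in this patch.

```c
#include <stdint.h>

#define ROLE_MASK 0xffULL	/* mirrors BTRFS_DEVICE_ROLE_MASK */
#define ROLE_NONE 80		/* mirrors BTRFS_DEVICE_ROLE_NONE */
#define ROLE_MAX  128		/* mirrors BTRFS_DEVICE_ROLE_MAX */

/* Read the role; a stored 0 is reported as "none". */
static uint8_t type_get_role(uint64_t type)
{
	uint64_t raw = type & ROLE_MASK;

	return raw == 0 ? ROLE_NONE : (uint8_t)raw;
}

/* Write the role, leaving the upper bits of type untouched. */
static int type_set_role(uint64_t *type, uint8_t role)
{
	if (role > ROLE_MAX)
		return -1;

	/* "none" is stored as 0 so that the value matches an unset role. */
	if (role == ROLE_NONE)
		role = 0;

	*type = (*type & ~ROLE_MASK) | role;
	return 0;
}
```

Storing "none" as zero means a filesystem created without any roles keeps
dev_item::type at 0, the value written before this series.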

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 mkfs/main.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/mkfs/main.c b/mkfs/main.c
index 2101e63c80e6..8248d8c4a287 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -1047,6 +1047,44 @@ static struct device_arg *parse_device_arg(const char *path,
 	return device;
 }
 
+static inline u8 device_role(struct btrfs_device *device)
+{
+	u64 type = le64_to_cpu(device->type);
+
+	/*
+	 * The on-disk value `0` for `dev_item::type:8` maps to
+	 * `BTRFS_DEVICE_ROLE_NONE` in memory, which is defined as `80`.
+	 */
+	if ((type & BTRFS_DEVICE_ROLE_MASK) == 0)
+		return BTRFS_DEVICE_ROLE_NONE;
+	else
+		return (u8)(type & BTRFS_DEVICE_ROLE_MASK);
+}
+
+static inline int set_device_role(struct btrfs_device *device, u8 value)
+{
+	u64 type;
+
+	if (value > BTRFS_DEVICE_ROLE_MAX)
+		return -EINVAL;
+
+	type = le64_to_cpu(device->type);
+
+	/*
+	 * If roles aren't being set, we keep the on-disk value as zero so that
+	 * the on-disk format remains compatible with older kernels.
+	 */
+	if (value == BTRFS_DEVICE_ROLE_NONE)
+		value = 0;
+
+	type = (type & ~BTRFS_DEVICE_ROLE_MASK) | \
+		(value & BTRFS_DEVICE_ROLE_MASK);
+
+	device->type = cpu_to_le64(type);
+
+	return 0;
+}
+
 static int btrfs_device_update_role(struct btrfs_fs_info *fs_info,
 				    struct list_head *devices)
 {
-- 
2.49.0



* [PATCH 12/14] btrfs-progs: mkfs: persist device roles to dev_item::type
  2025-05-12 18:09 ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation Anand Jain
                     ` (10 preceding siblings ...)
  2025-05-12 18:09   ` [PATCH 11/14] btrfs-progs: helper for the device role within dev_item::type Anand Jain
@ 2025-05-12 18:09   ` Anand Jain
  2025-05-12 18:09   ` [PATCH 13/14] btrfs-progs: update device add ioctl with device type Anand Jain
                     ` (2 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:09 UTC (permalink / raw)
  To: linux-btrfs

The dev_item::type field, currently unused, will now store the newly
introduced device roles. This commit propagates the device roles specified
at filesystem creation and writes them to the lower 8 bits of
dev_item::type on disk, so that the roles persist.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 common/device-scan.c |  4 ++--
 common/device-scan.h |  2 +-
 mkfs/common.c        |  2 +-
 mkfs/common.h        |  3 +++
 mkfs/main.c          | 12 +++++++++---
 5 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/common/device-scan.c b/common/device-scan.c
index 7d7d67fb5b71..b1f1475303d8 100644
--- a/common/device-scan.c
+++ b/common/device-scan.c
@@ -129,7 +129,7 @@ int test_uuid_unique(const char *uuid_str)
 int btrfs_add_to_fsid(struct btrfs_trans_handle *trans,
 		      struct btrfs_root *root, int fd, const char *path,
 		      u64 device_total_bytes, u32 io_width, u32 io_align,
-		      u32 sectorsize)
+		      u32 sectorsize, u64 type)
 {
 	struct btrfs_super_block *disk_super;
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -160,7 +160,7 @@ int btrfs_add_to_fsid(struct btrfs_trans_handle *trans,
 	uuid_generate(device->uuid);
 	device->fs_info = fs_info;
 	device->devid = 0;
-	device->type = 0;
+	device->type = type;
 	device->io_width = io_width;
 	device->io_align = io_align;
 	device->sector_size = sectorsize;
diff --git a/common/device-scan.h b/common/device-scan.h
index b154e8c860b2..d8326d7353ba 100644
--- a/common/device-scan.h
+++ b/common/device-scan.h
@@ -52,7 +52,7 @@ int btrfs_register_all_devices(void);
 int btrfs_add_to_fsid(struct btrfs_trans_handle *trans,
 		      struct btrfs_root *root, int fd, const char *path,
 		      u64 device_total_bytes, u32 io_width, u32 io_align,
-		      u32 sectorsize);
+		      u32 sectorsize, u64 type);
 int btrfs_device_already_in_root(struct btrfs_root *root, int fd,
 				 int super_offset);
 int is_seen_fsid(u8 *fsid, struct seen_fsid *seen_fsid_hash[]);
diff --git a/mkfs/common.c b/mkfs/common.c
index bb5a2ad46f4f..05e4eb4e1c4e 100644
--- a/mkfs/common.c
+++ b/mkfs/common.c
@@ -604,7 +604,7 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	btrfs_set_device_io_align(buf, dev_item, cfg->sectorsize);
 	btrfs_set_device_io_width(buf, dev_item, cfg->sectorsize);
 	btrfs_set_device_sector_size(buf, dev_item, cfg->sectorsize);
-	btrfs_set_device_type(buf, dev_item, 0);
+	btrfs_set_device_type(buf, dev_item, cfg->dev_bg_type);
 
 	write_extent_buffer(buf, super.dev_item.uuid,
 			    (unsigned long)btrfs_device_uuid(dev_item),
diff --git a/mkfs/common.h b/mkfs/common.h
index de0e413774a4..ba1c78d9ea03 100644
--- a/mkfs/common.h
+++ b/mkfs/common.h
@@ -100,6 +100,9 @@ struct btrfs_mkfs_config {
 
 	/* Superblock offset after make_btrfs */
 	u64 super_bytenr;
+
+	/* dev_item::type value */
+	u64 dev_bg_type;
 };
 
 int make_btrfs(int fd, struct btrfs_mkfs_config *cfg);
diff --git a/mkfs/main.c b/mkfs/main.c
index 8248d8c4a287..e069d69e3304 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -82,6 +82,7 @@ struct prepare_device_progress {
 	u64 dev_byte_count;
 	u64 byte_count;
 	int ret;
+	struct device_arg *device_arg;
 };
 
 static int create_metadata_block_groups(struct btrfs_root *root, bool mixed,
@@ -1098,7 +1099,8 @@ static int btrfs_device_update_role(struct btrfs_fs_info *fs_info,
 		list_for_each_entry(arg_device, devices, list) {
 			if (strncmp(arg_device->path, device->name,
 				    strlen(device->name)) == 0) {
-				device->bg_type = arg_device->bg_type;
+				if (set_device_role(device, arg_device->role))
+					return -EINVAL;
 				found = true;
 				break;
 			}
@@ -1332,6 +1334,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	struct device_arg *arg_device;
 	LIST_HEAD(arg_devices);
 	u64 bg_metadata;
+	const int mkfs_dev_index = 0;
 
 	cpu_detect_flags();
 	hash_init_accel();
@@ -1987,6 +1990,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 		prepare_ctx[i].file = arg_device->path;
 		prepare_ctx[i].byte_count = byte_count;
 		prepare_ctx[i].dev_byte_count = byte_count;
+		prepare_ctx[i].device_arg = arg_device;
 		ret = pthread_create(&t_prepare[i], NULL, prepare_one_device,
 				     &prepare_ctx[i]);
 		if (ret) {
@@ -2044,7 +2048,8 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	else
 		mkfs_cfg.zone_size = 0;
 
-	ret = make_btrfs(prepare_ctx[0].fd, &mkfs_cfg);
+	mkfs_cfg.dev_bg_type = prepare_ctx[mkfs_dev_index].device_arg->role;
+	ret = make_btrfs(prepare_ctx[mkfs_dev_index].fd, &mkfs_cfg);
 	if (ret) {
 		errno = -ret;
 		error("error during mkfs: %m");
@@ -2147,7 +2152,8 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 
 		ret = btrfs_add_to_fsid(trans, root, prepare_ctx[i].fd,
 					prepare_ctx[i].file, dev_byte_count,
-					sectorsize, sectorsize, sectorsize);
+					sectorsize, sectorsize, sectorsize,
+					prepare_ctx[i].device_arg->role);
 		if (ret) {
 			errno = -ret;
 			error("unable to add %s to filesystem: %m",
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 13/14] btrfs-progs: update device add ioctl with device type
  2025-05-12 18:09 ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation Anand Jain
                     ` (11 preceding siblings ...)
  2025-05-12 18:09   ` [PATCH 12/14] btrfs-progs: mkfs: persist device roles to dev_item::type Anand Jain
@ 2025-05-12 18:09   ` Anand Jain
  2025-05-12 18:09   ` [PATCH 14/14] btrfs-progs: disable exclusive metadata/data device roles Anand Jain
  2025-06-20 16:46   ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation David Sterba
  14 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:09 UTC (permalink / raw)
  To: linux-btrfs

When a device is added to a mounted filesystem, accept an optional suffix
that sets the device role, similar to mkfs.btrfs:

  btrfs device add /dev/sda[:m|:d|:monly|:donly] <path>

btrfs-progs: update device add ioctl to support device roles

The btrfs device add command now accepts an optional suffix to specify
device roles, mirroring the functionality introduced in mkfs.btrfs. This
allows users to define the role of a device (metadata preferred, data
preferred, metadata only, or data only) at the time of adding it to a
mounted filesystem.

Example:

  btrfs device add /dev/sdb:d /mnt/btrfs

	This command adds /dev/sdb to the mounted filesystem at /mnt/btrfs
	and designates its role as preferred for data. The supported
	suffixes are :m, :d, :monly, and :donly.
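
The suffix handling described above can be sketched as follows. This is a
minimal illustration, not the code from this patch: the enum values and the
helper name split_path_and_role() are hypothetical, and splitting at the
first ':' is an assumption that mirrors the strstr() call in the diff below.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical role values for this sketch; not the btrfs-progs enum. */
enum suffix_role {
	SUFFIX_ROLE_NONE = 0,
	SUFFIX_ROLE_METADATA_ONLY,
	SUFFIX_ROLE_METADATA,
	SUFFIX_ROLE_DATA,
	SUFFIX_ROLE_DATA_ONLY,
};

/*
 * Split "path[:suffix]" at the first ':'. Copies the path part into @path
 * and stores the parsed role in @role. Returns 0 on success, -ENAMETOOLONG
 * if the path does not fit, or -EINVAL for an empty path or unknown suffix.
 */
static int split_path_and_role(const char *arg, char *path, size_t path_size,
			       enum suffix_role *role)
{
	const char *colon;
	size_t len;

	assert(arg && path && role);

	colon = strchr(arg, ':');
	len = colon ? (size_t)(colon - arg) : strlen(arg);
	if (len == 0)
		return -EINVAL;
	if (len >= path_size)
		return -ENAMETOOLONG;

	memcpy(path, arg, len);
	path[len] = '\0';

	/* No suffix means no explicit role. */
	if (!colon) {
		*role = SUFFIX_ROLE_NONE;
		return 0;
	}

	colon++;
	if (strcmp(colon, "m") == 0)
		*role = SUFFIX_ROLE_METADATA;
	else if (strcmp(colon, "d") == 0)
		*role = SUFFIX_ROLE_DATA;
	else if (strcmp(colon, "monly") == 0)
		*role = SUFFIX_ROLE_METADATA_ONLY;
	else if (strcmp(colon, "donly") == 0)
		*role = SUFFIX_ROLE_DATA_ONLY;
	else
		return -EINVAL;

	return 0;
}
```

With this sketch, "/dev/sdb:d" yields the path "/dev/sdb" and the data
preferred role, while a bare "/dev/sdc" keeps the role at none.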

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 cmds/device.c | 57 +++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 46 insertions(+), 11 deletions(-)

diff --git a/cmds/device.c b/cmds/device.c
index 3e45a72094e6..af2741353e90 100644
--- a/cmds/device.c
+++ b/cmds/device.c
@@ -122,34 +122,62 @@ static int cmd_device_add(const struct cmd_struct *cmd,
 	}
 	zoned = (feature_flags.incompat_flags & BTRFS_FEATURE_INCOMPAT_ZONED);
 
-	for (i = optind; i < last_dev; i++){
+	for (i = optind; i < last_dev; i++) {
 		struct btrfs_ioctl_vol_args ioctl_args;
 		int	devfd, res;
 		u64 dev_block_count = 0;
+		char temp_argv[PATH_MAX];
+		char final_argv[PATH_MAX];
 		char *path;
+		char *colon;
+		enum btrfs_device_roles role;
 
-		if (!zoned && zoned_model(argv[i]) == ZONED_HOST_MANAGED) {
+		/*
+		 * Copy path and role (separated by ':'), then replace ':'
+		 * with null.
+		 */
+		if (arg_copy_path(temp_argv, argv[i], PATH_MAX)) {
+			error("Device path '%s' length '%zu' is too long",
+			      argv[i], strlen(argv[i]));
+			ret++;
+			continue;
+		}
+
+		colon = strstr(temp_argv, ":");
+		if (colon) {
+			*colon = '\0';
+			colon++;
+			if (parse_device_role(colon, &role)) {
+				error("Invalid device role");
+				ret++;
+				continue;
+			}
+		} else {
+			role = 0;
+		}
+
+		if (!zoned && zoned_model(temp_argv) == ZONED_HOST_MANAGED) {
 			error(
 "zoned: cannot add host-managed zoned device to non-zoned filesystem '%s'",
-			      argv[i]);
+			      temp_argv);
 			ret++;
 			continue;
 		}
 
-		res = test_dev_for_mkfs(argv[i], force);
+		res = test_dev_for_mkfs(temp_argv, force);
 		if (res) {
 			ret++;
 			continue;
 		}
 
-		devfd = open(argv[i], O_RDWR);
+		devfd = open(temp_argv, O_RDWR);
 		if (devfd < 0) {
-			error("unable to open device '%s'", argv[i]);
+			error("unable to open device '%s'", temp_argv);
 			ret++;
 			continue;
 		}
 
-		res = btrfs_prepare_device(devfd, argv[i], &dev_block_count, 0,
+		res = btrfs_prepare_device(devfd, temp_argv, &dev_block_count, 0,
 				PREP_DEVICE_ZERO_END | PREP_DEVICE_VERBOSE |
 				(discard ? PREP_DEVICE_DISCARD : 0) |
 				(zoned ? PREP_DEVICE_ZONED : 0));
@@ -159,19 +187,26 @@ static int cmd_device_add(const struct cmd_struct *cmd,
 			goto error_out;
 		}
 
-		path = path_canonicalize(argv[i]);
+		path = path_canonicalize(temp_argv);
 		if (!path) {
 			error("could not canonicalize pathname '%s': %m",
-				argv[i]);
+				temp_argv);
 			ret++;
 			goto error_out;
 		}
 
+		strncpy_null(final_argv, path, sizeof(final_argv));
+		if (colon) {
+			strncpy_null(final_argv + strlen(final_argv), ":",
+				     PATH_MAX - strlen(final_argv));
+			strncpy_null(final_argv + strlen(final_argv), colon,
+				     PATH_MAX - strlen(final_argv));
+		}
 		memset(&ioctl_args, 0, sizeof(ioctl_args));
-		strncpy_null(ioctl_args.name, path, sizeof(ioctl_args.name));
+		strncpy_null(ioctl_args.name, final_argv, sizeof(ioctl_args.name));
 		res = ioctl(fdmnt, BTRFS_IOC_ADD_DEV, &ioctl_args);
 		if (res < 0) {
-			error("error adding device '%s': %m", path);
+			error("error adding device '%s': %m", final_argv);
 			ret++;
 		}
 		free(path);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 14/14] btrfs-progs: disable exclusive metadata/data device roles
  2025-05-12 18:09 ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation Anand Jain
                     ` (12 preceding siblings ...)
  2025-05-12 18:09   ` [PATCH 13/14] btrfs-progs: update device add ioctl with device type Anand Jain
@ 2025-05-12 18:09   ` Anand Jain
  2025-06-20 16:46   ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation David Sterba
  14 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:09 UTC (permalink / raw)
  To: linux-btrfs

When exclusive device roles are set, the actual number of devices
available for allocation depends on the block group type (metadata or
data), rather than the total number of devices in the filesystem. The
total number of devices is widely used in the code; for example, `device
remove` and `mount` have to validate if they would satisfy the block group
profile based on the device roles.

These necessary changes will be implemented in the kernel, btrfs-progs,
and fstests in a subsequent update, once the current metadata-preferred
and data-preferred device roles are stable. For now, the metadata-only
and data-only roles are marked as unsupported.

However, devices with metadata-preferred and data-preferred roles are
included in the total device count. Therefore, the above changes are not
required to support these roles, and the preferred roles are ready.
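
To illustrate why exclusive roles change the device count, here is a minimal
sketch. The enum, the block group selector, and count_eligible() are
hypothetical names used only for this example; they are not the counters
used in mkfs/main.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical role values for this sketch; not the btrfs-progs enum. */
enum count_role {
	COUNT_ROLE_METADATA_ONLY,
	COUNT_ROLE_METADATA,
	COUNT_ROLE_NONE,
	COUNT_ROLE_DATA,
	COUNT_ROLE_DATA_ONLY,
};

enum count_bg {
	COUNT_BG_METADATA,
	COUNT_BG_DATA,
};

/*
 * Count the devices that may hold chunks of the given block group type.
 * Only the exclusive roles remove a device from the other type's pool;
 * preferred roles never reduce the count.
 */
static size_t count_eligible(const enum count_role *roles, size_t n,
			     enum count_bg bg)
{
	size_t count = 0;
	size_t i;

	assert(roles || n == 0);

	for (i = 0; i < n; i++) {
		if (bg == COUNT_BG_METADATA && roles[i] == COUNT_ROLE_DATA_ONLY)
			continue;
		if (bg == COUNT_BG_DATA && roles[i] == COUNT_ROLE_METADATA_ONLY)
			continue;
		count++;
	}

	return count;
}
```

With only preferred roles, both counts equal the total device count, so
profile validation based on the total stays correct. As soon as one
exclusive role is present, at least one of the two counts drops below the
total.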

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 mkfs/main.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mkfs/main.c b/mkfs/main.c
index e069d69e3304..e1b5c378b5d0 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -1703,6 +1703,14 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 		}
 	}

+	/* Keep until we support metadata-only | data-only devices */
+	if (device_count != metadata_device_count ||
+	    device_count != data_device_count) {
+		error("Metadata_only and or Data_only is not yet supported");
+		ret = 1;
+		goto error;
+	}
+
 	/*
 	 * Make sure devices marked as 'metadata preferred' end up at the top,
 	 * so that, it will be our bootstrap device.
-- 
2.49.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH RFC 0/2] fstests: btrfs: add functional verification for device roles
  2025-05-12 18:07 [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Anand Jain
                   ` (10 preceding siblings ...)
  2025-05-12 18:09 ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation Anand Jain
@ 2025-05-12 18:11 ` Anand Jain
  2025-05-12 18:11   ` [PATCH 1/2] fstests: common/btrfs: add _require_btrfs_feature_device_roles Anand Jain
  2025-05-12 18:11   ` [PATCH 2/2] fstests: btrfs/366: add test for device role-based chunk allocation Anand Jain
  2025-05-20  9:19 ` [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Forza
                   ` (4 subsequent siblings)
  16 siblings, 2 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:11 UTC (permalink / raw)
  To: linux-btrfs

Add a test case to verify that Btrfs device roles are assigned
and that chunk allocations match the specified roles.

Anand Jain (2):
  fstests: common/btrfs: add _require_btrfs_feature_device_roles
  fstests: btrfs/366: add test for device role-based chunk allocation

 common/btrfs        |  12 ++
 tests/btrfs/336     | 259 ++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/336.out | 153 ++++++++++++++++++++++++++
 3 files changed, 424 insertions(+)
 create mode 100755 tests/btrfs/336
 create mode 100644 tests/btrfs/336.out

-- 
2.49.0


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 1/2] fstests: common/btrfs: add _require_btrfs_feature_device_roles
  2025-05-12 18:11 ` [PATCH RFC 0/2] fstests: btrfs: add functional verification for device roles Anand Jain
@ 2025-05-12 18:11   ` Anand Jain
  2025-05-12 18:11   ` [PATCH 2/2] fstests: btrfs/366: add test for device role-based chunk allocation Anand Jain
  1 sibling, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:11 UTC (permalink / raw)
  To: linux-btrfs

In order to test the new btrfs device role feature, check if btrfs-progs
and the kernel support it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 common/btrfs | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/common/btrfs b/common/btrfs
index 6a1095ff8934..ab75975e7711 100644
--- a/common/btrfs
+++ b/common/btrfs
@@ -1048,3 +1048,15 @@ _require_btrfs_iouring_encoded_read()
 		_notrun "btrfs io_uring encoded read failed with -EOPNOTSUPP"
 	fi
 }
+
+_require_btrfs_feature_device_roles()
+{
+	_require_btrfs_fs_sysfs
+	_require_btrfs_fs_feature device_allocation
+
+	$MKFS_BTRFS_PROG --help 2>&1 | \
+				grep -q "Device-specific roles or profiles"
+	if (($? != 0)); then
+		_notrun "Requires btrfs-progs device roles support"
+	fi
+}
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 2/2] fstests: btrfs/366: add test for device role-based chunk allocation
  2025-05-12 18:11 ` [PATCH RFC 0/2] fstests: btrfs: add functional verification for device roles Anand Jain
  2025-05-12 18:11   ` [PATCH 1/2] fstests: common/btrfs: add _require_btrfs_feature_device_roles Anand Jain
@ 2025-05-12 18:11   ` Anand Jain
  1 sibling, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-12 18:11 UTC (permalink / raw)
  To: linux-btrfs

Add a new test to verify the btrfs device-role feature.

Earlier, chunk allocation depended only on available device space.
With device roles, allocation is guided by assigned roles. This test
creates scratch devices of varying sizes, triggers relocations, and
checks if chunk placement follows the role-based policy.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 tests/btrfs/336     | 259 ++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/336.out | 153 ++++++++++++++++++++++++++
 2 files changed, 412 insertions(+)
 create mode 100755 tests/btrfs/336
 create mode 100644 tests/btrfs/336.out

diff --git a/tests/btrfs/336 b/tests/btrfs/336
new file mode 100755
index 000000000000..703e0279ebe9
--- /dev/null
+++ b/tests/btrfs/336
@@ -0,0 +1,259 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2025 Oracle.  All Rights Reserved.
+#
+# FS QA Test 336
+#
+# Verify the device role is working.
+
+. ./common/preamble
+_begin_fstest auto quick volume
+
+. ./common/sysfs
+. ./common/filter.btrfs
+
+_require_test
+_require_loop
+_require_btrfs_command inspect-internal dump-tree
+_require_btrfs_command inspect-internal list-chunks
+_require_btrfs_feature_device_roles
+_require_fs_sysfs_attr_policy $TEST_DEV device_allocation role-then-space
+
+_cleanup()
+{
+	losetup -d ${DEV[@]} > /dev/null 2>&1
+	rm -f ${IMG[@]} > /dev/null 2>&1
+}
+
+declare -a TEST_VECTORS=(
+# $m_profile:$d_profile:$monly_nr:$m_nr:$none_nr:$d_nr:$donly_nr
+"single:single:0:4:4:4:0"
+"dup:single:0:1:1:1:0"
+"raid1:single:0:2:1:2:0"
+"raid10:raid10:0:4:1:4:0"
+# Unusual config but must pass.
+"raid1:raid1:0:1:0:1:0"
+# Must fail as of now.
+"single:single:1:0:0:0:1"
+)
+
+# Check that at least 4GB of space is available.
+_require_fs_space $TEST_DIR $((4*1000*1000))
+
+# Make sure TEST_VECTORS needs no more than MAX_NDEVS scratch devices of
+# different sizes.
+MAX_NDEVS=12
+for testcase in "${TEST_VECTORS[@]}"; do
+	IFS=':' read -ra args <<< $testcase
+	ndevs=$((args[2] + args[3] + args[4] + args[5] + args[6]))
+
+	if (( MAX_NDEVS < ndevs )); then
+		_fail "'$testcase' needs more than max '$MAX_NDEVS' devs"
+	fi
+done
+
+declare -a IMG="(  )"
+declare -a DEV="(  )"
+
+# As the disk allocation depends on free space, create scratch devices with
+# different sizes.
+# Make sure there are MAX_NDEVS elements here.
+sizes=(256 768 512 1280 1024 1536 2048 1792 2304 3072 2816 3328)
+for ((i=0; i<MAX_NDEVS; i++)); do
+	size=${sizes[i]}
+	path=$TEST_DIR/$$_${i}_${size}.img
+	truncate -s ${size}M ${path} || _fail "truncate ${path}"
+
+	DEV[$i]=$(_create_loop_device ${path})
+	IMG[$i]=$path
+	echo $(stat --format=%n,%s,%i ${IMG[$i]}) ${DEV[$i]} >> $seqres.full
+done
+
+filter()
+{
+	awk '
+	{
+		for (i = 1; i <= NF; i++) {
+			is_excluded_value = 0
+			# Check the preceding field only if we are not on the
+			# first field
+			if (i > 1) {
+				if ($(i-1) == "num_stripes" || \
+				    $(i-1) == "sub_stripes" || \
+				    $(i-1) == "stripe" || $(i-1) == "devid") {
+					is_excluded_value = 1
+				}
+			}
+
+			# Check if the current field consists only of digits
+			is_numeric = ($i ~ /^[0-9]+$/)
+
+			# If it is numeric and its preceding keyword is not in
+			# the exclusion list, sanitize it
+			if (is_numeric && !is_excluded_value) {
+				$i = "X"  # Replace the number with "X"
+			}
+		}
+		print $0
+	}' "$@"
+}
+
+extract()
+{
+	awk '
+	/^[ \t]*item [0-9]+ key/ {
+		if (keep && block) { print block }
+		block = $0
+		keep = 0
+		next
+	}
+	!/^[ \t]*item [0-9]+ key/ && block {
+		block = block "\n" $0
+		if ($0 ~ /type (METADATA|DATA)/) {
+			keep = 1
+		}
+	}
+	END {
+		if (keep && block) { print block }
+	}' "$@"
+}
+
+dump_tree()
+{
+	local dev=$1
+
+	# Make sure the mkfs result has reached the disk.
+	sync
+	$BTRFS_UTIL_PROG inspect-internal dump-tree -t 3 ${dev} | \
+		grep -A2 DEV_ITEMS | grep -E 'devid|type' | \
+		perl -pe 's/(?<!devid |type )\b\d+\b/X/g'
+	$BTRFS_UTIL_PROG inspect-internal dump-tree -t 3 ${dev} | \
+		extract | grep -v 'io_align' | grep -E 'DATA|stripe' | filter
+}
+
+dump_chunks()
+{
+	# Make sure relocation chunks are synced before dumping them.
+	$XFS_IO_PROG -c sync $SCRATCH_MNT
+
+	$BTRFS_UTIL_PROG inspect-internal list-chunks --raw --sort lstart $SCRATCH_MNT >> \
+								${seqres}.full
+
+	# We don't care how many chunks there are, but we do ensure that all of
+	# them are on the correct devices.
+	$BTRFS_UTIL_PROG inspect-internal list-chunks --raw --sort lstart $SCRATCH_MNT | \
+		$AWK_PROG '{print $1" "$3}' | grep -E 'Data' | sort -u
+
+	$BTRFS_UTIL_PROG inspect-internal list-chunks --raw --sort lstart $SCRATCH_MNT | \
+		$AWK_PROG '{print $1" "$3}' | grep -E 'Metadata' | sort -u
+}
+
+verify()
+{
+	IFS=':' read -ra args <<< $1
+	local m_profile=${args[0]}
+	local d_profile=${args[1]}
+	local monly_nr=${args[2]}
+	local m_nr=${args[3]}
+	local none_nr=${args[4]}
+	local d_nr=${args[5]}
+	local donly_nr=${args[6]}
+
+	local assigned_devs_string=""
+	local ref_dev
+	local dev_idx=0 # Keeps track of indexing 'DEV' array
+	local i # Loop counter
+
+	# --- Loop to assign devices based on roles ---
+
+	# Assign devices for metadata only role (monly)
+	for ((i=0; i<monly_nr; i++)); do
+		assigned_devs_string+=" ${DEV[$dev_idx]}:monly"
+		((dev_idx++))
+	done
+
+	# Assign devices for metadata role (m)
+	for ((i=0; i<m_nr; i++)); do
+		assigned_devs_string+=" ${DEV[$dev_idx]}:m"
+		((dev_idx++))
+	done
+
+	# Assign devices for data role (d)
+	for ((i=0; i<d_nr; i++)); do
+		assigned_devs_string+=" ${DEV[$dev_idx]}:d"
+		((dev_idx++))
+	done
+
+	# Assign devices for data only role (donly)
+	for ((i=0; i<donly_nr; i++)); do
+		assigned_devs_string+=" ${DEV[$dev_idx]}:donly"
+		((dev_idx++))
+	done
+
+	# Assign devices with no specific role (none)
+	# Make sure role-none gets devs with larger size.
+	dev_idx=$MAX_NDEVS
+	for ((i=0; i<none_nr; i++)); do
+		assigned_devs_string+=" ${DEV[$dev_idx]}"
+		((dev_idx--))
+	done
+
+	# Remove potential leading space
+	assigned_devs_string="${assigned_devs_string# }"
+
+	# Print the results for verification/debugging
+	echo "mkfs opt: $m_profile $d_profile \"$assigned_devs_string\"" >> \
+								$seqres.full
+	ref_dev=$(echo $assigned_devs_string | sed 's/:.*//g')
+	echo $ref_dev >> ${seqres}.full
+
+	# --- End of assignment loop ---
+
+
+	echo -e "\nTest Vector: $1"
+
+	# Roles like metadata_only or data_only aren’t supported yet. Just make
+	# sure they fail cleanly.
+	echo $assigned_devs_string | grep -q only
+	if [[ $? == 0 ]]; then
+		_try_mkfs_dev "-q -m $m_profile -d $d_profile $assigned_devs_string"
+		return
+	else
+		_mkfs_dev "-q -m $m_profile -d $d_profile $assigned_devs_string"
+	fi
+
+	# Make sure the golden output verifies that the roles are updated in the
+	# on-disk structure.
+	dump_tree $ref_dev
+
+	# Keep data separate from metadata; use max_inline=0.
+	_mount "-o max_inline=0" $ref_dev $SCRATCH_MNT
+	$XFS_IO_PROG -f -c "pwrite -i /dev/zero 0 1M" $SCRATCH_MNT/foo > \
+								/dev/null 2>&1
+
+	_set_fs_sysfs_attr ${ref_dev} device_allocation space
+	_get_fs_sysfs_attr ${ref_dev} device_allocation
+	_run_btrfs_balance_start $SCRATCH_MNT >> $seqres.full
+
+	# When testing with different options like ^free-space-tree,
+	# block-group-tree, etc the number of allocated chunks can vary and they
+	# might not be on the same device. Therefore, when we are not using
+	# role-then-space, do not dump chunk location so that the golden output
+	# remains compatible.
+	#dump_chunks
+
+	_set_fs_sysfs_attr ${ref_dev} device_allocation role-then-space
+	_get_fs_sysfs_attr ${ref_dev} device_allocation
+	_run_btrfs_balance_start $SCRATCH_MNT >> $seqres.full
+
+	dump_chunks
+
+	_scratch_unmount
+}
+
+for testcase in "${TEST_VECTORS[@]}"; do
+	verify $testcase
+done
+
+status=0
+exit
diff --git a/tests/btrfs/336.out b/tests/btrfs/336.out
new file mode 100644
index 000000000000..c4d519462538
--- /dev/null
+++ b/tests/btrfs/336.out
@@ -0,0 +1,153 @@
+QA output created by 336
+
+Test Vector: single:single:0:4:4:4:0
+		devid 1 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 40
+		devid 2 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 40
+		devid 3 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 40
+		devid 4 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 40
+		devid 5 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 0
+		devid 6 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 0
+		devid 7 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 0
+		devid 8 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 100
+		devid 9 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 100
+		devid 10 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 100
+		devid 11 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 100
+length X owner X stripe_len X type METADATA|single
+		num_stripes 1 sub_stripes 1
+stripe 0 devid 1 offset X
+length X owner X stripe_len X type DATA|single
+		num_stripes 1 sub_stripes 1
+stripe 0 devid 1 offset X
+[space] role-then-space
+space [role-then-space]
+10 Data/single
+4 Metadata/single
+
+Test Vector: dup:single:0:1:1:1:0
+		devid 1 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 40
+		devid 2 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 100
+length X owner X stripe_len X type DATA|single
+		num_stripes 1 sub_stripes 1
+stripe 0 devid 1 offset X
+length X owner X stripe_len X type METADATA|DUP
+		num_stripes 2 sub_stripes 1
+stripe 0 devid 1 offset X
+stripe 1 devid 1 offset X
+[space] role-then-space
+space [role-then-space]
+2 Data/single
+1 Metadata/DUP
+
+Test Vector: raid1:single:0:2:1:2:0
+		devid 1 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 40
+		devid 2 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 40
+		devid 3 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 100
+		devid 4 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 100
+length X owner X stripe_len X type DATA|single
+		num_stripes 1 sub_stripes 1
+stripe 0 devid 1 offset X
+length X owner X stripe_len X type METADATA|RAID1
+		num_stripes 2 sub_stripes 1
+stripe 0 devid 4 offset X
+stripe 1 devid 2 offset X
+[space] role-then-space
+space [role-then-space]
+4 Data/single
+1 Metadata/RAID1
+2 Metadata/RAID1
+
+Test Vector: raid10:raid10:0:4:1:4:0
+		devid 1 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 40
+		devid 2 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 40
+		devid 3 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 40
+		devid 4 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 40
+		devid 5 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 100
+		devid 6 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 100
+		devid 7 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 100
+		devid 8 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 100
+length X owner X stripe_len X type METADATA|RAID10
+		num_stripes 8 sub_stripes 2
+stripe 0 devid 5 offset X
+stripe 1 devid 6 offset X
+stripe 2 devid 7 offset X
+stripe 3 devid 8 offset X
+stripe 4 devid 1 offset X
+stripe 5 devid 2 offset X
+stripe 6 devid 3 offset X
+stripe 7 devid 4 offset X
+length X owner X stripe_len X type DATA|RAID10
+		num_stripes 8 sub_stripes 2
+stripe 0 devid 1 offset X
+stripe 1 devid 2 offset X
+stripe 2 devid 3 offset X
+stripe 3 devid 4 offset X
+stripe 4 devid 5 offset X
+stripe 5 devid 6 offset X
+stripe 6 devid 7 offset X
+stripe 7 devid 8 offset X
+[space] role-then-space
+space [role-then-space]
+1 Data/RAID10
+2 Data/RAID10
+3 Data/RAID10
+4 Data/RAID10
+5 Data/RAID10
+6 Data/RAID10
+7 Data/RAID10
+8 Data/RAID10
+1 Metadata/RAID10
+2 Metadata/RAID10
+3 Metadata/RAID10
+4 Metadata/RAID10
+5 Metadata/RAID10
+6 Metadata/RAID10
+7 Metadata/RAID10
+8 Metadata/RAID10
+
+Test Vector: raid1:raid1:0:1:0:1:0
+		devid 1 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 40
+		devid 2 total_bytes X bytes_used X
+		io_align X io_width X sector_size X type 100
+length X owner X stripe_len X type METADATA|RAID1
+		num_stripes 2 sub_stripes 1
+stripe 0 devid 2 offset X
+stripe 1 devid 1 offset X
+length X owner X stripe_len X type DATA|RAID1
+		num_stripes 2 sub_stripes 1
+stripe 0 devid 1 offset X
+stripe 1 devid 2 offset X
+[space] role-then-space
+space [role-then-space]
+1 Data/RAID1
+2 Data/RAID1
+1 Metadata/RAID1
+2 Metadata/RAID1
+
+Test Vector: single:single:1:0:0:0:1
+ERROR: Metadata_only and or Data_only is not yet supported
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles
  2025-05-12 18:07 [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Anand Jain
                   ` (11 preceding siblings ...)
  2025-05-12 18:11 ` [PATCH RFC 0/2] fstests: btrfs: add functional verification for device roles Anand Jain
@ 2025-05-20  9:19 ` Forza
  2025-05-21  8:37   ` Anand Jain
  2025-05-22  4:07 ` Zygo Blaxell
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 44+ messages in thread
From: Forza @ 2025-05-20  9:19 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs

Hi,

On 2025-05-12 20:07, Anand Jain wrote:
> In host hardware, devices can have different speeds. Generally, faster
> devices come with lesser capacity while slower devices come with larger
> capacity. A typical configuration would expect that:
> 
>   - A filesystem's read/write performance is evenly distributed on average
>   across the entire filesystem. This is not achievable with the current
>   allocation method because chunks are allocated based only on device free
>   space.
> 
>   - Typically, faster devices are assigned to metadata chunk allocations
>   while slower devices are assigned to data chunk allocations.
> 
> Introducing Device Roles:
> 
>   Here I define 5 device roles in a specific order for metadata and in the
>   reverse order for data: metadata_only, metadata, none, data, data_only.
>   One or more devices may have the same role.
> 
>   The metadata and data roles indicate preference but not exclusivity for
>   that role, whereas data_only and metadata_only are exclusive roles.

This sounds like the old preferred_metadata (Allocator Hints) patch
series from Goffredo Baroncelli[1] back in 2020, now being
maintained and improved by Kai Krakow[2] and others. Is this an
updated/enhanced version of those patches?


> 
> Introducing Role-then-Space allocation method:
> 
>   Metadata allocation can happen on devices with the roles metadata_only,
>   metadata, none, and data in that order. If multiple devices share a role,
>   they are arranged based on device free space.
> 
>   Similarly, data allocation can happen on devices with the roles data_only,
>   data, none, and metadata in that order. If multiple devices share a role,
>   they are arranged based on device free space.

The Allocator Hints patch series shows that this is a good method. Several
of us use those patches, including in production environments, to good
effect. Some argue that having more tiers would be beneficial; tiers could
be combined with defrag or balance operations to place data on slow or
fast storage.
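
For reference, the role-then-space ordering quoted above can be sketched
like this. It is only an illustration based on the cover letter text: the
struct, the enum, and order_candidates() are hypothetical names, and
sorting devices with more free space first within a role is my assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical roles, listed in the metadata preference order. */
enum alloc_role {
	ALLOC_ROLE_METADATA_ONLY = 0,
	ALLOC_ROLE_METADATA,
	ALLOC_ROLE_NONE,
	ALLOC_ROLE_DATA,
	ALLOC_ROLE_DATA_ONLY,
};

struct alloc_dev {
	int devid;
	enum alloc_role role;
	uint64_t free_bytes;
};

/* Tie-break within one role: more free space first (assumption). */
static int cmp_space(const struct alloc_dev *x, const struct alloc_dev *y)
{
	if (x->free_bytes != y->free_bytes)
		return x->free_bytes > y->free_bytes ? -1 : 1;
	return 0;
}

/* Metadata: ascending enum order, then free space. */
static int cmp_metadata(const void *a, const void *b)
{
	const struct alloc_dev *x = a, *y = b;

	if (x->role != y->role)
		return x->role < y->role ? -1 : 1;
	return cmp_space(x, y);
}

/* Data: the reverse role order, then free space. */
static int cmp_data(const void *a, const void *b)
{
	const struct alloc_dev *x = a, *y = b;

	if (x->role != y->role)
		return x->role > y->role ? -1 : 1;
	return cmp_space(x, y);
}

/*
 * Drop devices whose exclusive role forbids this chunk type, then sort the
 * rest in place. Returns the number of candidate devices left in @devs.
 */
static size_t order_candidates(struct alloc_dev *devs, size_t n,
			       bool for_metadata)
{
	enum alloc_role excluded = for_metadata ? ALLOC_ROLE_DATA_ONLY :
						  ALLOC_ROLE_METADATA_ONLY;
	size_t kept = 0;
	size_t i;

	assert(devs || n == 0);

	for (i = 0; i < n; i++) {
		if (devs[i].role != excluded)
			devs[kept++] = devs[i];
	}

	qsort(devs, kept, sizeof(*devs), for_metadata ? cmp_metadata : cmp_data);
	return kept;
}
```

A metadata allocation would then take the first devices from the sorted
array, as many as the profile needs, and a data allocation would do the
same with the reverse role order.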

> 
> Finding device speed automatically:
> 
>   Measuring device read/write latency for the allocaiton is not good idea,
>   as the historical readings and may be misleading, as they could include
>   iostat data from periods with issues that have since been fixed. Testing
>   to determine relative latency and arranging in ascending order for metadata
>   and descending for data is possible, but is better handled by an external
>   tool that can still set device roles.

Benchmarks of the round-robin, latency, latency-round-robin, and queue
based scheduling policies show that latency based allocation can be
particularly useful for some workloads and device types. It is difficult
to generalise, but based on the benchmarks, a queue based approach is a
good all-rounder. See [3] for a complete set of raw data from these
benchmarks.


|  # | Storage    | Jobs | Test                | Policy      |   IOPS  |
| -: | :--------- | ---: | :------------------ | :---------- | ------: |
|  1 | HDD RAID1  |    1 | RandRead 4 KiB QD1  | pid         |      81 |
|  2 | HDD RAID1  |    1 | RandRead 4 KiB QD1  | round-robin |      93 |
|  3 | HDD RAID1  |    1 | RandRead 4 KiB QD1  | latency     |      89 |
|  4 | HDD RAID1  |    1 | RandRead 4 KiB QD1  | latency-rr  |      87 |
|  5 | HDD RAID1  |    1 | RandRead 4 KiB QD1  | queue       |     102 |
|  6 | SSD RAID10 |    1 | RandRead 4 KiB QD32 | pid         |  68 800 |
|  7 | SSD RAID10 |    1 | RandRead 4 KiB QD32 | round-robin | 143 000 |
|  8 | SSD RAID10 |    1 | RandRead 4 KiB QD32 | latency     | 142 000 |
|  9 | SSD RAID10 |    1 | RandRead 4 KiB QD32 | latency-rr  | 137 000 |
| 10 | SSD RAID10 |    1 | RandRead 4 KiB QD32 | queue       | 143 000 |

(table wraps)

|  # | Policy      | BW (KiB/s) | Avg Lat (ms) | 99 % Lat | 99.9 % Lat |
| -: | :---------- | ---------: | -----------: | -------: | ---------: |
|  1 | pid         |        328 |        0.310 |   30.016 |    242.222 |
|  2 | round-robin |        374 |        0.091 |   26.084 |     60.031 |
|  3 | latency     |        358 |        0.041 |   26.608 |     32.900 |
|  4 | latency-rr  |        348 |        0.041 |   28.181 |     33.817 |
|  5 | queue       |        409 |        0.050 |   24.511 |     35.390 |
|  6 | pid         |    275 456 |        0.458 |    8.029 |     10.290 |
|  7 | round-robin |    572 416 |        0.217 |    0.338 |      0.627 |
|  8 | latency     |    569 344 |        0.219 |    0.306 |      0.400 |
|  9 | latency-rr  |    547 840 |        0.227 |    0.326 |      0.449 |
| 10 | queue       |    571 392 |        0.218 |    0.457 |      0.594 |

I think md uses a mix of queue based and sector-distance based approaches,
depending on device type[4].

> 
> On-Disk Format changes:
> 
>   The following items are defined but are unused on-disk format:
> 
> 	btrfs_dev_item::
> 	 __le64 type; // unused
> 	 __le64 start_offset; // unused
> 	 __le32 dev_group; // unused
> 	 __u8 seek_speed; // unused
> 	 __u8 bandwidth; // unused
> 
>   The device roles is using the dev_item::type 8-bit field to store each
>   device's role.
>
> Anand Jain (10):
>   btrfs: fix thresh scope in should_alloc_chunk()
>   btrfs: refactor should_alloc_chunk() arg type
>   btrfs: introduce btrfs_split_sysfs_arg() for argument parsing
>   btrfs: introduce device allocation method
>   btrfs: sysfs: show device allocation method
>   btrfs: skip device sorting when only one device is present
>   btrfs: refactor chunk allocation device handling to use list_head
>   btrfs: introduce explicit device roles for block groups
>   btrfs: introduce ROLE_THEN_SPACE device allocation method
>   btrfs: pass device roles through device add ioctl



Have you considered how to deal with `df` and disk free calculation? Are 
device roles preserved during `btrfs device replace`?

Thank you!

[1] https://lore.kernel.org/linux-btrfs/20210116002533.GE31381@hungrycats.org/T/
[2] https://github.com/kakra/linux/pull/36
[3] https://gist.github.com/kakra/ce99896e5915f9b26d13c5637f56ff37
[4] https://github.com/torvalds/linux/blob/a5806cd506af5a7c19bcd596e4708b5c464bfd21/drivers/md/raid1.c#L832-L843





^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles
  2025-05-20  9:19 ` [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Forza
@ 2025-05-21  8:37   ` Anand Jain
  0 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-05-21  8:37 UTC (permalink / raw)
  To: Forza, linux-btrfs



On 20/5/25 17:19, Forza wrote:
> Hi,
> 
> On 2025-05-12 20:07, Anand Jain wrote:
>> In host hardware, devices can have different speeds. Generally, faster
>> devices come with lesser capacity while slower devices come with larger
>> capacity. A typical configuration would expect that:
>>
>>   - A filesystem's read/write performance is evenly distributed on average
>>   across the entire filesystem. This is not achievable with the current
>>   allocation method because chunks are allocated based only on device
>>   free space.
>>
>>   - Typically, faster devices are assigned to metadata chunk allocations
>>   while slower devices are assigned to data chunk allocations.
>>
>> Introducing Device Roles:
>>
>>   Here I define 5 device roles in a specific order for metadata and in the
>>   reverse order for data: metadata_only, metadata, none, data, data_only.
>>   One or more devices may have the same role.
>>
>>   The metadata and data roles indicate preference but not exclusivity for
>>   that role, whereas data_only and metadata_only are exclusive roles.
> 
> This sounds like the old preferred_metadata (Allocator Hints) patch 
> series from Goffredo Baroncelli[1] back in 2020, now being 
> maintained and improved by Kai Krakow[2] and others. Is this an 
> updated/enhanced version of those patches?
> 

Thanks for the comments.

I haven't reviewed the implementation details of [1], so I can't make a
direct comparison. The goal here is to define a generic device priority
range from 1 to 255, which can be externally assigned and stored.

In one of the current modes under development, ROLE_THEN_SPACE, devices
are first grouped by three priority levels, then sorted by available
free space at the time of allocation.
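
As a rough userspace illustration only (not the code from this series),
the ordering could be sketched as below. It follows the role order listed
in the cover letter; the enum values, `struct dev`, and the function
names are hypothetical names chosen for the example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical role values, listed in metadata-preference order. */
enum dev_role {
	ROLE_METADATA_ONLY,
	ROLE_METADATA,
	ROLE_NONE,
	ROLE_DATA,
	ROLE_DATA_ONLY,
};

/* Hypothetical per-device state used only by this sketch. */
struct dev {
	int devid;
	enum dev_role role;
	uint64_t free_bytes;
};

/* Rank for metadata allocation; lower is preferred. */
static int metadata_rank(enum dev_role role)
{
	return (int)role;
}

/* Data allocation walks the same list in reverse. */
static int data_rank(enum dev_role role)
{
	return (int)ROLE_DATA_ONLY - (int)role;
}

/* Within one role, the device with more free space comes first. */
static int cmp_free_desc(const struct dev *a, const struct dev *b)
{
	if (a->free_bytes != b->free_bytes)
		return a->free_bytes > b->free_bytes ? -1 : 1;
	return 0;
}

static int cmp_metadata(const void *pa, const void *pb)
{
	const struct dev *a = pa, *b = pb;
	int ra = metadata_rank(a->role), rb = metadata_rank(b->role);

	if (ra != rb)
		return ra < rb ? -1 : 1;
	return cmp_free_desc(a, b);
}

static int cmp_data(const void *pa, const void *pb)
{
	const struct dev *a = pa, *b = pb;
	int ra = data_rank(a->role), rb = data_rank(b->role);

	if (ra != rb)
		return ra < rb ? -1 : 1;
	return cmp_free_desc(a, b);
}

/*
 * Sort devs[] for a metadata chunk and return how many leading entries
 * are usable. data_only devices sort last and are trimmed off.
 */
static size_t order_for_metadata(struct dev *devs, size_t n)
{
	qsort(devs, n, sizeof(*devs), cmp_metadata);
	while (n > 0 && devs[n - 1].role == ROLE_DATA_ONLY)
		n--;
	return n;
}

/* Same for a data chunk; metadata_only devices are trimmed off. */
static size_t order_for_data(struct dev *devs, size_t n)
{
	qsort(devs, n, sizeof(*devs), cmp_data);
	while (n > 0 && devs[n - 1].role == ROLE_METADATA_ONLY)
		n--;
	return n;
}
```

The exclusive roles never need a separate filter pass here: each one
ranks last for the opposite chunk type, so it ends up at the tail of
the sorted array and can simply be cut off.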

I’m calling them generic device priorities because even when all devices
have similar performance—as is common in most general-purpose setups—we
can still use priorities to enable simple, linear allocation for the
single profile.

>>
>> Introducing Role-then-Space allocation method:
>>
>>   Metadata allocation can happen on devices with the roles metadata_only,
>>   metadata, none, and data in that order. If multiple devices share a
>>   role, they are arranged based on device free space.
>>
>>   Similarly, data allocation can happen on devices with the roles
>>   data_only, data, none, and metadata in that order. If multiple devices
>>   share a role, they are arranged based on device free space.
> 
> The Allocator Hints patch series show that this is a good method. We are 
> several users that use those, also in production environments to good 
> effect. Some argue that having more tiers would be beneficial, it could 
> be combined with defrag or balance operation to place data on slow or 
> fast storage.
> 
>>
>> Finding device speed automatically:
>>
>>   Measuring device read/write latency for the allocation is not a good
>>   idea, as the historical readings may be misleading: they could include
>>   iostat data from periods with issues that have since been fixed. Testing
>>   to determine relative latency and arranging in ascending order for
>>   metadata and descending for data is possible, but is better handled by
>>   an external tool that can still set device roles.
> 
> Benchmarks using round-robin, latency and latency-round-robin and queue 
> based scheduling show that latency based allocation can be particularly 
> useful for some workloads and device types. It is difficult to 
> generalise, but based on benchmarks we see that a good all-rounder is a 
> queue based approach. See [3] for a complete set of raw data from these 
> benchmarks.
> 

I'm not commenting on the implementation details. My point is that
dynamic latency-based allocation was previously rejected because
temporary latency spikes can mislead the allocator and cause data to
land on the wrong device.

That said, for reads, there are indeed patches that support a latency-
based read_policy.

> 
> |  # | Storage    | Jobs | Test                | Policy      |   IOPS  |
> | -: | :--------- | ---: | :------------------ | :---------- | ------: |
> |  1 | HDD RAID1  |    1 | RandRead 4 KiB QD1  | pid         |      81 |
> |  2 | HDD RAID1  |    1 | RandRead 4 KiB QD1  | round-robin |      93 |
> |  3 | HDD RAID1  |    1 | RandRead 4 KiB QD1  | latency     |      89 |
> |  4 | HDD RAID1  |    1 | RandRead 4 KiB QD1  | latency-rr  |      87 |
> |  5 | HDD RAID1  |    1 | RandRead 4 KiB QD1  | queue       |     102 |
> |  6 | SSD RAID10 |    1 | RandRead 4 KiB QD32 | pid         |  68 800 |
> |  7 | SSD RAID10 |    1 | RandRead 4 KiB QD32 | round-robin | 143 000 |
> |  8 | SSD RAID10 |    1 | RandRead 4 KiB QD32 | latency     | 142 000 |
> |  9 | SSD RAID10 |    1 | RandRead 4 KiB QD32 | latency-rr  | 137 000 |
> | 10 | SSD RAID10 |    1 | RandRead 4 KiB QD32 | queue       | 143 000 |
> 
> (table wraps)
> 
> |  # | Policy      | BW (KiB/s) | Avg Lat (ms) | 99 % Lat | 99.9 % Lat |
> | -: | :---------- | ---------: | -----------: | -------: | ---------: |
> |  1 | pid         |        328 |        0.310 |   30.016 |    242.222 |
> |  2 | round-robin |        374 |        0.091 |   26.084 |     60.031 |
> |  3 | latency     |        358 |        0.041 |   26.608 |     32.900 |
> |  4 | latency-rr  |        348 |        0.041 |   28.181 |     33.817 |
> |  5 | queue       |        409 |        0.050 |   24.511 |     35.390 |
> |  6 | pid         |    275 456 |        0.458 |    8.029 |     10.290 |
> |  7 | round-robin |    572 416 |        0.217 |    0.338 |      0.627 |
> |  8 | latency     |    569 344 |        0.219 |    0.306 |      0.400 |
> |  9 | latency-rr  |    547 840 |        0.227 |    0.326 |      0.449 |
> | 10 | queue       |    571 392 |        0.218 |    0.457 |      0.594 |
> 
> I think md uses a mix of queue based and sector-distance based approach 
> depending on device type[4].
> 
>>
>> On-Disk Format changes:
>>
>>   The following items are defined but are unused on-disk format:
>>
>>     btrfs_dev_item::
>>      __le64 type; // unused
>>      __le64 start_offset; // unused
>>      __le32 dev_group; // unused
>>      __u8 seek_speed; // unused
>>      __u8 bandwidth; // unused
>>
>>   The device roles is using the dev_item::type 8-bit field to store each
>>   device's role.
>>
>> Anand Jain (10):
>>   btrfs: fix thresh scope in should_alloc_chunk()
>>   btrfs: refactor should_alloc_chunk() arg type
>>   btrfs: introduce btrfs_split_sysfs_arg() for argument parsing
>>   btrfs: introduce device allocation method
>>   btrfs: sysfs: show device allocation method
>>   btrfs: skip device sorting when only one device is present
>>   btrfs: refactor chunk allocation device handling to use list_head
>>   btrfs: introduce explicit device roles for block groups
>>   btrfs: introduce ROLE_THEN_SPACE device allocation method
>>   btrfs: pass device roles through device add ioctl
> 
> 
> 
> Have you considered how to deal with `df` and disk free calculation? Are 
> device roles preserved during `btrfs device replace`?
>

This is the foundational framework; the remaining features will be added
progressively.

Thanks!
Anand

> Thank you!
> 
> [1] https://lore.kernel.org/linux-btrfs/20210116002533.GE31381@hungrycats.org/T/
> [2] https://github.com/kakra/linux/pull/36
> [3] https://gist.github.com/kakra/ce99896e5915f9b26d13c5637f56ff37
> [4] https://github.com/torvalds/linux/blob/a5806cd506af5a7c19bcd596e4708b5c464bfd21/drivers/md/raid1.c#L832-L843
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles
  2025-05-12 18:07 [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Anand Jain
                   ` (12 preceding siblings ...)
  2025-05-20  9:19 ` [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Forza
@ 2025-05-22  4:07 ` Zygo Blaxell
  2025-06-02  4:26   ` Anand Jain
  2025-05-22 18:19 ` waxhead
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 44+ messages in thread
From: Zygo Blaxell @ 2025-05-22  4:07 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Tue, May 13, 2025 at 02:07:06AM +0800, Anand Jain wrote:
> In host hardware, devices can have different speeds. Generally, faster
> devices come with lesser capacity while slower devices come with larger
> capacity. A typical configuration would expect that:
> 
>  - A filesystem's read/write performance is evenly distributed on average
>  across the entire filesystem. This is not achievable with the current
>  allocation method because chunks are allocated based only on device free
>  space.
> 
>  - Typically, faster devices are assigned to metadata chunk allocations
>  while slower devices are assigned to data chunk allocations.
> 
> Introducing Device Roles:
> 
>  Here I define 5 device roles in a specific order for metadata and in the
>  reverse order for data: metadata_only, metadata, none, data, data_only.
>  One or more devices may have the same role.
>
>  The metadata and data roles indicate preference but not exclusivity for
>  that role, whereas data_only and metadata_only are exclusive roles.

Using role-based names like these presents three problems:

1. **Stripe incompatibility** -- These roles imply a hierarchy that breaks
in some multi-device arrays. e.g. with 5 devices of equal size and mixed
roles ("data_only" vs "data"), it's impossible to form a 5-device-wide
data chunk.

2. **Poor extensibility** -- The role system doesn't scale when
introducing additional allocation types. Any new category (e.g. PPL or
journal) would require duplicating preference permutations like "data,
then journal, then metadata" vs "journal, then data, then metadata",
resulting in combinatorial explosion.

3. **Misleading terminology** -- The name "none" is used in a misleading
way.  That name should be reserved for the case that prohibits all new
chunk allocations--a critical use case for array reshaping. A clearer
term would be "default," but the scheme would be even clearer if all
the legacy role names were discarded.

I suggest replacing roles with a pair of orthogonal properties per device
for each allocation type:

* Per-type tier level -- A simple u8 tier number that expresses allocation
preference. Allocators attempt to satisfy allocation using devices at
the lowest available tier, expanding the set to higher tiers as needed
until the minimum number of devices is reached.

* Per-type enable bit -- Indicates whether the device allows allocations
of that type at all. This can be stored explicitly, or encoded using a
reserved tier value (e.g. 0xFF = disabled).

Encoding this way makes "0" a reasonable default value for each field.

Then you get all of the required combinations, e.g.

* metadata 0, data 0 - what btrfs does now, equal preference

* metadata 2, data 1 - metadata preferred, data allowed

* metadata 1, data 2 - data preferred, metadata allowed

* metadata 0, data 255 - metadata only, no data

* metadata 255, data 0 - data only, no metadata

* metadata 255, data 255 - no new chunk allocations

This model offers cleaner semantics and more robust scaling:

* It eliminates unintended allocation spillover. A device either allows
data/metadata, or it doesn't.
* It expresses preference via explicit tiering rather than role overlap.
* It generalizes easily to future allocation types without rewriting
role logic.
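
A minimal sketch of the tier expansion described above, assuming the
"reserved tier value" encoding (0xFF = disabled). `TIER_DISABLED` and
`pick_max_tier()` are hypothetical names for illustration, not an
existing btrfs interface:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed reserved value: device takes no allocations of this type. */
#define TIER_DISABLED 0xFF

/*
 * tiers[i] is the tier of device i for one allocation type (data or
 * metadata). Starting from the lowest tier, widen the candidate set one
 * tier at a time until at least min_devs devices are eligible.
 *
 * Returns the highest tier that had to be included, or -1 if even all
 * enabled devices together cannot satisfy min_devs. min_devs is
 * expected to be at least 1.
 */
static int pick_max_tier(const uint8_t *tiers, size_t n, size_t min_devs)
{
	size_t eligible = 0;

	for (int tier = 0; tier < TIER_DISABLED; tier++) {
		for (size_t i = 0; i < n; i++)
			if (tiers[i] == tier)
				eligible++;
		if (eligible >= min_devs)
			return tier;
	}
	return -1;
}
```

With metadata tiers {0, 0, 2, 255} a two-device profile stays on tier 0,
a three-device profile spills to tier 2, and a four-device profile fails
because the fourth device is disabled for metadata.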

"Allow nothing" is an important case for reshaping arrays.  If you are
upgrading 4 out of 12 disks in a striped raid filesystem, you don't
want to rewrite all the data in the filesystem 4 times.  Instead, set
the devices you want to remove to "allow nothing", run a balance with a
`devid` filter targeting each device to evacuate the data, and then run
device delete on the 4 empty drives.

> Introducing Role-then-Space allocation method:
> 
>  Metadata allocation can happen on devices with the roles metadata_only,
>  metadata, none, and data in that order. If multiple devices share a role,
>  they are arranged based on device free space.
> 
>  Similarly, data allocation can happen on devices with the roles data_only,
>  data, none, and metadata in that order. If multiple devices share a role,
>  they are arranged based on device free space.
> 
> Finding device speed automatically:
> 
>  Measuring device read/write latency for the allocation is not a good idea,
>  as the historical readings may be misleading: they could include
>  iostat data from periods with issues that have since been fixed. Testing
>  to determine relative latency and arranging in ascending order for metadata
>  and descending for data is possible, but is better handled by an external
>  tool that can still set device roles.
> 
> On-Disk Format changes:
> 
>  The following items are defined but are unused on-disk format:
> 
> 	btrfs_dev_item::
> 	 __le64 type; // unused
> 	 __le64 start_offset; // unused
> 	 __le32 dev_group; // unused
> 	 __u8 seek_speed; // unused
> 	 __u8 bandwidth; // unused
> 
>  The device roles is using the dev_item::type 8-bit field to store each
>  device's role.

In the other implementations of this idea, allocation roles are stored in
`dev_item::type`, a single `u8` field, for simplicity; however, it would
be better to store these roles in the filesystem tree--e.g. using a
`BTRFS_PERSISTENT_ITEM_KEY` with a dedicated objectid for allocation
roles, and offset values corresponding to device IDs. This would enable
versioning of the schema and flexible extension (e.g., to add migration
policies, size-based allocation preferences, or other enhancements).

Since btrfs loads the trees before allocation can occur, tree-based
role data will be available in time for allocation, and we don't need
to store roles in the superblocks.
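
To make the key layout concrete, here is a small sketch under stated
assumptions. `BTRFS_PERSISTENT_ITEM_KEY` is the existing key type (its
value is copied from memory of the uapi header and should be checked
there); the objectid, the key struct, and the item payload are
hypothetical and exist only for this example.

```c
#include <stdint.h>

/* Existing key type; value believed to match btrfs_tree.h, verify it. */
#define BTRFS_PERSISTENT_ITEM_KEY 249

/* HYPOTHETICAL objectid for allocation roles; not defined in btrfs. */
#define EXAMPLE_ALLOC_ROLES_OBJECTID 100ULL

/* CPU-order stand-in for a btrfs key (objectid, type, offset). */
struct example_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* HYPOTHETICAL payload; the version byte allows schema changes later. */
struct example_alloc_role_item {
	uint8_t version;
	uint8_t metadata_tier;
	uint8_t data_tier;
};

/* One item per device: the key offset carries the device ID. */
static struct example_key alloc_role_key(uint64_t devid)
{
	struct example_key key = {
		.objectid = EXAMPLE_ALLOC_ROLES_OBJECTID,
		.type = BTRFS_PERSISTENT_ITEM_KEY,
		.offset = devid,
	};

	return key;
}
```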

A longer version of this with use cases and some discussion is available
here:

	https://github.com/kakra/linux/pull/36#issuecomment-2784251968

	https://github.com/kakra/linux/pull/36#issuecomment-2784434490

> Anand Jain (10):
>   btrfs: fix thresh scope in should_alloc_chunk()
>   btrfs: refactor should_alloc_chunk() arg type
>   btrfs: introduce btrfs_split_sysfs_arg() for argument parsing
>   btrfs: introduce device allocation method
>   btrfs: sysfs: show device allocation method
>   btrfs: skip device sorting when only one device is present
>   btrfs: refactor chunk allocation device handling to use list_head
>   btrfs: introduce explicit device roles for block groups
>   btrfs: introduce ROLE_THEN_SPACE device allocation method
>   btrfs: pass device roles through device add ioctl
> 
>  fs/btrfs/block-group.c |  11 +-
>  fs/btrfs/ioctl.c       |  12 +-
>  fs/btrfs/sysfs.c       | 130 ++++++++++++++++++++--
>  fs/btrfs/volumes.c     | 242 +++++++++++++++++++++++++++++++++--------
>  fs/btrfs/volumes.h     |  35 +++++-
>  5 files changed, 366 insertions(+), 64 deletions(-)
> 
> -- 
> 2.49.0
> 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles
  2025-05-12 18:07 [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Anand Jain
                   ` (13 preceding siblings ...)
  2025-05-22  4:07 ` Zygo Blaxell
@ 2025-05-22 18:19 ` waxhead
  2025-06-02  4:25   ` Anand Jain
  2025-05-22 20:39 ` Ferry Toth
  2025-05-30  0:15 ` Jani Partanen
  16 siblings, 1 reply; 44+ messages in thread
From: waxhead @ 2025-05-22 18:19 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs

Anand Jain wrote:
> In host hardware, devices can have different speeds. Generally, faster
> devices come with lesser capacity while slower devices come with larger
> capacity. A typical configuration would expect that:
> 
>   - A filesystem's read/write performance is evenly distributed on average
>   across the entire filesystem. This is not achievable with the current
>   allocation method because chunks are allocated based only on device free
>   space.
> 
>   - Typically, faster devices are assigned to metadata chunk allocations
>   while slower devices are assigned to data chunk allocations.
> 
> Introducing Device Roles:
> 
>   Here I define 5 device roles in a specific order for metadata and in the
>   reverse order for data: metadata_only, metadata, none, data, data_only.
>   One or more devices may have the same role.
> 
>   The metadata and data roles indicate preference but not exclusivity for
>   that role, whereas data_only and metadata_only are exclusive roles.

As a BTRFS user I would like to comment a bit on this. I have earlier 
mentioned that I think that BTRFS should allow for device groups. E.g. 
assigning a storage device to one or more groups (or vice versa).

I really like what is being introduced here, but I would like to suggest 
taking this a step further. Instead of assigning a role to the storage 
device itself, maybe it would be wiser to follow a scheme like this:

DeviceID -> Group(s) -> Group properties

In this case what is being introduced here could easily be dealt with as 
a simple group property like (meta)data_weight=0...128 for example.
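
One possible shape for such a scheme, as a sketch only: group membership
is a bitmask per device, and the weights live on the group. All struct,
field, and function names below are hypothetical, and how weights from
several groups would combine is deliberately left open.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_GROUPS 32

/* Hypothetical per-group properties, e.g. weights in the range 0..128. */
struct group_props {
	const char *name;
	uint8_t metadata_weight;
	uint8_t data_weight;
};

/* A device only records which groups it belongs to. */
struct grouped_dev {
	int devid;
	uint32_t groups;	/* bit g set: member of group g */
};

/*
 * Collect the device IDs that belong to one group. Up to out_len IDs
 * are written to out[]; the return value is the total member count.
 */
static size_t devids_in_group(const struct grouped_dev *devs, size_t n,
			      unsigned int group, int *out, size_t out_len)
{
	size_t found = 0;

	if (group >= MAX_GROUPS)
		return 0;

	for (size_t i = 0; i < n; i++) {
		if (!(devs[i].groups & (UINT32_C(1) << group)))
			continue;
		if (found < out_len)
			out[found] = devs[i].devid;
		found++;
	}
	return found;
}
```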

Personally I think that would have been a much cleaner interface.

Setting metadata/data roles as originally suggested here would be fine 
with a low number of devices, but on larger storage arrays with many 
devices it sounds (to me) like it would quickly become difficult to keep 
track of.

With the scheme I suggest you would simply list the properties of a 
group and see which DeviceIDs belong to that group... perhaps even 
in a nice table if you were lucky.

(And just for the record: other properties that, off the top of my head, 
I imagine would be useful are a read/write weight that could 
(automatically) be set higher and higher if a device starts to throw 
errors, or group_exclusive=1|0 to prevent other groups from owning that 
DeviceID, etc.)

And this would of course require another step after mkfs, but personally 
I do not understand why setting these roles (or the scheme I suggest) 
would be very useful at mkfs time. It might as well be done at first 
mount before the filesystem gets put to use.

Great to see progress for BTRFS on things like this, but please do 
consider another scheme for setting the roles.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles
  2025-05-12 18:07 [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Anand Jain
                   ` (14 preceding siblings ...)
  2025-05-22 18:19 ` waxhead
@ 2025-05-22 20:39 ` Ferry Toth
  2025-06-02  4:24   ` Anand Jain
  2025-05-30  0:15 ` Jani Partanen
  16 siblings, 1 reply; 44+ messages in thread
From: Ferry Toth @ 2025-05-22 20:39 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs

Hi,

Op 12-05-2025 om 20:07 schreef Anand Jain:
> In host hardware, devices can have different speeds. Generally, faster
> devices come with lesser capacity while slower devices come with larger
> capacity. A typical configuration would expect that:
> 
>   - A filesystem's read/write performance is evenly distributed on average
>   across the entire filesystem. This is not achievable with the current
>   allocation method because chunks are allocated based only on device free
>   space.
> 
>   - Typically, faster devices are assigned to metadata chunk allocations
>   while slower devices are assigned to data chunk allocations.

Finally a new effort in this direction.

> Introducing Device Roles:
> 
>   Here I define 5 device roles in a specific order for metadata and in the
>   reverse order for data: metadata_only, metadata, none, data, data_only.
>   One or more devices may have the same role.
> 
>   The metadata and data roles indicate preference but not exclusivity for
>   that role, whereas data_only and metadata_only are exclusive roles.
> 
> Introducing Role-then-Space allocation method:
> 
>   Metadata allocation can happen on devices with the roles metadata_only,
>   metadata, none, and data in that order. If multiple devices share a role,
>   they are arranged based on device free space.
> 
>   Similarly, data allocation can happen on devices with the roles data_only,
>   data, none, and metadata in that order. If multiple devices share a role,
>   they are arranged based on device free space.

I can see the use case for large pools of disks used in server 
environments where disks get assigned a role.

For desktop use I would like it a lot better with no roles, just a 
performance-based chunk allocation to select between an SSD and an HDD, 
used more like a hint to the allocator. Really nothing should 
go wrong if data or metadata gets allocated on the wrong / 
sub-optimal disk.

This could then bring back the old hot relocation idea, finally.

> Finding device speed automatically:
> 
>   Measuring device read/write latency for the allocation is not a good idea,
>   as the historical readings may be misleading: they could include
>   iostat data from periods with issues that have since been fixed. Testing
>   to determine relative latency and arranging in ascending order for metadata
>   and descending for data is possible, but is better handled by an external
>   tool that can still set device roles.
> 
> On-Disk Format changes:
> 
>   The following items are defined but are unused on-disk format:
> 
> 	btrfs_dev_item::
> 	 __le64 type; // unused
> 	 __le64 start_offset; // unused
> 	 __le32 dev_group; // unused
> 	 __u8 seek_speed; // unused
> 	 __u8 bandwidth; // unused
> 
>   The device roles is using the dev_item::type 8-bit field to store each
>   device's role.

I think filling the fields with either measured or user-entered data 
should be fine, as long as you can re-measure or re-enter when the disk 
behavior changes.

The difference between an SSD and an HDD will be so huge that small 
changes will have no real effect.

> Anand Jain (10):
>    btrfs: fix thresh scope in should_alloc_chunk()
>    btrfs: refactor should_alloc_chunk() arg type
>    btrfs: introduce btrfs_split_sysfs_arg() for argument parsing
>    btrfs: introduce device allocation method
>    btrfs: sysfs: show device allocation method
>    btrfs: skip device sorting when only one device is present
>    btrfs: refactor chunk allocation device handling to use list_head
>    btrfs: introduce explicit device roles for block groups
>    btrfs: introduce ROLE_THEN_SPACE device allocation method
>    btrfs: pass device roles through device add ioctl
> 
>   fs/btrfs/block-group.c |  11 +-
>   fs/btrfs/ioctl.c       |  12 +-
>   fs/btrfs/sysfs.c       | 130 ++++++++++++++++++++--
>   fs/btrfs/volumes.c     | 242 +++++++++++++++++++++++++++++++++--------
>   fs/btrfs/volumes.h     |  35 +++++-
>   5 files changed, 366 insertions(+), 64 deletions(-)
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles
  2025-05-12 18:07 [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Anand Jain
                   ` (15 preceding siblings ...)
  2025-05-22 20:39 ` Ferry Toth
@ 2025-05-30  0:15 ` Jani Partanen
  2025-06-02  4:25   ` Anand Jain
  16 siblings, 1 reply; 44+ messages in thread
From: Jani Partanen @ 2025-05-30  0:15 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs

On 12/05/2025 21.07, Anand Jain wrote:
> In host hardware, devices can have different speeds. Generally, faster
> devices come with lesser capacity while slower devices come with larger
> capacity. A typical configuration would expect that:
>
>   - A filesystem's read/write performance is evenly distributed on average
>   across the entire filesystem. This is not achievable with the current
>   allocation method because chunks are allocated based only on device free
>   space.
>
>   - Typically, faster devices are assigned to metadata chunk allocations
>   while slower devices are assigned to data chunk allocations.

Now, if only this could be expanded to allow tagging fast drives as 
write-cache. For example, I would add 256GB NVMe drives or partitions as 
write-cache so that even with HDDs as main data storage, I would get very 
fast writes.

Of course the cache would need some task to empty it after x time, or it 
would not be of much use. This is currently an issue with lvm caching. 
There are 2 types of cache: one is for read/write, but when the cache is 
full it does not help writes much anymore because it is filled with read 
cache. The other is just a write-cache.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles
  2025-05-22 20:39 ` Ferry Toth
@ 2025-06-02  4:24   ` Anand Jain
  2025-06-04 21:29     ` Ferry Toth
  0 siblings, 1 reply; 44+ messages in thread
From: Anand Jain @ 2025-06-02  4:24 UTC (permalink / raw)
  To: Ferry Toth, linux-btrfs

On 23/5/25 04:39, Ferry Toth wrote:
> Hi,
> 
> Op 12-05-2025 om 20:07 schreef Anand Jain:
>> In host hardware, devices can have different speeds. Generally, faster
>> devices come with lesser capacity while slower devices come with larger
>> capacity. A typical configuration would expect that:
>>
>>   - A filesystem's read/write performance is evenly distributed on average
>>   across the entire filesystem. This is not achievable with the current
>>   allocation method because chunks are allocated based only on device
>>   free space.
>>
>>   - Typically, faster devices are assigned to metadata chunk allocations
>>   while slower devices are assigned to data chunk allocations.
> 
> Finally a new effort in this direction.
> 
>> Introducing Device Roles:
>>
>>   Here I define 5 device roles in a specific order for metadata and in the
>>   reverse order for data: metadata_only, metadata, none, data, data_only.
>>   One or more devices may have the same role.
>>
>>   The metadata and data roles indicate preference but not exclusivity for
>>   that role, whereas data_only and metadata_only are exclusive roles.
>>
>> Introducing Role-then-Space allocation method:
>>
>>   Metadata allocation can happen on devices with the roles metadata_only,
>>   metadata, none, and data in that order. If multiple devices share a
>>   role, they are arranged based on device free space.
>>
>>   Similarly, data allocation can happen on devices with the roles
>>   data_only, data, none, and metadata in that order. If multiple devices
>>   share a role, they are arranged based on device free space.
> 
> I can see the use case for large pools of disks used in server 
> environments where disks get assigned a role.
> 
> For desktop use I would like it a lot better with no roles, just a 
> performance-based chunk allocation to select between an SSD and an HDD, 
> used more like a hint to the allocator. Really nothing should 
> go wrong if data or metadata gets allocated on the wrong / 
> sub-optimal disk.
> 
> This could then bring back the old hot relocation idea, finally.
> 
>> Finding device speed automatically:
>>
>>   Measuring device read/write latency for the allocation is not a good
>>   idea, as the historical readings may be misleading: they could include
>>   iostat data from periods with issues that have since been fixed. Testing
>>   to determine relative latency and arranging in ascending order for
>>   metadata and descending for data is possible, but is better handled by
>>   an external tool that can still set device roles.
>>
>> On-Disk Format changes:
>>
>>   The following items are defined but are unused on-disk format:
>>
>>     btrfs_dev_item::
>>      __le64 type; // unused
>>      __le64 start_offset; // unused
>>      __le32 dev_group; // unused
>>      __u8 seek_speed; // unused
>>      __u8 bandwidth; // unused
>>
>>   The device roles is using the dev_item::type 8-bit field to store each
>>   device's role.
> 
> I think filling the fields with either measured or user-entered data 
> should be fine, as long as you can re-measure or re-enter when the disk 
> behavior changes.
> 
> The difference between an SSD and an HDD will be so huge that small 
> changes will have no real effect.


Yeah, for desktop setups with SSDs and HDDs, the distinction is clear
and stable, so assigning data or metadata based on device type makes
sense. It’s straightforward to handle statically, and a
--set-roles-by-type mkfs option will make it automatic.
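
A hedged userspace sketch of what a by-type assignment could look like:
the rotational flag is read from the existing sysfs attribute
/sys/block/<disk>/queue/rotational and mapped to a preferred role. The
enum and both function names are hypothetical; this is not the mkfs
implementation.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical role names for this example only. */
enum example_role {
	EXAMPLE_ROLE_METADATA,	/* prefer metadata on non-rotational */
	EXAMPLE_ROLE_DATA,	/* prefer data on rotational */
	EXAMPLE_ROLE_NONE,	/* unknown type: no preference */
};

/* Map the sysfs rotational flag (0 or 1) to a preferred role. */
static enum example_role role_from_rotational(int rotational)
{
	if (rotational == 0)
		return EXAMPLE_ROLE_METADATA;
	if (rotational == 1)
		return EXAMPLE_ROLE_DATA;
	return EXAMPLE_ROLE_NONE;
}

/*
 * Read /sys/block/<disk>/queue/rotational for a whole-disk name such as
 * "sda". Returns the value read (0 or 1 in practice), or -1 on error.
 */
static int read_rotational(const char *disk)
{
	char path[256];
	int value = -1;
	int len;
	FILE *f;

	len = snprintf(path, sizeof(path),
		       "/sys/block/%s/queue/rotational", disk);
	if (len < 0 || (size_t)len >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}
```

A caller would combine the two as
`role_from_rotational(read_rotational("sda"))`, so an unreadable
attribute falls back to no preference.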

Even if the SSD temporarily slows down during a balance, we’d still
prefer to keep metadata on it, assuming the slowdown is short-lived.
SSD performance typically recovers, so there's no need to overreact
to transient dips.

For virtual devices, mkfs --set-roles-by-iostat should also work well.
And later if performance characteristics change permanently, a
balance-time option like --recalibrate-role-by-iostat could
re-evaluate based on I/O stats, confirm with the user, and relocate
chunks accordingly.

Also, I'm trying not to introduce too many options or configuration
paths, just enough to keep Btrfs simple to use.

Does that sound reasonable?

Thanks, Anand

>> Anand Jain (10):
>>    btrfs: fix thresh scope in should_alloc_chunk()
>>    btrfs: refactor should_alloc_chunk() arg type
>>    btrfs: introduce btrfs_split_sysfs_arg() for argument parsing
>>    btrfs: introduce device allocation method
>>    btrfs: sysfs: show device allocation method
>>    btrfs: skip device sorting when only one device is present
>>    btrfs: refactor chunk allocation device handling to use list_head
>>    btrfs: introduce explicit device roles for block groups
>>    btrfs: introduce ROLE_THEN_SPACE device allocation method
>>    btrfs: pass device roles through device add ioctl
>>
>>   fs/btrfs/block-group.c |  11 +-
>>   fs/btrfs/ioctl.c       |  12 +-
>>   fs/btrfs/sysfs.c       | 130 ++++++++++++++++++++--
>>   fs/btrfs/volumes.c     | 242 +++++++++++++++++++++++++++++++++--------
>>   fs/btrfs/volumes.h     |  35 +++++-
>>   5 files changed, 366 insertions(+), 64 deletions(-)
>>
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles
  2025-05-22 18:19 ` waxhead
@ 2025-06-02  4:25   ` Anand Jain
  2025-06-06 14:21     ` waxhead
  0 siblings, 1 reply; 44+ messages in thread
From: Anand Jain @ 2025-06-02  4:25 UTC (permalink / raw)
  To: waxhead, linux-btrfs

On 23/5/25 02:19, waxhead wrote:
> Anand Jain wrote:
>> In host hardware, devices can have different speeds. Generally, faster
>> devices come with lesser capacity while slower devices come with larger
>> capacity. A typical configuration would expect that:
>>
>>   - A filesystem's read/write performance is evenly distributed on average
>>   across the entire filesystem. This is not achievable with the current
>>   allocation method because chunks are allocated based only on device
>>   free space.
>>
>>   - Typically, faster devices are assigned to metadata chunk allocations
>>   while slower devices are assigned to data chunk allocations.
>>
>> Introducing Device Roles:
>>
>>   Here I define 5 device roles in a specific order for metadata and in 
>> the
>>   reverse order for data: metadata_only, metadata, none, data, data_only.
>>   One or more devices may have the same role.
>>
>>   The metadata and data roles indicate preference but not exclusivity for
>>   that role, whereas data_only and metadata_only are exclusive roles.
> 
> As a BTRFS user I would like to comment a bit on this. I have earlier 
> mentioned that I think that BTRFS should allow for device groups. E.g. 
> assigning a storage device to one or more groups (or vice versa).
> 
> I really like what is being introduced here, but I would like to suggest 
> to take this a step further. Instead of assigning a role to the storage 
> device itself then maybe it would have been wiser to follow a scheme 
> like this:
> 
> DeviceID -> Group(s) -> Group properties
> 
> In this case what is being introduced here could easily be dealt with as 
> a simple group property like (meta)data_weight=0...128 for example.
> 
> Personally I think that would have been a much cleaner interface.
> 
> Setting a metadata/data roles as originally suggested here would be fine 
> on a low number of devices, but on larger storage arrays with many 
> devices it sounds (to me) like it would quickly become difficult to keep 
> track of.
> 
> With the scheme I suggest you would simply list the properties of a 
> group and see what DeviceID's that belong in that group... perhaps even 
> in a nice table if you where lucky.
> 
> (And just for the record: other properties I can from the top of my head 
> imagine that would be useful would be read/write weight that could 
> (automatically) be set higher and higher if a device starts to throw 
> errors, or group_exclusive=1|0 (to prevent other groups owning that 
> DeviceID etc... etc...)
> 
> And this would of course require another step after mkfs, but personally 
> I do not understand why setting these roles (or the scheme I suggest) 
> would be very useful at mkfs time. It might as well be done at first 
> mount before the filesystem gets put to use.
> 
> Great to see progress for BTRFS for things like this , but please do 
> consider another scheme for setting the roles.


Thanks for the feedback.

The question is: which approach handles large numbers of devices
better, Mode Groups or Direct Modes?

Let’s try to break it down.

Both approaches need to manage the following:

Five role types (preferences):
    metadata_only, metadata, none (any), data, data_only

Fault tolerance (FT) groups:
    2 to n device groups

Four allocation strategies:
    linear-devid, linear-priority, round-robin, free-space

Pros and Cons:
Direct Modes are simpler and work well for small setups. As things
scale, complexity grows, but scripts or tooling can manage that.

Mode Groups are better organized for large setups, but may be overkill
for small ones. They also require managing an extra btrfs key, which
adds some overhead.

Did I miss anything?

So far, I'm leaning toward Direct Modes. But if there's enough interest
in Mode Groups, we can explore that too. Alternatively, we could start
with Direct Modes and add Mode Groups later if needed.
Does that sound reasonable?

I’ve put up a draft, work-in-progress version of the proposal here:

   https://asj.github.io/chunk-alloc-enhancement.html

Thanks, Anand


* Re: [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles
  2025-05-30  0:15 ` Jani Partanen
@ 2025-06-02  4:25   ` Anand Jain
  0 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-06-02  4:25 UTC (permalink / raw)
  To: Jani Partanen, linux-btrfs

On 30/5/25 08:15, Jani Partanen wrote:
> On 12/05/2025 21.07, Anand Jain wrote:
>> In host hardware, devices can have different speeds. Generally, faster
>> devices come with lesser capacity while slower devices come with larger
>> capacity. A typical configuration would expect that:
>>
>>   - A filesystem's read/write performance is evenly distributed on 
>> average
>>   across the entire filesystem. This is not achievable with the current
>>   allocation method because chunks are allocated based only on device 
>> free
>>   space.
>>
>>   - Typically, faster devices are assigned to metadata chunk allocations
>>   while slower devices are assigned to data chunk allocations.
> 
> Now if this could be expanded to allow tagging fast drives as write- 
> cache, example I would add 256GB nvme drives or partitions as write- 
> cache so even with HDD's as main data storage, I would get very fast 
> writing.
> 
> Ofcourse cache would need some task to empty it after x time or it would 
> have not much use. This is currently issue with lvm caching. There is 2 
> type of caches, one is for read/write but when cache is full, it has not 
> much help for writes anymore because it filled with read cache. Another 
> is just write-cache.
>

Thanks for the feedback. A write-cache device is quite different from
data or metadata devices in the kernel. I’ll look into it once the chunk
allocation part is settled.

From a UI point of view, write-cache could be treated as another role.
With the current mkfs.btrfs device option scheme, we could add it as an
additional role alongside metadata and data.

Thanks, Anand


* Re: [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles
  2025-05-22  4:07 ` Zygo Blaxell
@ 2025-06-02  4:26   ` Anand Jain
  2025-06-21  1:11     ` Zygo Blaxell
  0 siblings, 1 reply; 44+ messages in thread
From: Anand Jain @ 2025-06-02  4:26 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs


Thanks for the detailed proposal; more below.

On 22/5/25 12:07, Zygo Blaxell wrote:
> On Tue, May 13, 2025 at 02:07:06AM +0800, Anand Jain wrote:
>> In host hardware, devices can have different speeds. Generally, faster
>> devices come with lesser capacity while slower devices come with larger
>> capacity. A typical configuration would expect that:
>>
>>   - A filesystem's read/write performance is evenly distributed on average
>>   across the entire filesystem. This is not achievable with the current
>>   allocation method because chunks are allocated based only on device free
>>   space.
>>
>>   - Typically, faster devices are assigned to metadata chunk allocations
>>   while slower devices are assigned to data chunk allocations.
>>
>> Introducing Device Roles:
>>
>>   Here I define 5 device roles in a specific order for metadata and in the
>>   reverse order for data: metadata_only, metadata, none, data, data_only.
>>   One or more devices may have the same role.
>>
>>   The metadata and data roles indicate preference but not exclusivity for
>>   that role, whereas data_only and metadata_only are exclusive roles.
> 
> Using role-based names like these presents three problems:
> 
> 1. **Stripe incompatibility** -- These roles imply a hierarchy that breaks
> in some multi-device arrays. e.g. with 5 devices of equal size and mixed
> roles ("data_only" vs "data"), it's impossible to form a 5-device-wide
> data chunk.
> 
Thanks for the feedback.

Details about the current proposal are here:

[1] https://asj.github.io/chunk-alloc-enhancement.html

Some allocation modes aren't compatible with certain block group
profiles. We'll need to check this at mkfs time and fail the command if
the number of devices is below the minimum required.

The role hierarchy (exclusive-> none-> non-exclusive) only applies when
there are more devices than required for a given block group profile and
the allocator has a choice of which devices to use.
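
As a rough illustration only (not taken from the patches), the
Role-then-Space ordering described in the cover letter could be
sketched as below. The tuple layout, the function name, and the
"more free space first" tie-break are assumptions made for the example.

```python
# Role preference order per allocation type, as listed in the cover letter.
ROLE_ORDER = {
    "metadata": ["metadata_only", "metadata", "none", "data"],
    "data": ["data_only", "data", "none", "metadata"],
}


def role_then_space(devices, alloc_type):
    """Return candidate devices sorted by role rank, then by free space.

    devices: list of (devid, role, free_bytes) tuples (hypothetical layout).
    Devices whose role is exclusive to the other type are left out.
    Assumption: within one role, the device with more free space comes first.
    """
    order = ROLE_ORDER[alloc_type]
    usable = [d for d in devices if d[1] in order]
    return sorted(usable, key=lambda d: (order.index(d[1]), -d[2]))


devices = [
    (1, "data_only", 900),
    (2, "metadata", 100),
    (3, "none", 500),
    (4, "metadata", 300),
    (5, "data", 700),
]
metadata_order = [d[0] for d in role_then_space(devices, "metadata")]
data_order = [d[0] for d in role_then_space(devices, "data")]
```

With these sample devices, a profile that needs fewer devices than are
available would take them from the front of each list.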

The use case for non-exclusive roles with striped profiles isn't very
practical, but the design allows for future extensions if needed.

> 2. **Poor extensibility** -- The role system doesn't scale when
> introducing additional allocation types. Any new category (e.g. PPL or
> journal) would require duplicating preference permutations like "data,
> then journal, then metadata" vs "journal, then data, then metadata",
> resulting in combinatorial explosion.

Special devices like journal or write-cache are different; they are
separate from the data and metadata storage devices. We will still hit
ENOSPC even if the journal device is empty.

That said, it is still possible to specify write-cache as a role. For
example:

	mkfs.btrfs /dev/sdx:write-cache ...

I'm not sure I understood what you meant by "not extensible"?

Also, allocation modes (for example, FREE_SPACE, ROLE, LINEAR,
ROUND_ROBIN) are designed to be composable as needed.

If roles do not cover a specific use case, the existing alloc_priority
(1 to 255) and alloc_mode can be extended to support new logic.

Note: LINEAR and ROUND_ROBIN are not implemented yet.

> 3. **Misleading terminology** -- The name "none" is used in a misleading
> way.  That name should be reserved for the case that prohibits all new
> chunk allocations--a critical use case for array reshaping. A clearer
> term would be "default," but the scheme would be even clearer if all
> the legacy role names were discarded.
> 

Got it, I'll rename none to default.

"None" is internal to the kernel and means no particular role
preference. It currently falls into the middle tier (41 to 80) of
alloc_priority, but we could adjust that to something more meaningful if
needed.

> I suggest replacing roles with a pair of orthogonal properties per device
> for each allocation type:
> 
> * Per-type tier level -- A simple u8 tier number that expresses allocation
> preference. Allocators attempt to satisfy allocation using devices at
> the lowest available tier, expanding the set to higher tiers as needed
> until the minimum number of devices is reached.

This is the same as alloc_priority stored in dev_item::type:8.
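
A minimal sketch of what "dev_item::type:8" could mean in practice,
assuming the 8-bit alloc_priority sits in the low byte of the 64-bit
type field. The bit position, the mask name, and the helper names are
assumptions for illustration, not the patch's actual layout.

```python
ALLOC_PRIORITY_MASK = 0xFF  # assumed: priority occupies the low 8 bits
U64_MASK = 0xFFFFFFFFFFFFFFFF


def set_alloc_priority(dev_type, priority):
    """Return dev_type with its low byte replaced by priority (0..255)."""
    if not 0 <= priority <= 0xFF:
        raise ValueError("alloc_priority must fit in 8 bits")
    return (dev_type & ~ALLOC_PRIORITY_MASK & U64_MASK) | priority


def get_alloc_priority(dev_type):
    """Extract the 8-bit priority from the 64-bit type value."""
    return dev_type & ALLOC_PRIORITY_MASK


# 60 lies in the 41..80 band that the thread mentions for "none".
dev_type = set_alloc_priority(0xABCD00, 60)
```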

> * Per-type enable bit -- Indicates whether the device allows allocations
> of that type at all. This can be stored explicitly, or encoded using a
> reserved tier value (e.g. 0xFF = disabled).

The device type can refer to a special device (like write-cache) or a
regular data/metadata device. Within a data/metadata device, the role,
whether for data or metadata, can still be represented using the current
*_only, *, or default/any roles. So this approach remains compatible.

> Encoding this way makes "0" a reasonable default value for each field.
> 
> Then you get all of the required combinations, e.g.
> 

I've added the current proposal's equivalent below each case.

> * metadata 0, data 0 - what btrfs does now, equal preference

  role=< > no role | default

> 
> * metadata 2, data 1 - metadata preferred, data allowed
> 
  role=metadata

> * metadata 1, data 2 - data preferred, metadata allowed

  role=data

> * metadata 0, data 255 - metadata only, no data

  role=metadata_only

> * metadata 255, data 0 - data only, no metadata

  role=data_only

> * metadata 255, data 255 - no new chunk allocations

  Flag it read-only.

> This model offers cleaner semantics and more robust scaling:
> 
> * It eliminates unintended allocation spillover. A device either allows
> data/metadata, or it doesn't.
> * It expresses preference via explicit tiering rather than role overlap.
> * It generalizes easily to future allocation types without rewriting
> role logic.
> 

> "Allow nothing" is an important case for reshaping arrays.  If you are
> upgrading 4 out of 12 disks in a striped raid filesystem, you don't
> want to rewrite all the data in the filesystem 4 times.  Instead, set
> the devices you want to remove to "allow nothing", run a balance with a
> `devid` filter targeting each device to evacuate the data, and then run
> device delete on the 4 empty drives.

We can do the same by setting the device read-only.


I actually started with the idea of using bitmap flags, since it's more
straightforward. But I eventually leaned toward using an Allocation
Priority list to allow for a manual priority order within roles or
tiers, if needed in the future. That flexibility pushed me in that
direction.

You can find more details about the current Allocation Priority list
here:

	https://asj.github.io/chunk-alloc-enhancement.html

That said, we could store the mode in a separate btrfs-key and keep the
manual priority in dev_item::type, which would give us both.
But as always, we try to avoid new on-disk keys unless absolutely
necessary.

>> Introducing Role-then-Space allocation method:
>>
>>   Metadata allocation can happen on devices with the roles metadata_only,
>>   metadata, none, and data in that order. If multiple devices share a role,
>>   they are arranged based on device free space.
>>
>>   Similarly, data allocation can happen on devices with the roles data_only,
>>   data, none, and metadata in that order. If multiple devices share a role,
>>   they are arranged based on device free space.
>>
>> Finding device speed automatically:
>>
>>   Measuring device read/write latency for allocation is not a good idea,
>>   as historical readings may be misleading: they could include
>>   iostat data from periods with issues that have since been fixed. Testing
>>   to determine relative latency and arranging in ascending order for metadata
>>   and descending for data is possible, but is better handled by an external
>>   tool that can still set device roles.
>>
>> On-Disk Format changes:
>>
>>   The following items are defined but are unused on-disk format:
>>
>> 	btrfs_dev_item::
>> 	 __le64 type; // unused
>> 	 __le64 start_offset; // unused
>> 	 __le32 dev_group; // unused
>> 	 __u8 seek_speed; // unused
>> 	 __u8 bandwidth; // unused
>>
>>   The device roles is using the dev_item::type 8-bit field to store each
>>   device's role.
> 
> In the other implementations of this idea, allocation roles are stored in
> `dev_item::type`, a single `u8` field, for simplicity; however, it would
> be better to store these roles in the filesystem tree--e.g. using a
> `BTRFS_PERSISTENT_ITEM_KEY` with a dedicated objectid for allocation
> roles, and offset values corresponding to device IDs. This would enable
> versioning of the schema and flexible extension (e.g., to add migration
> policies, size-based allocation preferences, or other enhancements).
> 
> Since btrfs loads the trees before allocation can occur, tree-based
> role data will be available in time for allocation, and we don't need
> to store roles in the superblocks.
> 
> A longer version of this with use cases and some discussion is available
> here:
> 
> 	https://github.com/kakra/linux/pull/36#issuecomment-2784251968
> 
> 	https://github.com/kakra/linux/pull/36#issuecomment-2784434490
> 

dev_item::type (u64) comes from the reserved field list, so there's
no additional space overhead in using it. I considered whether using a
btrfs_key for roles would offer any advantage over dev_item::type,
but I couldn't find a clear benefit.

Also, with alloc_priority + alloc_mode, we can support a manual device
order with the same cost.

Let me reconsider what you proposed to see if there’s any advantage to
doing it that way.

Good discussion, thanks a lot.

-Anand

>> Anand Jain (10):
>>    btrfs: fix thresh scope in should_alloc_chunk()
>>    btrfs: refactor should_alloc_chunk() arg type
>>    btrfs: introduce btrfs_split_sysfs_arg() for argument parsing
>>    btrfs: introduce device allocation method
>>    btrfs: sysfs: show device allocation method
>>    btrfs: skip device sorting when only one device is present
>>    btrfs: refactor chunk allocation device handling to use list_head
>>    btrfs: introduce explicit device roles for block groups
>>    btrfs: introduce ROLE_THEN_SPACE device allocation method
>>    btrfs: pass device roles through device add ioctl
>>
>>   fs/btrfs/block-group.c |  11 +-
>>   fs/btrfs/ioctl.c       |  12 +-
>>   fs/btrfs/sysfs.c       | 130 ++++++++++++++++++++--
>>   fs/btrfs/volumes.c     | 242 +++++++++++++++++++++++++++++++++--------
>>   fs/btrfs/volumes.h     |  35 +++++-
>>   5 files changed, 366 insertions(+), 64 deletions(-)
>>
>> -- 
>> 2.49.0
>>
>>
> 



* Re: [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles
  2025-06-02  4:24   ` Anand Jain
@ 2025-06-04 21:29     ` Ferry Toth
  2025-06-04 21:48       ` Anand Jain
  0 siblings, 1 reply; 44+ messages in thread
From: Ferry Toth @ 2025-06-04 21:29 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs

Hi,

Op 02-06-2025 om 06:24 schreef Anand Jain:
> On 23/5/25 04:39, Ferry Toth wrote:
>> Hi,
>>
>> Op 12-05-2025 om 20:07 schreef Anand Jain:
>>> In host hardware, devices can have different speeds. Generally, faster
>>> devices come with lesser capacity while slower devices come with larger
>>> capacity. A typical configuration would expect that:
>>>
>>>   - A filesystem's read/write performance is evenly distributed on 
>>> average
>>>   across the entire filesystem. This is not achievable with the current
>>>   allocation method because chunks are allocated based only on device 
>>> free
>>>   space.
>>>
>>>   - Typically, faster devices are assigned to metadata chunk allocations
>>>   while slower devices are assigned to data chunk allocations.
>>
>> Finally a new effort in this direction.
>>
>>> Introducing Device Roles:
>>>
>>>   Here I define 5 device roles in a specific order for metadata and 
>>> in the
>>>   reverse order for data: metadata_only, metadata, none, data, 
>>> data_only.
>>>   One or more devices may have the same role.
>>>
>>>   The metadata and data roles indicate preference but not exclusivity 
>>> for
>>>   that role, whereas data_only and metadata_only are exclusive roles.
>>>
>>> Introducing Role-then-Space allocation method:
>>>
>>>   Metadata allocation can happen on devices with the roles 
>>> metadata_only,
>>>   metadata, none, and data in that order. If multiple devices share a 
>>> role,
>>>   they are arranged based on device free space.
>>>
>>>   Similarly, data allocation can happen on devices with the roles 
>>> data_only,
>>>   data, none, and metadata in that order. If multiple devices share a 
>>> role,
>>>   they are arranged based on device free space.
>>
>> I can see the use case for large pools of disks used in server 
>> environments where disks get assigned a role.
>>
>> For desktop use I would like it a lot better with no roles, just a 
>> performance-based chunk allocation to select between a ssd and a hdd. 
>> And then used more like a hint to the allocator. Really nothing should 
>> go wrong if a data or meta-data gets allocated on the wrong / sub- 
>> optimal disk.
>>
>> This could then bring back the old hot relocation idea, finally.
>>
>>> Finding device speed automatically:
>>>
>>>   Measuring device read/write latency for allocation is not a good idea,
>>>   as historical readings may be misleading: they could include
>>>   iostat data from periods with issues that have since been fixed. 
>>> Testing
>>>   to determine relative latency and arranging in ascending order for 
>>> metadata
>>>   and descending for data is possible, but is better handled by an 
>>> external
>>>   tool that can still set device roles.
>>>
>>> On-Disk Format changes:
>>>
>>>   The following items are defined but are unused on-disk format:
>>>
>>>     btrfs_dev_item::
>>>      __le64 type; // unused
>>>      __le64 start_offset; // unused
>>>      __le32 dev_group; // unused
>>>      __u8 seek_speed; // unused
>>>      __u8 bandwidth; // unused
>>>
>>>   The device roles is using the dev_item::type 8-bit field to store each
>>>   device's role.
>>
>> I think filling the fields with either measured or user entered data 
>> should be fine, as long as when the disk behavior changes you can re- 
>> measure or re-enter.
>>
>> The difference between a ssd and a hdd will be so huge small changes 
>> will have no real effect.
> 
> 
> Yeah, for desktop setups with SSDs and HDDs, the distinction is clear
> and stable, so assigning data or metadata based on device type makes
> sense. It’s straightforward to handle statically, and a
> --set-roles-by-type mkfs option will make it automatic.
> 
> Even if the SSD temporarily slows down during a balance, we’d still
> prefer to keep metadata on it, assuming the slowdown is short-lived.
> SSD performance typically recovers, so there's no need to overreact
> to transient dips.
> 
> For virtual devices, mkfs --set-roles-by-iostat should also work well.
> And later if performance characteristics change permanently, a
> balance-time option like --recalibrate-role-by-iostat could
> re-evaluate based on I/O stats, confirm with the user, and relocate
> chunks accordingly.
> 
> Also, I'm trying not to introduce too many options or configuration
> paths, just enough to keep Btrfs simple to use.
> 
> Does that sound reasonable?

That sounds very good.

I am curious what happens when the fast device fills up: what will the 
allocator do? I guess it will fall back to allocating on the slow device?

If so, we're going to need some periodic or just-in-time mechanism to 
move files that have not been written or read for a long time to the 
slow disk.

That file may also be referenced from multiple subvolumes, and you 
wouldn't want those references duplicated (as happens with defragmenting).

> Thanks, Anand
> 
>>> Anand Jain (10):
>>>    btrfs: fix thresh scope in should_alloc_chunk()
>>>    btrfs: refactor should_alloc_chunk() arg type
>>>    btrfs: introduce btrfs_split_sysfs_arg() for argument parsing
>>>    btrfs: introduce device allocation method
>>>    btrfs: sysfs: show device allocation method
>>>    btrfs: skip device sorting when only one device is present
>>>    btrfs: refactor chunk allocation device handling to use list_head
>>>    btrfs: introduce explicit device roles for block groups
>>>    btrfs: introduce ROLE_THEN_SPACE device allocation method
>>>    btrfs: pass device roles through device add ioctl
>>>
>>>   fs/btrfs/block-group.c |  11 +-
>>>   fs/btrfs/ioctl.c       |  12 +-
>>>   fs/btrfs/sysfs.c       | 130 ++++++++++++++++++++--
>>>   fs/btrfs/volumes.c     | 242 +++++++++++++++++++++++++++++++++--------
>>>   fs/btrfs/volumes.h     |  35 +++++-
>>>   5 files changed, 366 insertions(+), 64 deletions(-)
>>>
>>
> 



* Re: [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles
  2025-06-04 21:29     ` Ferry Toth
@ 2025-06-04 21:48       ` Anand Jain
  0 siblings, 0 replies; 44+ messages in thread
From: Anand Jain @ 2025-06-04 21:48 UTC (permalink / raw)
  To: Ferry Toth, linux-btrfs


>>
>> Yeah, for desktop setups with SSDs and HDDs, the distinction is clear
>> and stable, so assigning data or metadata based on device type makes
>> sense. It’s straightforward to handle statically, and a
>> --set-roles-by-type mkfs option will make it automatic.
>>
>> Even if the SSD temporarily slows down during a balance, we’d still
>> prefer to keep metadata on it, assuming the slowdown is short-lived.
>> SSD performance typically recovers, so there's no need to overreact
>> to transient dips.
>>
>> For virtual devices, mkfs --set-roles-by-iostat should also work well.
>> And later if performance characteristics change permanently, a
>> balance-time option like --recalibrate-role-by-iostat could
>> re-evaluate based on I/O stats, confirm with the user, and relocate
>> chunks accordingly.
>>
>> Also, I'm trying not to introduce too many options or configuration
>> paths, just enough to keep Btrfs simple to use.
>>
>> Does that sound reasonable?
> 
> That sounds very good.
> 
> I  am curious what happens when the fast device fills up, what will the 
> allocator do? I guess it will fall back to allocating to the slow device?
> 
> If so, we're going to need some periodic or just in time "move files 
> that have for a long time not been written / read" to the slow disk.
> 
> While that file my be referenced from multiple subvolumes, and you 
> wouldn't want those duped (like happens with defragmenting).

Currently, if the devices are non-exclusive, the allocator will fall
back to the other device type: slower devices for metadata, or faster
devices for data. However, if they are marked as exclusive, allocation
will fail with ENOSPC.
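
The fallback behaviour described here can be sketched roughly as
follows. This is a simplified model, not the kernel code: it places a
single-device chunk, and the dict layout and the pick_device() name are
assumptions made for the example.

```python
import errno

# Roles that may serve each allocation type, most preferred first.
ROLE_ORDER = {
    "metadata": ["metadata_only", "metadata", "none", "data"],
    "data": ["data_only", "data", "none", "metadata"],
}


def pick_device(devices, alloc_type, chunk_size):
    """Return the devid chosen for one chunk, or raise ENOSPC.

    devices: dict devid -> (role, free_bytes), a hypothetical layout.
    A device is eligible if its role permits alloc_type and it has room.
    """
    order = ROLE_ORDER[alloc_type]
    candidates = [
        (order.index(role), -free, devid)
        for devid, (role, free) in devices.items()
        if role in order and free >= chunk_size
    ]
    if not candidates:
        raise OSError(errno.ENOSPC, "no eligible device has room for a chunk")
    # Best role rank first, then the most free space.
    return min(candidates)[2]


# Fast device (devid 1) is full in both setups.
non_exclusive = {1: ("metadata", 0), 2: ("data", 10)}
exclusive = {1: ("metadata_only", 0), 2: ("data_only", 10)}
```

With `non_exclusive`, a metadata chunk lands on devid 2 once devid 1 is
full; with `exclusive`, the same request raises ENOSPC.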

Dynamic rebalancing based on chunk usage isn’t part of this patch set.
That would probably require a heuristic to make smart relocation
decisions. We can look into it further and potentially provide it as a
separate mode.

Thanks, Anand


* Re: [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles
  2025-06-02  4:25   ` Anand Jain
@ 2025-06-06 14:21     ` waxhead
  0 siblings, 0 replies; 44+ messages in thread
From: waxhead @ 2025-06-06 14:21 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs

Anand Jain wrote:
> On 23/5/25 02:19, waxhead wrote:
>> Anand Jain wrote:
>>> In host hardware, devices can have different speeds. Generally, faster
>>> devices come with lesser capacity while slower devices come with larger
>>> capacity. A typical configuration would expect that:
>>>
>>>   - A filesystem's read/write performance is evenly distributed on 
>>> average
>>>   across the entire filesystem. This is not achievable with the current
>>>   allocation method because chunks are allocated based only on device 
>>> free
>>>   space.
>>>
>>>   - Typically, faster devices are assigned to metadata chunk allocations
>>>   while slower devices are assigned to data chunk allocations.
>>>
>>> Introducing Device Roles:
>>>
>>>   Here I define 5 device roles in a specific order for metadata and 
>>> in the
>>>   reverse order for data: metadata_only, metadata, none, data, 
>>> data_only.
>>>   One or more devices may have the same role.
>>>
>>>   The metadata and data roles indicate preference but not exclusivity 
>>> for
>>>   that role, whereas data_only and metadata_only are exclusive roles.
>>
>> As a BTRFS user I would like to comment a bit on this. I have earlier 
>> mentioned that I think that BTRFS should allow for device groups. E.g. 
>> assigning a storage device to one or more groups (or vice versa).
>>
>> I really like what is being introduced here, but I would like to 
>> suggest to take this a step further. Instead of assigning a role to 
>> the storage device itself then maybe it would have been wiser to 
>> follow a scheme like this:
>>
>> DeviceID -> Group(s) -> Group properties
>>
>> In this case what is being introduced here could easily be dealt with 
>> as a simple group property like (meta)data_weight=0...128 for example.
>>
>> Personally I think that would have been a much cleaner interface.
>>
>> Setting a metadata/data roles as originally suggested here would be 
>> fine on a low number of devices, but on larger storage arrays with 
>> many devices it sounds (to me) like it would quickly become difficult 
>> to keep track of.
>>
>> With the scheme I suggest you would simply list the properties of a 
>> group and see what DeviceID's that belong in that group... perhaps 
>> even in a nice table if you where lucky.
>>
>> (And just for the record: other properties I can from the top of my 
>> head imagine that would be useful would be read/write weight that 
>> could (automatically) be set higher and higher if a device starts to 
>> throw errors, or group_exclusive=1|0 (to prevent other groups owning 
>> that DeviceID etc... etc...)
>>
>> And this would of course require another step after mkfs, but 
>> personally I do not understand why setting these roles (or the scheme 
>> I suggest) would be very useful at mkfs time. It might as well be done 
>> at first mount before the filesystem gets put to use.
>>
>> Great to see progress for BTRFS for things like this , but please do 
>> consider another scheme for setting the roles.
> 
> 
> Thanks for the feedback.
> 
> The question is: which approach handles large numbers of devices
> better, Mode Groups or Direct Modes?
> 
> Let’s try to break it down.
> 
> Both approaches need to manage the following:
> 
> Five role types (preferences):
>     metadata_only, metadata, none (any), data, data_only
> 
> Fault tolerance (FT) groups:
>     2 to n device groups
> 
> Four allocation strategies:
>     linear-devid, linear-priority, round-robin, free-space
> 
> Pros and Cons:
> Direct Modes are simpler and work well for small setups. As things
> scale, complexity grows, but scripts or tooling can manage that.
> 
> Mode Groups are better organized for large setups, but may be overkill
> for small ones. They also require managing an extra btrfs key, which
> adds some overhead.
> 
> Did I miss anything?
> 
Not really, you are spot on as far as I am concerned, but if you 
acknowledge that the direct approach is really not very suitable for 
large arrays without additional scripts or extra tooling anyway, I would 
personally lean towards the "overkill" approach myself.

And as a user I especially lean towards something that I can count on 
to work out of the box, without scripts and/or tooling that may lag 
behind if something changes.

There may be debugging benefits as well by having the configuration 
stored in metadata on the filesystem instead of scripts and tools that 
may be located on another filesystem.

> So far, I'm leaning toward Direct Modes. But if there's enough interest
> in Mode Groups, we can explore that too. Alternatively, we could start
> with Direct Modes and add Mode Groups later if needed.
> Does that sound reasonable?
> 
I agree that it sounds reasonable, but it is also a matter of taste and 
I am obviously biased. Admittedly, I would personally really like to see 
this feature on my filesystem, preferably today, right now! But I can't 
help thinking that the "overkill" solution would be the cleaner and, 
above all, the less messy solution in the long run. I hope you are 
willing to rethink this a bit. In any case, thanks for working on a 
great feature.

> I’ve put up a draft work in progress version of the proposal here:
> 
>    https://asj.github.io/chunk-alloc-enhancement.html
> 
> Thanks, Anand

Brilliant, thanks!


* Re: [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation
  2025-05-12 18:09 ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation Anand Jain
                     ` (13 preceding siblings ...)
  2025-05-12 18:09   ` [PATCH 14/14] btrfs-progs: disable exclusive metadata/data device roles Anand Jain
@ 2025-06-20 16:46   ` David Sterba
  14 siblings, 0 replies; 44+ messages in thread
From: David Sterba @ 2025-06-20 16:46 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Tue, May 13, 2025 at 02:09:17AM +0800, Anand Jain wrote:
> Adds cleanup, fixes, and device role support to enable more efficient
> kernel chunk allocation based on device performance.
> 
> Anand Jain (14):
>   btrfs-progs: minor spelling correction in the list-chunk help text
>   btrfs-progs: refactor devid comparison function
>   btrfs-progs: rename local dev_list to devices in btrfs_alloc_chunk

I've taken the 3 patches to progs as they seem to be standalone fixes.
Thanks.


* Re: [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles
  2025-06-02  4:26   ` Anand Jain
@ 2025-06-21  1:11     ` Zygo Blaxell
  0 siblings, 0 replies; 44+ messages in thread
From: Zygo Blaxell @ 2025-06-21  1:11 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Mon, Jun 02, 2025 at 12:26:41PM +0800, Anand Jain wrote:
> 
> Thanks for the detailed proposal, more below..
> 
> On 22/5/25 12:07, Zygo Blaxell wrote:
> > On Tue, May 13, 2025 at 02:07:06AM +0800, Anand Jain wrote:
> > > In host hardware, devices can have different speeds. Generally, faster
> > > devices come with lesser capacity while slower devices come with larger
> > > capacity. [...]
> > Using role-based names like these presents three problems:
> > 
> > 1. **Stripe incompatibility** -- These roles imply a hierarchy that breaks
> > in some multi-device arrays. e.g. with 5 devices of equal size and mixed
> > roles ("data_only" vs "data"), it's impossible to form a 5-device-wide
> > data chunk.
> > 
> Thanks for the feedback.
> 
> Details about the current proposal are here:
> 
> [1] https://asj.github.io/chunk-alloc-enhancement.html
>
> Some allocation modes aren't compatible with certain block group
> profiles. We'll need to check this at mkfs time and fail the command if
> the number of devices is below the minimum required.
> 
> The role hierarchy (exclusive -> none -> non-exclusive) only applies when
> there are more devices than required for a given block group profile and
> the allocator has a choice of which devices to use.

I may be reading more into this than you intended, but this can lead
to some unpleasant surprises.  To ensure predictable behavior, the
allocator should _always_ select devices based on configured priority.
If no devices meet the configured requirements, the chunk allocator
should return ENOSPC immediately, rather than silently falling back to
something not explicitly permitted.

We should never allocate metadata on slower devices simply because faster
ones are full.  That must be _explicitly_ allowed by the configuration.
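
A minimal sketch of that rule, under stated assumptions: struct dev_info,
PRIO_DISABLED and check_chunk_alloc() are hypothetical names invented for
illustration, not existing btrfs code, and 0xFF as the "not permitted"
encoding is only one possible choice.  The point is that a full fast
device plus a non-permitted slow device yields ENOSPC, never a silent
fallback.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define PRIO_DISABLED 0xFF	/* assumed encoding: type not permitted here */

/* Hypothetical per-device summary for one allocation type. */
struct dev_info {
	uint8_t  prio;		/* priority for this allocation type */
	uint64_t free_bytes;	/* unallocated space on the device */
};

/*
 * Return 0 if at least min_devs devices both permit this allocation
 * type and have room for one stripe, otherwise -ENOSPC.  There is no
 * fallback to devices whose priority is PRIO_DISABLED.
 */
static int check_chunk_alloc(const struct dev_info *devs, size_t ndevs,
			     uint64_t stripe_bytes, size_t min_devs)
{
	size_t usable = 0;

	for (size_t i = 0; i < ndevs; i++) {
		if (devs[i].prio == PRIO_DISABLED)
			continue;	/* not explicitly permitted */
		if (devs[i].free_bytes < stripe_bytes)
			continue;	/* permitted, but full */
		usable++;
	}
	return usable >= min_devs ? 0 : -ENOSPC;
}
```

Allowing metadata on the slow device is then a configuration change
(give it any priority other than PRIO_DISABLED), not allocator policy.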

> The use case for non-exclusive roles with striped profiles isn't very
> practical, but the design allows for future extensions if needed.
> 
> > 2. **Poor extensibility** -- The role system doesn't scale when
> > introducing additional allocation types. Any new category (e.g. PPL or
> > journal) would require duplicating preference permutations like "data,
> > then journal, then metadata" vs "journal, then data, then metadata",
> > resulting in combinatorial explosion.
> 
> Special devices like journal or write-cache are different; they are
> separate from the data and metadata storage devices. We will still hit
> ENOSPC even if the journal device is empty.
> 
> That said, it is still possible to specify write-cache as a role. For
> example:
> 
> 	mkfs.btrfs /dev/sdx:write-cache ...
> 
> I'm not sure I understood what you meant by "not extensible"?

There are some interesting proposals based on the allocation preferences
patches from 2020.  We might want to hack up the extent allocator so that
extents <= 128K are sent to SSD, while larger extents are sent to HDD.
A maintenance process could run a defrag-like operation periodically
to relocate cold data to slow devices by combining small extents (which
prefer SSD) into large extents (which prefer HDD).

In that case, we'd need at least two preference levels for data block
groups, in addition to a separate preference level for metadata (i.e.
a total of 3 alloc_priority fields per device:  small_data, large_data,
and metadata).
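
As a rough illustration only: the enum, struct dev_prio and both helpers
below are hypothetical, and the only value taken from the text is the
128K threshold.  The sketch assumes a lower priority value means "use
this device first".

```c
#include <stdint.h>

#define SMALL_EXTENT_MAX (128 * 1024)	/* threshold from the text above */

/* Hypothetical allocation types, one priority field per type. */
enum alloc_type {
	ALLOC_SMALL_DATA,
	ALLOC_LARGE_DATA,
	ALLOC_METADATA,
	ALLOC_NR_TYPES
};

/* Hypothetical per-device priorities; lower value is preferred. */
struct dev_prio {
	uint8_t prio[ALLOC_NR_TYPES];
};

/* Classify a data extent by size. */
static enum alloc_type data_alloc_type(uint64_t extent_bytes)
{
	return extent_bytes <= SMALL_EXTENT_MAX ? ALLOC_SMALL_DATA
						: ALLOC_LARGE_DATA;
}

/*
 * Return the index of the device with the lowest priority value for
 * the given type (first one wins on a tie), or -1 if there are none.
 */
static int preferred_dev(const struct dev_prio *devs, int ndevs,
			 enum alloc_type type)
{
	int best = -1;

	for (int i = 0; i < ndevs; i++) {
		if (best < 0 || devs[i].prio[type] < devs[best].prio[type])
			best = i;
	}
	return best;
}
```

With an SSD configured as { small_data 0, large_data 1, metadata 0 } and
an HDD as { 1, 0, 1 }, a 4K extent picks the SSD and a 1M extent picks
the HDD.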

> Also, allocation modes (for example, FREE_SPACE, ROLE, LINEAR,
> ROUND_ROBIN) are designed to be composable as needed.

In that case, why bother with ROLE, when PRIORITY (with one priority
value per role) can express a functional superset?

> If roles do not cover a specific use case, the existing alloc_priority
> (1 to 255) and alloc_mode can be extended to support new logic.
> 
> Note: LINEAR and ROUND_ROBIN are not implemented yet.
> 
> > 3. **Misleading terminology** -- The name "none" is used in a misleading
> > way.  That name should be reserved for the case that prohibits all new
> > chunk allocations--a critical use case for array reshaping. A clearer
> > term would be "default," but the scheme would be even clearer if all
> > the legacy role names were discarded.
> 
> Got it, I'll rename none to default.
> 
> "None" is internal to the kernel and means no particular role
> preference. It currently falls into the middle tier (41 to 80) of
> alloc_priority, but we could adjust that to something more meaningful if
> needed.

This "default" value was present in earlier versions of the allocation
preferences patch set from 2020.  It was killed because its semantics were
confusing--users had to read the doc to understand what it did, then asked
questions about why there are two options that are equivalent to "data
preferred, metadata allowed", but behave differently in a filesystem
because they have different numeric values in the device sort.

In other words, the problem is not just the name--it's the _concept_ of
"default" being a distinct value, as opposed to an alias for one of the
non-default values.  There's no way a device can have a "default" or
"other" role in the presence of any device with a non-default role--a
device that merely participates in allocation potentially modifies the
result of every allocation.

Instead, we picked "data preferred" (which is "role=data" in your
proposal) as the default.  This compromise achieves two key goals:

 * not putting metadata on slow drives when fast drives exist, and
 * not running out of metadata space, by allowing metadata allocation
   on data devices as a last resort.

That said, there could be a distinct on-disk encoding for "default"
or "unspecified", as long as it maps exactly to one of the explicit
choices at runtime, i.e. it must not have a distinct numeric value in the
ordering.  I think the point is moot, though, since there's no need to put
role on disk at all, and alloc_priority can simply default to zero.

> > I suggest replacing roles with a pair of orthogonal properties per device
> > for each allocation type:
> > 
> > * Per-type tier level -- A simple u8 tier number that expresses allocation
> > preference. Allocators attempt to satisfy allocation using devices at
> > the lowest available tier, expanding the set to higher tiers as needed
> > until the minimum number of devices is reached.
> 
> This is the same as alloc_priority stored in dev_item::type:8.

To clarify, I propose independent priorities for each allocation type on a
device.  For example, a device would maintain separate priority values for
data and metadata, and future extensions like write_cache, small_extent, etc.
(For the purpose of this discussion, system is part of metadata.)

struct btrfs_dev_item {
        ...
        union {
                __le64 type;                 /* unused */
                struct {
                        /* little-endian: the first byte is bits 0 - 7 */
                        __u8 prio_data;      /* bits 0 - 7: data priority */
                        __u8 prio_metadata;  /* bits 8 - 15: metadata priority */
                        __u8 reserved[6];    /* bits 16 - 63: reserved */
                };
        };
        __le32 dev_group;  /* FT device groups */
        ...
};

These per-type priorities group devices into tiers for each type.
For different allocation types, devices may be arranged into different
groups.  Within a tier, devices can then be ordered or filtered--using
round-robin, linear placement, FT-domain, or other placement policies--to
satisfy the allocation requirements.
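
For illustration, a sketch of accessors for those two fields on the
CPU-order value of the 64-bit type field (i.e. after le64_to_cpu() in
kernel code).  The macro and function names are made up here; the bit
positions follow the comments in the struct above.

```c
#include <stdint.h>

/* Bit positions as described above; names are hypothetical. */
#define PRIO_DATA_SHIFT		0	/* bits 0 - 7 */
#define PRIO_METADATA_SHIFT	8	/* bits 8 - 15 */
#define PRIO_MASK		0xFFULL

/* Extract one 8-bit priority from the CPU-order type value. */
static uint8_t type_get_prio(uint64_t type, unsigned int shift)
{
	return (uint8_t)((type >> shift) & PRIO_MASK);
}

/* Return type with the 8-bit priority at shift replaced by prio. */
static uint64_t type_set_prio(uint64_t type, unsigned int shift,
			      uint8_t prio)
{
	type &= ~(PRIO_MASK << shift);
	return type | ((uint64_t)prio << shift);
}
```

A type value of zero decodes to priority 0 for both data and metadata,
which matches the "all devices have priority zero" legacy case.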

In other words, _devices_ shouldn't have roles--_allocations_ do.

While I'm here...I'm looking at the other alloc_mode bits.  Do we need
any of them?

 * FREE_SPACE: legacy.  Don't need a bit for this--it's the absence of
   all other bits.

 * ROLE: honor role bits.  Don't need this because we can do it better
   by making "role" an attribute of the allocation, and use priority for
   device selection by role.

 * PRIORITY: use raw alloc_priority.  Don't need this bit, because we'll
   always use priority.  The default is zero, and "all devices have
   priority zero" gives current legacy behavior.

 * FT_GROUP: use dev_group for fault domains.  Don't need this because
   it's equivalent to "every dev_group on the filesystem != 0".

The above eliminates all of the bits except:

 * LINEAR: sequential allocation.
 * ROUND-ROBIN: pick the next device.

How do those work if some devices have these bits set and some are cleared?

It seems to me that it would be better to put LINEAR and ROUND-ROBIN in
a btrfs item, so there's only one item on disk, which describes how
allocations work on disks of the same priority.

> > * Per-type enable bit -- Indicates whether the device allows allocations
> > of that type at all. This can be stored explicitly, or encoded using a
> > reserved tier value (e.g. 0xFF = disabled).
> 
> The device type can refer to a special device (like write-cache) or a
> regular data/metadata device. Within a data/metadata device, the role,
> whether for data or metadata, can still be represented using the current
> *_only, *, or default/any roles. So this approach remains compatible.

This reflects different mental models.  Your approach treats a device's
role as _exclusive_:  if it holds block groups for one role, it cannot be
used for another.  A device can hold metadata but cannot hold write_cache.
As a result, we need special cases like "preferred data with metadata
allowed" because we can't have a single-device filesystem without at
least one mixed role.

In contrast, my model is _inclusive_:  each device can support any
permitted block group type, provided that it assigns a priority to
that type.  Per-type priorities then partition devices into tiers,
letting a device handle data, metadata, write_cache--or any combination
thereof--seamlessly, scaling to 2^N configurations as new roles emerge.

If the user wants exclusive device roles, like "journal only" or "write
cache only", then they can simply set the priorities so that only one
type of chunk is allowed on the device.

> > Encoding this way makes "0" a reasonable default value for each field.
> > 
> > Then you get all of the required combinations, e.g.
> > 
> 
> Added below the current proposal.
> 
> > * metadata 0, data 0 - what btrfs does now, equal preference
> 
>  role=< > no role | default
> 
> > * metadata 2, data 1 - metadata preferred, data allowed
> > 
>  role=metadata
> 
> > * metadata 1, data 2 - data preferred, metadata allowed
> 
>  role=data
> 
> > * metadata 0, data 255 - metadata only, no data
> 
>  role=metadata_only
> 
> > * metadata 255, data 0 - data only, no metadata
> 
>  role=data_only

Fair point--this example didn't show anything that can't be done with
pure role-based allocations.  Try this with 5 equal-size drives:

 * Device 1:  metadata preference 100, data preference 100
 * Device 2:  metadata preference 100, data preference 100
 * Device 3:  metadata preference 100, data preference 100
 * Device 4:  metadata preference 200, data preference 100
 * Device 5:  metadata preference 200, data preference 100

then put -draid5 -mraid1 on it.

When the sorting for data is "data_only, data, metadata, metadata_only",
and the sorting for metadata is the opposite, it's not possible to get
a 5-device-wide data chunk.

Even with the PRIORITY bit overriding the sort order, each device has only
one priority.  We can solve the above by setting the role for devices
1-3 to 'data_only' and 4-5 to 'data', but we can't solve this for 7
devices with 4 metadata drives when there are two distinct preferences
for the metadata devices.

There need to be two _distinct_ priority values on each device to make
this work.
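
A small sketch of the tier expansion applied to the 5-drive example
above.  candidate_set() and PRIO_DISABLED are hypothetical; the sketch
ignores free space and assumes tiers are tried from the lowest priority
value upward, adding whole tiers until min_devs devices are available.

```c
#include <stddef.h>
#include <stdint.h>

#define PRIO_DISABLED 0xFF	/* assumed encoding: never allocate here */

/*
 * Count the devices in the candidate set for one allocation type.
 * Tiers are added in ascending priority value until at least min_devs
 * devices are present.  Returns 0 if that cannot be achieved.
 */
static size_t candidate_set(const uint8_t *prio, size_t ndevs,
			    size_t min_devs)
{
	size_t count = 0;

	for (unsigned int tier = 0; tier < PRIO_DISABLED; tier++) {
		for (size_t i = 0; i < ndevs; i++) {
			if (prio[i] == tier)
				count++;
		}
		if (count > 0 && count >= min_devs)
			return count;
	}
	return 0;
}
```

With data priorities { 100, 100, 100, 100, 100 } the data candidate set
is all 5 devices, so a 5-wide raid5 chunk is possible.  With metadata
priorities { 100, 100, 100, 200, 200 } and min_devs = 2 (an illustrative
value), the metadata candidate set is devices 1-3 only; devices 4-5 join
only if the caller needs more than 3 devices.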

> > * metadata 255, data 255 - no new chunk allocations
> 
>  Flag it read-only.

"Read only" is another misleading name.  Allocations and writes must
still be allowed in existing block groups on these devices.  We are only
preventing the allocation of new block groups.  "None" is a better name
for this, or "no_alloc" or even "no_new".

> [...]
> I actually started with the idea of using bitmap flags, since it's more
> straightforward. But I eventually leaned toward using an Allocation
> Priority list to allow for a manual priority order within roles or
> tiers, if needed in the future. That flexibility pushed me in that
> direction.

I went the other way:  from roles to bitmaps, then replaced the bitmaps
with role-specific priority levels to allow userspace full control over
device selection in chunk allocation.  We did this because the concept
of a device role with a single priority was too limiting in practice.

We also found that even with years of experience running with the patches
based on four roles, sysadmins still made errors trying to predict where
data would go.  The priority-driven system is much easier to understand:
data goes where it's allowed, metadata goes where it's allowed, and
when both are allowed, priority rules specify which devices are filled
first.  Devices can be reordered for allocation in a way similar to the
LINEAR mode you propose with priority alone.

Your proposal has some other interesting elements.  The linear and
round-robin modes would work well after sorting devices by FT group
and per-role priority.

I have seen a lot of users request it, but I'm not sure what round-robin
mode is intended to address in practice.  The most significant effect is
that it can cause the filesystem to reach ENOSPC earlier than necessary
if space is not distributed carefully, but users have requested it for
its perceived load-balancing properties.  To work properly, the
allocator needs to store some persistent state to remember the device
it last allocated on--without this, every umount/mount cycle would
reset the allocator, so it would either fill up a low-numbered devid all
the time, or it would behave the same way as legacy btrfs allocation.
This points to a btrfs item as a good place to store all the allocator
configuration and state information, so the allocator can remember where
it was in the round-robin sequence across mounts.
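
To make the persistence point concrete, a sketch with hypothetical
names (struct rr_state, rr_next()); it assumes the device list is
sorted by ascending devid and is not empty, and that last_devid is the
piece of state that would have to live in an on-disk item.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allocator state that would need to persist on disk. */
struct rr_state {
	uint64_t last_devid;	/* devid used by the previous allocation */
};

/*
 * Return the next devid after st->last_devid, wrapping to the first
 * device.  devids must be sorted ascending and ndevs must be > 0.
 */
static uint64_t rr_next(struct rr_state *st, const uint64_t *devids,
			size_t ndevs)
{
	uint64_t pick = devids[0];	/* wrap-around default */

	for (size_t i = 0; i < ndevs; i++) {
		if (devids[i] > st->last_devid) {
			pick = devids[i];
			break;
		}
	}
	st->last_devid = pick;
	return pick;
}
```

If last_devid is reloaded from disk at mount, the sequence continues
where it stopped.  If it is reset to zero at every mount, the first
allocation after each mount lands on the lowest devid again.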

> You can find more details about the current Allocation Priority list
> here:
> 
> 	https://asj.github.io/chunk-alloc-enhancement.html

I note that there is no "read-only" variant at this URL.

> That said, we could store the mode in a separate btrfs-key and keep the
> manual priority in dev_item::type, which would give us both.
> But as always, we try to avoid new on-disk keys unless absolutely
> necessary.
> [...]
> dev_item::dev_type (u64) comes from the reserved field list, so there's
> no additional space overhead in using it. I considered whether using a
> btrfs_key for roles would offer any advantage over dev_item::dev_type,
> but I couldn't find a clear benefit.

Heh.  Back in 2020 I got different opinions on new items (using the
existing PERSISTENT_ITEM key, but claiming some of the objectid space).
One reviewer wanted to control it via xattrs on the root directory.

On the one hand, using items (or even xattrs) means that schema upgrades
and schema size are practically unlimited.  As long as there's some way
to recognize all the new keys as part of the same feature, we don't need
to burn precious superblock compat bits for each one--the filesystem
would support allocation enhancement or not, and if it did, the kernel
would look into the items to find versioning information.

On the other hand, if the sysadmin has to cope with the cognitive load
of parsing more than 64 bits of different interrelated configuration
settings to predict where the filesystem is going to put its data, the
design places too much burden on users, and risks becoming impractical
from the start.  So there is value in keeping the schema small enough
to fit it all in one u64 in btrfs_dev_item.

This detail of the implementation doesn't matter very much given the
scope proposed so far.  There is room for 7 allocation type priorities
in the u64; today we need only 2, and in 10 years we might use up
to 5 priorities if all proposals I'm aware of are implemented.

If we do run out of bits in a u64 some day, we can always create new
items then.

> Also, with alloc_priority + alloc_mode, we can support a manual device
> order with the same cost.

> Let me still consider what you proposed again to see if there’s any
> advantage to doing it that way.
> 
> Good discussion, thanks a lot.
> 
> -Anand

Thanks for reading this far!

> > > Anand Jain (10):
> > >    btrfs: fix thresh scope in should_alloc_chunk()
> > >    btrfs: refactor should_alloc_chunk() arg type
> > >    btrfs: introduce btrfs_split_sysfs_arg() for argument parsing
> > >    btrfs: introduce device allocation method
> > >    btrfs: sysfs: show device allocation method
> > >    btrfs: skip device sorting when only one device is present
> > >    btrfs: refactor chunk allocation device handling to use list_head
> > >    btrfs: introduce explicit device roles for block groups
> > >    btrfs: introduce ROLE_THEN_SPACE device allocation method
> > >    btrfs: pass device roles through device add ioctl
> > > 
> > >   fs/btrfs/block-group.c |  11 +-
> > >   fs/btrfs/ioctl.c       |  12 +-
> > >   fs/btrfs/sysfs.c       | 130 ++++++++++++++++++++--
> > >   fs/btrfs/volumes.c     | 242 +++++++++++++++++++++++++++++++++--------
> > >   fs/btrfs/volumes.h     |  35 +++++-
> > >   5 files changed, 366 insertions(+), 64 deletions(-)
> > > 
> > > -- 
> > > 2.49.0
> > > 
> > > 
> > 
> 
> 
> 

Thread overview: 44+ messages
2025-05-12 18:07 [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Anand Jain
2025-05-12 18:07 ` [PATCH 01/10] btrfs: fix thresh scope in should_alloc_chunk() Anand Jain
2025-05-12 18:07 ` [PATCH 02/10] btrfs: refactor should_alloc_chunk() arg type Anand Jain
2025-05-12 18:07 ` [PATCH 03/10] btrfs: introduce btrfs_split_sysfs_arg() for argument parsing Anand Jain
2025-05-12 18:07 ` [PATCH 04/10] btrfs: introduce device allocation method Anand Jain
2025-05-12 18:07 ` [PATCH 05/10] btrfs: sysfs: show " Anand Jain
2025-05-12 18:07 ` [PATCH 06/10] btrfs: skip device sorting when only one device is present Anand Jain
2025-05-12 18:07 ` [PATCH 07/10] btrfs: refactor chunk allocation device handling to use list_head Anand Jain
2025-05-12 18:07 ` [PATCH 08/10] btrfs: introduce explicit device roles for block groups Anand Jain
2025-05-12 18:07 ` [PATCH 09/10] btrfs: introduce ROLE_THEN_SPACE device allocation method Anand Jain
2025-05-12 18:07 ` [PATCH 10/10] btrfs: pass device roles through device add ioctl Anand Jain
2025-05-12 18:09 ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation Anand Jain
2025-05-12 18:09   ` [PATCH 01/14] btrfs-progs: minor spelling correction in the list-chunk help text Anand Jain
2025-05-12 18:09   ` [PATCH 02/14] btrfs-progs: refactor devid comparison function Anand Jain
2025-05-12 18:09   ` [PATCH 03/14] btrfs-progs: rename local dev_list to devices in btrfs_alloc_chunk Anand Jain
2025-05-12 18:09   ` [PATCH 04/14] btrfs-progs: mkfs: prepare to merge duplicate if-else blocks Anand Jain
2025-05-12 18:09   ` [PATCH 05/14] btrfs-progs: mkfs: eliminate duplicate code in if-else Anand Jain
2025-05-12 18:09   ` [PATCH 06/14] btrfs-progs: mkfs: refactor test_num_disk_vs_raid - split data and metadata Anand Jain
2025-05-12 18:09   ` [PATCH 07/14] btrfs-progs: mkfs: device argument handling with a list Anand Jain
2025-05-12 18:09   ` [PATCH 08/14] btrfs-progs: import device role handling from the kernel Anand Jain
2025-05-12 18:09   ` [PATCH 09/14] btrfs-progs: mkfs: introduce device roles in device paths Anand Jain
2025-05-12 18:09   ` [PATCH 10/14] btrfs-progs: sort devices by role before using them Anand Jain
2025-05-12 18:09   ` [PATCH 11/14] btrfs-progs: helper for the device role within dev_item::type Anand Jain
2025-05-12 18:09   ` [PATCH 12/14] btrfs-progs: mkfs: persist device roles to dev_item::type Anand Jain
2025-05-12 18:09   ` [PATCH 13/14] btrfs-progs: update device add ioctl with device type Anand Jain
2025-05-12 18:09   ` [PATCH 14/14] btrfs-progs: disable exclusive metadata/data device roles Anand Jain
2025-06-20 16:46   ` [PATCH RFC 00/14] btrfs-progs: add support for device role-based chunk allocation David Sterba
2025-05-12 18:11 ` [PATCH RFC 0/2] fstests: btrfs: add functional verification for device roles Anand Jain
2025-05-12 18:11   ` [PATCH 1/2] fstests: common/btrfs: add _require_btrfs_feature_device_roles Anand Jain
2025-05-12 18:11   ` [PATCH 2/2] fstests: btrfs/366: add test for device role-based chunk allocation Anand Jain
2025-05-20  9:19 ` [PATCH RFC 00/10] btrfs: new performance-based chunk allocation using device roles Forza
2025-05-21  8:37   ` Anand Jain
2025-05-22  4:07 ` Zygo Blaxell
2025-06-02  4:26   ` Anand Jain
2025-06-21  1:11     ` Zygo Blaxell
2025-05-22 18:19 ` waxhead
2025-06-02  4:25   ` Anand Jain
2025-06-06 14:21     ` waxhead
2025-05-22 20:39 ` Ferry Toth
2025-06-02  4:24   ` Anand Jain
2025-06-04 21:29     ` Ferry Toth
2025-06-04 21:48       ` Anand Jain
2025-05-30  0:15 ` Jani Partanen
2025-06-02  4:25   ` Anand Jain
