* [RFC] btrfs: ssd_metadata: storing metadata on SSD
@ 2020-04-01 20:03 Goffredo Baroncelli
2020-04-01 20:03 ` [PATCH] btrfs: add ssd_metadata mode Goffredo Baroncelli
0 siblings, 1 reply; 23+ messages in thread
From: Goffredo Baroncelli @ 2020-04-01 20:03 UTC (permalink / raw)
To: linux-btrfs
This is an RFC; I wrote this patch because I find the idea interesting,
even though it adds more complexity to the chunk allocator.
The core idea is to store the metadata on the SSDs and leave the data
on the rotational disks. Btrfs looks at each device's rotational flag
to determine its kind.
This new mode is enabled by passing the option ssd_metadata at mount
time.
This allocation policy is the "preferred" one. If it does not permit a
chunk allocation, the "classic" one is used instead.
Some examples (/dev/sd[abc] are SSDs, and /dev/sd[ef] are rotational
disks):

Non-striped profile: metadata->raid1, data->raid1
The data is stored on /dev/sd[ef]; the metadata is stored on
/dev/sd[abc]. When /dev/sd[ef] are full, data chunks are allocated on
/dev/sd[abc] as well.

Striped profile: metadata->raid5, data->raid5
raid5 requires at least 3 disks, so /dev/sd[ef] alone are not enough
for a raid5 data profile. To allow a data chunk allocation, the raid5
data profile is spread across all the disks, /dev/sd[abcdef].
The raid5 metadata profile, instead, is allocated on /dev/sd[abc]
only, because these disks are enough to host it.
NOTE1: I don't touch the mkfs.btrfs program, so the new policy is
applied only to subsequent chunk allocations. To place the chunks
created by mkfs.btrfs according to the new policy, a btrfs balance is
required.
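For example, after mounting with the new option, rewriting only the
metadata chunks is enough to migrate them (a sketch; /dev/sda and /mnt
are placeholders for your device and mount point):

```
mount -o ssd_metadata /dev/sda /mnt
btrfs balance start -m /mnt
```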
NOTE2: I made only a few tests. Don't use this for valuable data.
BR
G.Baroncelli
* [PATCH] btrfs: add ssd_metadata mode
2020-04-01 20:03 [RFC] btrfs: ssd_metadata: storing metadata on SSD Goffredo Baroncelli
@ 2020-04-01 20:03 ` Goffredo Baroncelli
2020-04-01 20:53 ` Goffredo Baroncelli
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Goffredo Baroncelli @ 2020-04-01 20:03 UTC (permalink / raw)
To: linux-btrfs; +Cc: Goffredo Baroncelli
From: Goffredo Baroncelli <kreijack@inwind.it>
When this mode is enabled, the chunk allocation policy is
modified as follows:
- when a metadata chunk is allocated, priority is given to the
SSD disks;
- when a data chunk is allocated, priority is given to the
rotational disks.
When a striped profile is involved (like RAID0, RAID5, RAID6), the
logic is a bit more complex. If there are enough disks, the data
profiles are stored on the rotational disks only; the metadata
profiles are stored on the non-rotational disks only.
If there are not enough disks of the preferred kind, the profile is
stored on all the disks.
Example: assume that sda, sdb, sdc are SSD disks, and sde, sdf are
rotational ones.
A raid5 data profile will be stored on sda, sdb, sdc, sde and sdf
(sde and sdf alone are not enough to host a raid5 profile).
A raid5 metadata profile will be stored on sda, sdb and sdc (these
are enough to host a raid5 profile).
To enable this mode pass -o ssd_metadata at mount time.
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
fs/btrfs/ctree.h | 1 +
fs/btrfs/super.c | 8 +++++
fs/btrfs/volumes.c | 89 ++++++++++++++++++++++++++++++++++++++++++++--
fs/btrfs/volumes.h | 1 +
4 files changed, 97 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2e9f938508e9..0f3c09cc4863 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1187,6 +1187,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
#define BTRFS_MOUNT_FREE_SPACE_TREE (1 << 26)
#define BTRFS_MOUNT_NOLOGREPLAY (1 << 27)
#define BTRFS_MOUNT_REF_VERIFY (1 << 28)
+#define BTRFS_MOUNT_SSD_METADATA (1 << 29)
#define BTRFS_DEFAULT_COMMIT_INTERVAL (30)
#define BTRFS_DEFAULT_MAX_INLINE (2048)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index c6557d44907a..d0a5cf496f90 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -346,6 +346,7 @@ enum {
#ifdef CONFIG_BTRFS_FS_REF_VERIFY
Opt_ref_verify,
#endif
+ Opt_ssd_metadata,
Opt_err,
};
@@ -416,6 +417,7 @@ static const match_table_t tokens = {
#ifdef CONFIG_BTRFS_FS_REF_VERIFY
{Opt_ref_verify, "ref_verify"},
#endif
+ {Opt_ssd_metadata, "ssd_metadata"},
{Opt_err, NULL},
};
@@ -853,6 +855,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
btrfs_set_opt(info->mount_opt, REF_VERIFY);
break;
#endif
+ case Opt_ssd_metadata:
+ btrfs_set_and_info(info, SSD_METADATA,
+ "enabling ssd_metadata");
+ break;
case Opt_err:
btrfs_info(info, "unrecognized mount option '%s'", p);
ret = -EINVAL;
@@ -1369,6 +1375,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
#endif
if (btrfs_test_opt(info, REF_VERIFY))
seq_puts(seq, ",ref_verify");
+ if (btrfs_test_opt(info, SSD_METADATA))
+ seq_puts(seq, ",ssd_metadata");
seq_printf(seq, ",subvolid=%llu",
BTRFS_I(d_inode(dentry))->root->root_key.objectid);
seq_puts(seq, ",subvol=");
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a8b71ded4d21..678dc3366711 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4758,6 +4758,58 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
return 0;
}
+/*
+ * sort the devices in descending order by rotational,
+ * max_avail, total_avail
+ */
+static int btrfs_cmp_device_info_metadata(const void *a, const void *b)
+{
+ const struct btrfs_device_info *di_a = a;
+ const struct btrfs_device_info *di_b = b;
+
+ /* metadata -> non rotational first */
+ if (!di_a->rotational && di_b->rotational)
+ return -1;
+ if (di_a->rotational && !di_b->rotational)
+ return 1;
+ if (di_a->max_avail > di_b->max_avail)
+ return -1;
+ if (di_a->max_avail < di_b->max_avail)
+ return 1;
+ if (di_a->total_avail > di_b->total_avail)
+ return -1;
+ if (di_a->total_avail < di_b->total_avail)
+ return 1;
+ return 0;
+}
+
+/*
+ * sort the devices in descending order by !rotational,
+ * max_avail, total_avail
+ */
+static int btrfs_cmp_device_info_data(const void *a, const void *b)
+{
+ const struct btrfs_device_info *di_a = a;
+ const struct btrfs_device_info *di_b = b;
+
+ /* data -> non rotational last */
+ if (!di_a->rotational && di_b->rotational)
+ return 1;
+ if (di_a->rotational && !di_b->rotational)
+ return -1;
+ if (di_a->max_avail > di_b->max_avail)
+ return -1;
+ if (di_a->max_avail < di_b->max_avail)
+ return 1;
+ if (di_a->total_avail > di_b->total_avail)
+ return -1;
+ if (di_a->total_avail < di_b->total_avail)
+ return 1;
+ return 0;
+}
+
+
+
static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
{
if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
@@ -4805,6 +4857,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
int i;
int j;
int index;
+ int nr_rotational;
BUG_ON(!alloc_profile_is_valid(type, 0));
@@ -4860,6 +4913,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
* about the available holes on each device.
*/
ndevs = 0;
+ nr_rotational = 0;
list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
u64 max_avail;
u64 dev_offset;
@@ -4911,14 +4965,45 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
devices_info[ndevs].max_avail = max_avail;
devices_info[ndevs].total_avail = total_avail;
devices_info[ndevs].dev = device;
+ devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_NONROT,
+ &(bdev_get_queue(device->bdev)->queue_flags));
+ if (devices_info[ndevs].rotational)
+ nr_rotational++;
++ndevs;
}
+ BUG_ON(nr_rotational > ndevs);
/*
* now sort the devices by hole size / available space
*/
- sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
- btrfs_cmp_device_info, NULL);
+ if (((type & BTRFS_BLOCK_GROUP_DATA) &&
+ (type & BTRFS_BLOCK_GROUP_METADATA)) ||
+ !btrfs_test_opt(info, SSD_METADATA)) {
+ /* mixed bg or SSD_METADATA not set */
+ sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+ btrfs_cmp_device_info, NULL);
+ } else {
+ /*
+ * if SSD_METADATA is set, sort the device considering also the
+ * kind (ssd or not). Limit the availables devices to the ones
+ * of the same kind, to avoid that a striped profile like raid5
+ * spans to all kind of devices (ssd and rotational).
+ * It is allowed to span different kind of devices if the ones of
+ * the same kind are not enough alone.
+ */
+ if (type & BTRFS_BLOCK_GROUP_DATA) {
+ sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+ btrfs_cmp_device_info_data, NULL);
+ if (nr_rotational > devs_min)
+ ndevs = nr_rotational;
+ } else {
+ int nr_norot = ndevs - nr_rotational;
+ sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+ btrfs_cmp_device_info_metadata, NULL);
+ if (nr_norot > devs_min)
+ ndevs = nr_norot;
+ }
+ }
/*
* Round down to number of usable stripes, devs_increment can be any
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index fc1b564b9cfe..bc1cfa0c27ea 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -340,6 +340,7 @@ struct btrfs_device_info {
u64 dev_offset;
u64 max_avail;
u64 total_avail;
+ int rotational:1;
};
struct btrfs_raid_attr {
--
2.26.0
* Re: [PATCH] btrfs: add ssd_metadata mode
2020-04-01 20:03 ` [PATCH] btrfs: add ssd_metadata mode Goffredo Baroncelli
@ 2020-04-01 20:53 ` Goffredo Baroncelli
2020-04-02 9:33 ` Steven Davies
2020-04-02 18:01 ` Martin Svec
2 siblings, 0 replies; 23+ messages in thread
From: Goffredo Baroncelli @ 2020-04-01 20:53 UTC (permalink / raw)
To: Goffredo Baroncelli, linux-btrfs
On 4/1/20 10:03 PM, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
>
> When this mode is enabled, the allocation policy of the chunk
> is so modified:
> - when a metadata chunk is allocated, priority is given to
> ssd disk.
> - When a data chunk is allocated, priority is given to a
> rotational disk.
>
> When a striped profile is involved (like RAID0,5,6), the logic
> is a bit more complex. If there are enough disks, the data profiles
> are stored on the rotational disks only; the metadata profiles
> are stored on the non rotational disk only.
> If the disks are not enough, then the profiles is stored on all
> the disks.
>
> Example: assuming that sda, sdb, sdc are ssd disks, and sde, sdf are
> rotational ones.
> A data profile raid5, will be stored on sda, sdb, sdc, sde, sdf (sde
> and sdf are not enough to host a raid5 profile).
> A metadata profile raid5, will be stored on sda, sdb, sdc (these
> are enough to host a raid5 profile).
The example is a bit wrong: replace raid5 with raid6. However, I hope
that the intent is still clear.
>
> To enable this mode pass -o ssd_metadata at mount time.
>
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> ---
> fs/btrfs/ctree.h | 1 +
> fs/btrfs/super.c | 8 +++++
> fs/btrfs/volumes.c | 89 ++++++++++++++++++++++++++++++++++++++++++++--
> fs/btrfs/volumes.h | 1 +
> 4 files changed, 97 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2e9f938508e9..0f3c09cc4863 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1187,6 +1187,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
> #define BTRFS_MOUNT_FREE_SPACE_TREE (1 << 26)
> #define BTRFS_MOUNT_NOLOGREPLAY (1 << 27)
> #define BTRFS_MOUNT_REF_VERIFY (1 << 28)
> +#define BTRFS_MOUNT_SSD_METADATA (1 << 29)
>
> #define BTRFS_DEFAULT_COMMIT_INTERVAL (30)
> #define BTRFS_DEFAULT_MAX_INLINE (2048)
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index c6557d44907a..d0a5cf496f90 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -346,6 +346,7 @@ enum {
> #ifdef CONFIG_BTRFS_FS_REF_VERIFY
> Opt_ref_verify,
> #endif
> + Opt_ssd_metadata,
> Opt_err,
> };
>
> @@ -416,6 +417,7 @@ static const match_table_t tokens = {
> #ifdef CONFIG_BTRFS_FS_REF_VERIFY
> {Opt_ref_verify, "ref_verify"},
> #endif
> + {Opt_ssd_metadata, "ssd_metadata"},
> {Opt_err, NULL},
> };
>
> @@ -853,6 +855,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
> btrfs_set_opt(info->mount_opt, REF_VERIFY);
> break;
> #endif
> + case Opt_ssd_metadata:
> + btrfs_set_and_info(info, SSD_METADATA,
> + "enabling ssd_metadata");
> + break;
> case Opt_err:
> btrfs_info(info, "unrecognized mount option '%s'", p);
> ret = -EINVAL;
> @@ -1369,6 +1375,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
> #endif
> if (btrfs_test_opt(info, REF_VERIFY))
> seq_puts(seq, ",ref_verify");
> + if (btrfs_test_opt(info, SSD_METADATA))
> + seq_puts(seq, ",ssd_metadata");
> seq_printf(seq, ",subvolid=%llu",
> BTRFS_I(d_inode(dentry))->root->root_key.objectid);
> seq_puts(seq, ",subvol=");
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index a8b71ded4d21..678dc3366711 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4758,6 +4758,58 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
> return 0;
> }
>
> +/*
> + * sort the devices in descending order by rotational,
> + * max_avail, total_avail
> + */
> +static int btrfs_cmp_device_info_metadata(const void *a, const void *b)
> +{
> + const struct btrfs_device_info *di_a = a;
> + const struct btrfs_device_info *di_b = b;
> +
> + /* metadata -> non rotational first */
> + if (!di_a->rotational && di_b->rotational)
> + return -1;
> + if (di_a->rotational && !di_b->rotational)
> + return 1;
> + if (di_a->max_avail > di_b->max_avail)
> + return -1;
> + if (di_a->max_avail < di_b->max_avail)
> + return 1;
> + if (di_a->total_avail > di_b->total_avail)
> + return -1;
> + if (di_a->total_avail < di_b->total_avail)
> + return 1;
> + return 0;
> +}
> +
> +/*
> + * sort the devices in descending order by !rotational,
> + * max_avail, total_avail
> + */
> +static int btrfs_cmp_device_info_data(const void *a, const void *b)
> +{
> + const struct btrfs_device_info *di_a = a;
> + const struct btrfs_device_info *di_b = b;
> +
> + /* data -> non rotational last */
> + if (!di_a->rotational && di_b->rotational)
> + return 1;
> + if (di_a->rotational && !di_b->rotational)
> + return -1;
> + if (di_a->max_avail > di_b->max_avail)
> + return -1;
> + if (di_a->max_avail < di_b->max_avail)
> + return 1;
> + if (di_a->total_avail > di_b->total_avail)
> + return -1;
> + if (di_a->total_avail < di_b->total_avail)
> + return 1;
> + return 0;
> +}
> +
> +
> +
> static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
> {
> if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
> @@ -4805,6 +4857,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> int i;
> int j;
> int index;
> + int nr_rotational;
>
> BUG_ON(!alloc_profile_is_valid(type, 0));
>
> @@ -4860,6 +4913,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> * about the available holes on each device.
> */
> ndevs = 0;
> + nr_rotational = 0;
> list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
> u64 max_avail;
> u64 dev_offset;
> @@ -4911,14 +4965,45 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> devices_info[ndevs].max_avail = max_avail;
> devices_info[ndevs].total_avail = total_avail;
> devices_info[ndevs].dev = device;
> + devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_NONROT,
> + &(bdev_get_queue(device->bdev)->queue_flags));
> + if (devices_info[ndevs].rotational)
> + nr_rotational++;
> ++ndevs;
> }
>
> + BUG_ON(nr_rotational > ndevs);
> /*
> * now sort the devices by hole size / available space
> */
> - sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> - btrfs_cmp_device_info, NULL);
> + if (((type & BTRFS_BLOCK_GROUP_DATA) &&
> + (type & BTRFS_BLOCK_GROUP_METADATA)) ||
> + !btrfs_test_opt(info, SSD_METADATA)) {
> + /* mixed bg or SSD_METADATA not set */
> + sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> + btrfs_cmp_device_info, NULL);
> + } else {
> + /*
> + * if SSD_METADATA is set, sort the device considering also the
> + * kind (ssd or not). Limit the availables devices to the ones
> + * of the same kind, to avoid that a striped profile like raid5
> + * spans to all kind of devices (ssd and rotational).
> + * It is allowed to span different kind of devices if the ones of
> + * the same kind are not enough alone.
> + */
> + if (type & BTRFS_BLOCK_GROUP_DATA) {
> + sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> + btrfs_cmp_device_info_data, NULL);
> + if (nr_rotational > devs_min)
It should be
if (nr_rotational >= devs_min)
> + ndevs = nr_rotational;
> + } else {
> + int nr_norot = ndevs - nr_rotational;
> + sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> + btrfs_cmp_device_info_metadata, NULL);
> + if (nr_norot > devs_min)
It should be
if (nr_norot >= devs_min)
> + ndevs = nr_norot;
> + }
> + }
>
> /*
> * Round down to number of usable stripes, devs_increment can be any
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index fc1b564b9cfe..bc1cfa0c27ea 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -340,6 +340,7 @@ struct btrfs_device_info {
> u64 dev_offset;
> u64 max_avail;
> u64 total_avail;
> + int rotational:1;
> };
>
> struct btrfs_raid_attr {
>
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: [PATCH] btrfs: add ssd_metadata mode
2020-04-01 20:03 ` [PATCH] btrfs: add ssd_metadata mode Goffredo Baroncelli
2020-04-01 20:53 ` Goffredo Baroncelli
@ 2020-04-02 9:33 ` Steven Davies
2020-04-02 16:39 ` Goffredo Baroncelli
2020-04-03 8:43 ` Michael
2020-04-02 18:01 ` Martin Svec
2 siblings, 2 replies; 23+ messages in thread
From: Steven Davies @ 2020-04-02 9:33 UTC (permalink / raw)
To: Goffredo Baroncelli; +Cc: linux-btrfs, Anand Jain
On 01/04/2020 21:03, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
>
> When this mode is enabled, the allocation policy of the chunk
> is so modified:
> - when a metadata chunk is allocated, priority is given to
> ssd disk.
> - When a data chunk is allocated, priority is given to a
> rotational disk.
>
> When a striped profile is involved (like RAID0,5,6), the logic
> is a bit more complex. If there are enough disks, the data profiles
> are stored on the rotational disks only; the metadata profiles
> are stored on the non rotational disk only.
> If the disks are not enough, then the profiles is stored on all
> the disks.
>
> Example: assuming that sda, sdb, sdc are ssd disks, and sde, sdf are
> rotational ones.
> A data profile raid5, will be stored on sda, sdb, sdc, sde, sdf (sde
> and sdf are not enough to host a raid5 profile).
> A metadata profile raid5, will be stored on sda, sdb, sdc (these
> are enough to host a raid5 profile).
>
> To enable this mode pass -o ssd_metadata at mount time.
>
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
The idea of this sounds similar to what Anand has been working on with
the readmirror patchset[1] which was originally designed to prefer
reading from SSD devices in a RAID1 configuration but has evolved into
allowing the read policy to be configured through sysfs, at least partly
because detecting SSDs correctly is not an exact science. Also, there
may be more considerations than just HDD or SSD: for example in my
system I use a SATA SSD and an NVMe SSD in RAID1 where the NVMe device
is twice the speed of the SSD.
I would therefore vote for configurability of this rather than always
choosing SSD over HDD.
[1] https://patchwork.kernel.org/project/linux-btrfs/list/?series=245121
--
Steven Davies
* Re: [PATCH] btrfs: add ssd_metadata mode
2020-04-02 9:33 ` Steven Davies
@ 2020-04-02 16:39 ` Goffredo Baroncelli
2020-04-03 8:43 ` Michael
1 sibling, 0 replies; 23+ messages in thread
From: Goffredo Baroncelli @ 2020-04-02 16:39 UTC (permalink / raw)
To: Steven Davies; +Cc: linux-btrfs, Anand Jain
On 4/2/20 11:33 AM, Steven Davies wrote:
> On 01/04/2020 21:03, Goffredo Baroncelli wrote:
>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>
>> When this mode is enabled, the allocation policy of the chunk
>> is so modified:
>> - when a metadata chunk is allocated, priority is given to
>> ssd disk.
>> - When a data chunk is allocated, priority is given to a
>> rotational disk.
>>
>> When a striped profile is involved (like RAID0,5,6), the logic
>> is a bit more complex. If there are enough disks, the data profiles
>> are stored on the rotational disks only; the metadata profiles
>> are stored on the non rotational disk only.
>> If the disks are not enough, then the profiles is stored on all
>> the disks.
>>
>> Example: assuming that sda, sdb, sdc are ssd disks, and sde, sdf are
>> rotational ones.
>> A data profile raid5, will be stored on sda, sdb, sdc, sde, sdf (sde
>> and sdf are not enough to host a raid5 profile).
>> A metadata profile raid5, will be stored on sda, sdb, sdc (these
>> are enough to host a raid5 profile).
>>
>> To enable this mode pass -o ssd_metadata at mount time.
>>
>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>
> The idea of this sounds similar to what Anand has been working on with the readmirror patchset[1] which was originally designed to prefer reading from SSD devices in a RAID1 configuration but has evolved into allowing the read policy to be configured through sysfs,
The work done by Anand is very different: he is developing better
policies for selecting which mirror to read from.
I am investigating the possibility of storing (reading and *writing*)
the metadata on SSDs and the data on rotational disks.
> at least partly because detecting SSDs correctly is not an exact science. Also, there may be more considerations than just HDD or SSD: for example in my system I use a SATA SSD and an NVMe SSD in RAID1 where the NVMe device is twice the speed of the SSD.
There is the "rotational" attribute, which can be set either by the kernel or by appropriate udev rules.
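For example, a udev rule can override the rotational attribute of a
device that misreports it (a sketch; the sdb match and the rule file
name are placeholders):

```
# /etc/udev/rules.d/60-rotational-override.rules
# Treat sdb as non-rotational regardless of what the kernel detected
ACTION=="add|change", KERNEL=="sdb", ATTR{queue/rotational}="0"
```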
>
> I would therefore vote for configurability of this rather than always choosing SSD over HDD.
As stated above, these are two completely different patch sets, which address two different problems.
>
> [1] https://patchwork.kernel.org/project/linux-btrfs/list/?series=245121
>
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: [PATCH] btrfs: add ssd_metadata mode
2020-04-01 20:03 ` [PATCH] btrfs: add ssd_metadata mode Goffredo Baroncelli
2020-04-01 20:53 ` Goffredo Baroncelli
2020-04-02 9:33 ` Steven Davies
@ 2020-04-02 18:01 ` Martin Svec
2 siblings, 0 replies; 23+ messages in thread
From: Martin Svec @ 2020-04-02 18:01 UTC (permalink / raw)
To: Goffredo Baroncelli, linux-btrfs; +Cc: Goffredo Baroncelli
Hi Goffredo,
we've been using a similar in-house patch on our backup servers for
years, and the performance impact is HUGE. Our backup workload includes
rsyncs of mailservers/webservers and image-based CBT vSphere backups;
we have hundreds of millions of files and thousands of daily snapshots
on every backup server. None of this would be possible without
dedicated metadata SSDs. A typical btrfs backup server has two 500 GB
NVMe SSDs in btrfs RAID1 and twelve 10 TB SATA drives in hardware RAID6.
Our chunk allocation logic is fairly simple: if the filesystem contains
both rotational and non-rotational drives and there is a metadata chunk
allocation request, ignore the rotational drives in
__btrfs_alloc_chunk(); in the same way, ignore the non-rotational
drives when allocating a data chunk. If the allocation request cannot
be satisfied, fall back to the standard __btrfs_alloc_chunk()
implementation, which uses all available drives.
Martin
* Re: [PATCH] btrfs: add ssd_metadata mode
2020-04-02 9:33 ` Steven Davies
2020-04-02 16:39 ` Goffredo Baroncelli
@ 2020-04-03 8:43 ` Michael
2020-04-03 10:08 ` Steven Davies
2020-04-03 16:19 ` Goffredo Baroncelli
1 sibling, 2 replies; 23+ messages in thread
From: Michael @ 2020-04-03 8:43 UTC (permalink / raw)
To: Steven Davies, Goffredo Baroncelli; +Cc: linux-btrfs, Anand Jain
02.04.2020 12:33, Steven Davies wrote:
> On 01/04/2020 21:03, Goffredo Baroncelli wrote:
>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>
>> When this mode is enabled, the allocation policy of the chunk
>> is so modified:
>> - when a metadata chunk is allocated, priority is given to
>> ssd disk.
>> - When a data chunk is allocated, priority is given to a
>> rotational disk.
>>
>> When a striped profile is involved (like RAID0,5,6), the logic
>> is a bit more complex. If there are enough disks, the data profiles
>> are stored on the rotational disks only; the metadata profiles
>> are stored on the non rotational disk only.
>> If the disks are not enough, then the profiles is stored on all
>> the disks.
>>
>> Example: assuming that sda, sdb, sdc are ssd disks, and sde, sdf are
>> rotational ones.
>> A data profile raid5, will be stored on sda, sdb, sdc, sde, sdf (sde
>> and sdf are not enough to host a raid5 profile).
>> A metadata profile raid5, will be stored on sda, sdb, sdc (these
>> are enough to host a raid5 profile).
>>
>> To enable this mode pass -o ssd_metadata at mount time.
>>
>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>
> The idea of this sounds similar to what Anand has been working on with
> the readmirror patchset[1] which was originally designed to prefer
> reading from SSD devices in a RAID1 configuration but has evolved into
> allowing the read policy to be configured through sysfs, at least
> partly because detecting SSDs correctly is not an exact science. Also,
> there may be more considerations than just HDD or SSD: for example in
> my system I use a SATA SSD and an NVMe SSD in RAID1 where the NVMe
> device is twice the speed of the SSD.
Maybe something like -o
metadata_preferred_devices=device_id[,device_id[,device_id]]... ?
>
> I would therefore vote for configurability of this rather than always
> choosing SSD over HDD.
>
> [1] https://patchwork.kernel.org/project/linux-btrfs/list/?series=245121
>
--
Best regards, Mikhail
067-786-11-75
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] btrfs: add ssd_metadata mode
2020-04-03 8:43 ` Michael
@ 2020-04-03 10:08 ` Steven Davies
2020-04-03 16:19 ` Goffredo Baroncelli
1 sibling, 0 replies; 23+ messages in thread
From: Steven Davies @ 2020-04-03 10:08 UTC (permalink / raw)
To: Michael, Goffredo Baroncelli; +Cc: linux-btrfs, Anand Jain
On 03/04/2020 09:43, Michael wrote:
> On 02.04.2020 12:33, Steven Davies wrote:
>> On 01/04/2020 21:03, Goffredo Baroncelli wrote:
>>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>>
>>> When this mode is enabled, the allocation policy of the chunk
>>> is so modified:
>>> - when a metadata chunk is allocated, priority is given to
>>> ssd disk.
>>> - When a data chunk is allocated, priority is given to a
>>> rotational disk.
>>>
>>> When a striped profile is involved (like RAID0,5,6), the logic
>>> is a bit more complex. If there are enough disks, the data profiles
>>> are stored on the rotational disks only; the metadata profiles
>>> are stored on the non rotational disk only.
>>> If the disks are not enough, then the profiles is stored on all
>>> the disks.
>>>
>>> Example: assuming that sda, sdb, sdc are ssd disks, and sde, sdf are
>>> rotational ones.
>>> A data profile raid5, will be stored on sda, sdb, sdc, sde, sdf (sde
>>> and sdf are not enough to host a raid5 profile).
>>> A metadata profile raid5, will be stored on sda, sdb, sdc (these
>>> are enough to host a raid5 profile).
>>>
>>> To enable this mode pass -o ssd_metadata at mount time.
>>>
>>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>>
>> The idea of this sounds similar to what Anand has been working on with
>> the readmirror patchset[1] which was originally designed to prefer
>> reading from SSD devices in a RAID1 configuration but has evolved into
>> allowing the read policy to be configured through sysfs, at least
>> partly because detecting SSDs correctly is not an exact science. Also,
>> there may be more considerations than just HDD or SSD: for example in
>> my system I use a SATA SSD and an NVMe SSD in RAID1 where the NVMe
>> device is twice the speed of the SSD.
> May be something like -o
> metadata_preferred_devices=device_id,[device_id,[device_id]]... ?
Yes, that's what I was thinking of.
--
Steven Davies
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] btrfs: add ssd_metadata mode
2020-04-03 8:43 ` Michael
2020-04-03 10:08 ` Steven Davies
@ 2020-04-03 16:19 ` Goffredo Baroncelli
2020-04-03 16:28 ` Hugo Mills
1 sibling, 1 reply; 23+ messages in thread
From: Goffredo Baroncelli @ 2020-04-03 16:19 UTC (permalink / raw)
To: Michael, Steven Davies; +Cc: linux-btrfs, Anand Jain
On 4/3/20 10:43 AM, Michael wrote:
> On 02.04.2020 12:33, Steven Davies wrote:
>> On 01/04/2020 21:03, Goffredo Baroncelli wrote:
>>> From: Goffredo Baroncelli <kreijack@inwind.it>
[...]
>>> To enable this mode pass -o ssd_metadata at mount time.
>>>
>>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>>
>> The idea of this sounds similar to what Anand has been working on with the readmirror patchset[1] which was originally designed to prefer reading from SSD devices in a RAID1 configuration but has evolved into allowing the read policy to be configured through sysfs, at least partly because detecting SSDs correctly is not an exact science. Also, there may be more considerations than just HDD or SSD: for example in my system I use a SATA SSD and an NVMe SSD in RAID1 where the NVMe device is twice the speed of the SSD.
> May be something like -o metadata_preferred_devices=device_id,[device_id,[device_id]]... ?
I think it should be a device property instead of a device list passed at mount time.
Looking at the btrfs_dev_item structure (which is embedded in the super block), there are several promising fields:
- seek_speed
- bandwidth
- dev_group
Currently these fields are unused, so we are free to use them as we wish. For example, if we use two bits of dev_group as markers
for "not for metadata" and "not for data", we can have the following combinations:
- 0 (default) -> the disk is suitable for both data and metadata
- "not for metadata" -> the disk has a high priority for "data"
- "not for data" -> the disk has a high priority for "metadata"
- "not for data" and "not for metadata" -> invalid
As a kernel<->user space interface, I like the idea of exporting these bits via sysfs... unfortunately there is not yet a sysfs directory for the devices. :-(
>>
>> I would therefore vote for configurability of this rather than always choosing SSD over HDD.
>>
>> [1] https://patchwork.kernel.org/project/linux-btrfs/list/?series=245121
>>
>
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
^ permalink raw reply [flat|nested] 23+ messages in thread
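The two-bit dev_group proposal in the message above can be made concrete with a small sketch. Everything here is hypothetical: the flag names, bit positions, and validity helper are illustrative only — nothing below is defined by btrfs on-disk format or the patch itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding of the two proposed dev_group bits
 * (names and positions are illustrative, not part of btrfs). */
#define DEV_GROUP_NO_METADATA	(1u << 0)	/* disk preferred for data */
#define DEV_GROUP_NO_DATA	(1u << 1)	/* disk preferred for metadata */

/*
 * A dev_group value is valid unless both bits are set, which would
 * leave the disk unusable for both data and metadata.
 * Returns 1 if valid, 0 otherwise.
 */
static int dev_group_valid(uint32_t dev_group)
{
	uint32_t bits = dev_group &
			(DEV_GROUP_NO_METADATA | DEV_GROUP_NO_DATA);

	return bits != (DEV_GROUP_NO_METADATA | DEV_GROUP_NO_DATA);
}
```

This mirrors the four combinations listed in the proposal: 0 and the two single-bit values are accepted, both bits together are rejected.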
* Re: [PATCH] btrfs: add ssd_metadata mode
2020-04-03 16:19 ` Goffredo Baroncelli
@ 2020-04-03 16:28 ` Hugo Mills
2020-04-03 16:36 ` Hans van Kranenburg
0 siblings, 1 reply; 23+ messages in thread
From: Hugo Mills @ 2020-04-03 16:28 UTC (permalink / raw)
To: kreijack; +Cc: Michael, Steven Davies, linux-btrfs, Anand Jain
On Fri, Apr 03, 2020 at 06:19:59PM +0200, Goffredo Baroncelli wrote:
> On 4/3/20 10:43 AM, Michael wrote:
> > On 02.04.2020 12:33, Steven Davies wrote:
> > > On 01/04/2020 21:03, Goffredo Baroncelli wrote:
> > > > From: Goffredo Baroncelli <kreijack@inwind.it>
> [...]
> > > > To enable this mode pass -o ssd_metadata at mount time.
> > > >
> > > > Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> > >
> > > The idea of this sounds similar to what Anand has been working on with the readmirror patchset[1] which was originally designed to prefer reading from SSD devices in a RAID1 configuration but has evolved into allowing the read policy to be configured through sysfs, at least partly because detecting SSDs correctly is not an exact science. Also, there may be more considerations than just HDD or SSD: for example in my system I use a SATA SSD and an NVMe SSD in RAID1 where the NVMe device is twice the speed of the SSD.
> > May be something like -o metadata_preferred_devices=device_id,[device_id,[device_id]]... ?
>
> I think that it should be a device property instead of a device list passed at mount time.
> Looking at the btrfs_dev_item structure (which is embedded in the super block), there are several promising fields:
> - seek_speed
> - bandwidth
> - dev_group
>
> currently these fields are unused.
At least the first two of these were left over from the last
attempt at this that got anywhere near code.
If you're going to do this kind of thing, please read (at least) my
(sadly code-less; sorry) proposal from a few years ago:
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg33632.html
Essentially, allowing for multiple policies for chunk allocation,
of which this case is a small subset.
I'd envisaged putting the relevant config data into properties, but
since we didn't actually have them in any sense at the time, I got
bogged down in that part of it.
> Se we are free to use as we wish. For example, if we use 2 bit of dev_group as marker
> for "not for metadata" and "not for date" we can have the following combination:
> - 0 (default) -> the disk is suitable for both data and metadata
> - "not for metadata" -> the disk has an high priority for "data"
> - "not for data" -> the disk has an high priority for "metadata"
> - "not for data" and "not for metadata" -> invalid
>
> As kernel<->user space interface, I like the idea to export these bits via sysfs.. unfortunately there is no a directory for the devices.... :-(
>
>
> > >
> > > I would therefore vote for configurability of this rather than always choosing SSD over HDD.
> > >
> > > [1] https://patchwork.kernel.org/project/linux-btrfs/list/?series=245121
> > >
> >
>
>
--
Hugo Mills | Some days, it's just not worth gnawing through the
hugo@... carfax.org.uk | straps
http://carfax.org.uk/ |
PGP: E2AB1DE4 |
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] btrfs: add ssd_metadata mode
2020-04-03 16:28 ` Hugo Mills
@ 2020-04-03 16:36 ` Hans van Kranenburg
0 siblings, 0 replies; 23+ messages in thread
From: Hans van Kranenburg @ 2020-04-03 16:36 UTC (permalink / raw)
To: Hugo Mills, kreijack, Michael, Steven Davies, linux-btrfs,
Anand Jain
On 4/3/20 6:28 PM, Hugo Mills wrote:
> On Fri, Apr 03, 2020 at 06:19:59PM +0200, Goffredo Baroncelli wrote:
>> On 4/3/20 10:43 AM, Michael wrote:
>>> On 02.04.2020 12:33, Steven Davies wrote:
>>>> On 01/04/2020 21:03, Goffredo Baroncelli wrote:
>>>>> From: Goffredo Baroncelli <kreijack@inwind.it>
>> [...]
>>>>> To enable this mode pass -o ssd_metadata at mount time.
>>>>>
>>>>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>>>>
>>>> The idea of this sounds similar to what Anand has been working on with the readmirror patchset[1] which was originally designed to prefer reading from SSD devices in a RAID1 configuration but has evolved into allowing the read policy to be configured through sysfs, at least partly because detecting SSDs correctly is not an exact science. Also, there may be more considerations than just HDD or SSD: for example in my system I use a SATA SSD and an NVMe SSD in RAID1 where the NVMe device is twice the speed of the SSD.
>>> May be something like -o metadata_preferred_devices=device_id,[device_id,[device_id]]... ?
>>
>> I think that it should be a device property instead of a device list passed at mount time.
>> Looking at the btrfs_dev_item structure (which is embedded in the super block), there are several promising fields:
>> - seek_speed
>> - bandwidth
>> - dev_group
>>
>> currently these fields are unused.
>
> The first two of these at least were left over from the last
> attempt at this which got anywhere near code.
>
> If you're going to do this kind of thing, please read (at least) my
> (sadly code-less; sorry) proposal from a few years ago:
>
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg33632.html
>
> Essentially, allowing for multiple policies for chunk allocation,
> of which this case is a small subset.
>
> I'd envisaged putting the relevant config data into properties, but
> since we didn't actually have them in any sense at the time, I got
> bogged down in that part of it.
Ideas about properties:
https://github.com/kdave/drafts/blob/master/btrfs/properties.txt
>> Se we are free to use as we wish. For example, if we use 2 bit of dev_group as marker
>> for "not for metadata" and "not for date" we can have the following combination:
>> - 0 (default) -> the disk is suitable for both data and metadata
>> - "not for metadata" -> the disk has an high priority for "data"
>> - "not for data" -> the disk has an high priority for "metadata"
>> - "not for data" and "not for metadata" -> invalid
>>
>> As kernel<->user space interface, I like the idea to export these bits via sysfs.. unfortunately there is no a directory for the devices.... :-(
>>
>>
>>>>
>>>> I would therefore vote for configurability of this rather than always choosing SSD over HDD.
>>>>
>>>> [1] https://patchwork.kernel.org/project/linux-btrfs/list/?series=245121
>>>>
>>>
>>
>>
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH] btrfs: add ssd_metadata mode
2020-04-05 7:19 [RFC][PATCH v2] btrfs: ssd_metadata: storing metadata on SSD Goffredo Baroncelli
@ 2020-04-05 7:19 ` Goffredo Baroncelli
0 siblings, 0 replies; 23+ messages in thread
From: Goffredo Baroncelli @ 2020-04-05 7:19 UTC (permalink / raw)
To: linux-btrfs; +Cc: Michael, Hugo Mills, Martin Svec, Goffredo Baroncelli
From: Goffredo Baroncelli <kreijack@inwind.it>
When this mode is enabled, the chunk allocation policy is modified
as follows:
- allocation of a metadata chunk: priority is given to SSDs.
- allocation of a data chunk: priority is given to rotational disks.
When a striped profile is involved (like RAID0, RAID5 or RAID6), the
logic is a bit more complex. If there are enough disks, the data
profiles are stored on the rotational disks only, while the metadata
profiles are stored on the non-rotational disks only.
If there are not enough disks of one kind, the profile is stored
across all the disks.
Example: assume that sda, sdb, sdc are SSDs, and sde, sdf are
rotational disks.
A data profile raid6 will be stored on sda, sdb, sdc, sde, sdf (sde
and sdf alone are not enough to host a raid6 profile).
A metadata profile raid6 will be stored on sda, sdb, sdc (these
are enough to host a raid6 profile).
To enable this mode pass -o ssd_metadata at mount time.
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
fs/btrfs/ctree.h | 1 +
fs/btrfs/super.c | 8 +++++
fs/btrfs/volumes.c | 89 ++++++++++++++++++++++++++++++++++++++++++++--
fs/btrfs/volumes.h | 1 +
4 files changed, 97 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 36df977b64d9..c2e8b1b6eae7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1236,6 +1236,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
#define BTRFS_MOUNT_NOLOGREPLAY (1 << 27)
#define BTRFS_MOUNT_REF_VERIFY (1 << 28)
#define BTRFS_MOUNT_DISCARD_ASYNC (1 << 29)
+#define BTRFS_MOUNT_SSD_METADATA (1 << 30)
#define BTRFS_DEFAULT_COMMIT_INTERVAL (30)
#define BTRFS_DEFAULT_MAX_INLINE (2048)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 67c63858812a..4ad14b0a57b3 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -350,6 +350,7 @@ enum {
#ifdef CONFIG_BTRFS_FS_REF_VERIFY
Opt_ref_verify,
#endif
+ Opt_ssd_metadata,
Opt_err,
};
@@ -421,6 +422,7 @@ static const match_table_t tokens = {
#ifdef CONFIG_BTRFS_FS_REF_VERIFY
{Opt_ref_verify, "ref_verify"},
#endif
+ {Opt_ssd_metadata, "ssd_metadata"},
{Opt_err, NULL},
};
@@ -872,6 +874,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
btrfs_set_opt(info->mount_opt, REF_VERIFY);
break;
#endif
+ case Opt_ssd_metadata:
+ btrfs_set_and_info(info, SSD_METADATA,
+ "enabling ssd_metadata");
+ break;
case Opt_err:
btrfs_info(info, "unrecognized mount option '%s'", p);
ret = -EINVAL;
@@ -1390,6 +1396,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
#endif
if (btrfs_test_opt(info, REF_VERIFY))
seq_puts(seq, ",ref_verify");
+ if (btrfs_test_opt(info, SSD_METADATA))
+ seq_puts(seq, ",ssd_metadata");
seq_printf(seq, ",subvolid=%llu",
BTRFS_I(d_inode(dentry))->root->root_key.objectid);
seq_puts(seq, ",subvol=");
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9cfc668f91f4..ffb2bc912c43 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4761,6 +4761,58 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
return 0;
}
+/*
+ * sort the devices in descending order by rotational,
+ * max_avail, total_avail
+ */
+static int btrfs_cmp_device_info_metadata(const void *a, const void *b)
+{
+ const struct btrfs_device_info *di_a = a;
+ const struct btrfs_device_info *di_b = b;
+
+ /* metadata -> non rotational first */
+ if (!di_a->rotational && di_b->rotational)
+ return -1;
+ if (di_a->rotational && !di_b->rotational)
+ return 1;
+ if (di_a->max_avail > di_b->max_avail)
+ return -1;
+ if (di_a->max_avail < di_b->max_avail)
+ return 1;
+ if (di_a->total_avail > di_b->total_avail)
+ return -1;
+ if (di_a->total_avail < di_b->total_avail)
+ return 1;
+ return 0;
+}
+
+/*
+ * sort the devices in descending order by !rotational,
+ * max_avail, total_avail
+ */
+static int btrfs_cmp_device_info_data(const void *a, const void *b)
+{
+ const struct btrfs_device_info *di_a = a;
+ const struct btrfs_device_info *di_b = b;
+
+ /* data -> non rotational last */
+ if (!di_a->rotational && di_b->rotational)
+ return 1;
+ if (di_a->rotational && !di_b->rotational)
+ return -1;
+ if (di_a->max_avail > di_b->max_avail)
+ return -1;
+ if (di_a->max_avail < di_b->max_avail)
+ return 1;
+ if (di_a->total_avail > di_b->total_avail)
+ return -1;
+ if (di_a->total_avail < di_b->total_avail)
+ return 1;
+ return 0;
+}
+
+
+
static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
{
if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
@@ -4808,6 +4860,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
int i;
int j;
int index;
+ int nr_rotational;
BUG_ON(!alloc_profile_is_valid(type, 0));
@@ -4863,6 +4916,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
* about the available holes on each device.
*/
ndevs = 0;
+ nr_rotational = 0;
list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
u64 max_avail;
u64 dev_offset;
@@ -4914,14 +4968,45 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
devices_info[ndevs].max_avail = max_avail;
devices_info[ndevs].total_avail = total_avail;
devices_info[ndevs].dev = device;
+ devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_NONROT,
+ &(bdev_get_queue(device->bdev)->queue_flags));
+ if (devices_info[ndevs].rotational)
+ nr_rotational++;
++ndevs;
}
+ BUG_ON(nr_rotational > ndevs);
/*
* now sort the devices by hole size / available space
*/
- sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
- btrfs_cmp_device_info, NULL);
+ if (((type & BTRFS_BLOCK_GROUP_DATA) &&
+ (type & BTRFS_BLOCK_GROUP_METADATA)) ||
+ !btrfs_test_opt(info, SSD_METADATA)) {
+ /* mixed bg or SSD_METADATA not set */
+ sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+ btrfs_cmp_device_info, NULL);
+ } else {
+ /*
+ * if SSD_METADATA is set, sort the devices considering also their
+ * kind (ssd or not). Limit the available devices to the ones
+ * of the same kind, so that a striped profile like raid5
+ * does not spread across both kinds of devices (ssd and
+ * rotational). Different kinds of devices are used together
+ * only if the ones of the same kind are not enough on their own.
+ */
+ if (type & BTRFS_BLOCK_GROUP_DATA) {
+ sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+ btrfs_cmp_device_info_data, NULL);
+ if (nr_rotational >= devs_min)
+ ndevs = nr_rotational;
+ } else {
+ int nr_norot = ndevs - nr_rotational;
+ sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+ btrfs_cmp_device_info_metadata, NULL);
+ if (nr_norot >= devs_min)
+ ndevs = nr_norot;
+ }
+ }
/*
* Round down to number of usable stripes, devs_increment can be any
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index f01552a0785e..285d71d54a03 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -343,6 +343,7 @@ struct btrfs_device_info {
u64 dev_offset;
u64 max_avail;
u64 total_avail;
+ int rotational:1;
};
struct btrfs_raid_attr {
--
2.26.0
^ permalink raw reply related [flat|nested] 23+ messages in thread
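The two comparators added by the patch above decide the device ordering that the chunk allocator later truncates. The standalone sketch below (a simplified stand-in, not the kernel structures — `struct dev_info` and the field subset are illustrative) reproduces the metadata ordering: non-rotational devices first, then larger `max_avail` first, so SSDs with the biggest holes head the list.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for struct btrfs_device_info (illustrative). */
struct dev_info {
	unsigned long long max_avail;
	int rotational;
};

/*
 * Mirror of btrfs_cmp_device_info_metadata: sort non-rotational
 * devices first, then by descending max_avail.
 */
static int cmp_metadata(const void *a, const void *b)
{
	const struct dev_info *da = a;
	const struct dev_info *db = b;

	if (!da->rotational && db->rotational)
		return -1;
	if (da->rotational && !db->rotational)
		return 1;
	if (da->max_avail > db->max_avail)
		return -1;
	if (da->max_avail < db->max_avail)
		return 1;
	return 0;
}
```

Feeding a mixed array through `qsort()` with this comparator places every SSD ahead of every rotational disk regardless of free space, which is exactly why limiting `ndevs` afterwards keeps metadata on SSDs when enough of them exist.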
* [PATCH] btrfs: add ssd_metadata mode
2020-04-05 8:26 [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD Goffredo Baroncelli
@ 2020-04-05 8:26 ` Goffredo Baroncelli
2020-04-14 5:24 ` Paul Jones
2020-10-23 7:23 ` Wang Yugui
0 siblings, 2 replies; 23+ messages in thread
From: Goffredo Baroncelli @ 2020-04-05 8:26 UTC (permalink / raw)
To: linux-btrfs
Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui, Goffredo Baroncelli
From: Goffredo Baroncelli <kreijack@inwind.it>
When this mode is enabled, the chunk allocation policy is modified
as follows:
- allocation of a metadata chunk: priority is given to SSDs.
- allocation of a data chunk: priority is given to rotational disks.
When a striped profile is involved (like RAID0, RAID5 or RAID6), the
logic is a bit more complex. If there are enough disks, the data
profiles are stored on the rotational disks only, while the metadata
profiles are stored on the non-rotational disks only.
If there are not enough disks of one kind, the profile is stored
across all the disks.
Example: assume that sda, sdb, sdc are SSDs, and sde, sdf are
rotational disks.
A data profile raid6 will be stored on sda, sdb, sdc, sde, sdf (sde
and sdf alone are not enough to host a raid6 profile).
A metadata profile raid6 will be stored on sda, sdb, sdc (these
are enough to host a raid6 profile).
To enable this mode pass -o ssd_metadata at mount time.
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
fs/btrfs/ctree.h | 1 +
fs/btrfs/super.c | 8 +++++
fs/btrfs/volumes.c | 89 ++++++++++++++++++++++++++++++++++++++++++++--
fs/btrfs/volumes.h | 1 +
4 files changed, 97 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 36df977b64d9..773c7f8b0b0d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1236,6 +1236,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
#define BTRFS_MOUNT_NOLOGREPLAY (1 << 27)
#define BTRFS_MOUNT_REF_VERIFY (1 << 28)
#define BTRFS_MOUNT_DISCARD_ASYNC (1 << 29)
+#define BTRFS_MOUNT_SSD_METADATA (1 << 30)
#define BTRFS_DEFAULT_COMMIT_INTERVAL (30)
#define BTRFS_DEFAULT_MAX_INLINE (2048)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 67c63858812a..4ad14b0a57b3 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -350,6 +350,7 @@ enum {
#ifdef CONFIG_BTRFS_FS_REF_VERIFY
Opt_ref_verify,
#endif
+ Opt_ssd_metadata,
Opt_err,
};
@@ -421,6 +422,7 @@ static const match_table_t tokens = {
#ifdef CONFIG_BTRFS_FS_REF_VERIFY
{Opt_ref_verify, "ref_verify"},
#endif
+ {Opt_ssd_metadata, "ssd_metadata"},
{Opt_err, NULL},
};
@@ -872,6 +874,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
btrfs_set_opt(info->mount_opt, REF_VERIFY);
break;
#endif
+ case Opt_ssd_metadata:
+ btrfs_set_and_info(info, SSD_METADATA,
+ "enabling ssd_metadata");
+ break;
case Opt_err:
btrfs_info(info, "unrecognized mount option '%s'", p);
ret = -EINVAL;
@@ -1390,6 +1396,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
#endif
if (btrfs_test_opt(info, REF_VERIFY))
seq_puts(seq, ",ref_verify");
+ if (btrfs_test_opt(info, SSD_METADATA))
+ seq_puts(seq, ",ssd_metadata");
seq_printf(seq, ",subvolid=%llu",
BTRFS_I(d_inode(dentry))->root->root_key.objectid);
seq_puts(seq, ",subvol=");
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9cfc668f91f4..ffb2bc912c43 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4761,6 +4761,58 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
return 0;
}
+/*
+ * sort the devices in descending order by rotational,
+ * max_avail, total_avail
+ */
+static int btrfs_cmp_device_info_metadata(const void *a, const void *b)
+{
+ const struct btrfs_device_info *di_a = a;
+ const struct btrfs_device_info *di_b = b;
+
+ /* metadata -> non rotational first */
+ if (!di_a->rotational && di_b->rotational)
+ return -1;
+ if (di_a->rotational && !di_b->rotational)
+ return 1;
+ if (di_a->max_avail > di_b->max_avail)
+ return -1;
+ if (di_a->max_avail < di_b->max_avail)
+ return 1;
+ if (di_a->total_avail > di_b->total_avail)
+ return -1;
+ if (di_a->total_avail < di_b->total_avail)
+ return 1;
+ return 0;
+}
+
+/*
+ * sort the devices in descending order by !rotational,
+ * max_avail, total_avail
+ */
+static int btrfs_cmp_device_info_data(const void *a, const void *b)
+{
+ const struct btrfs_device_info *di_a = a;
+ const struct btrfs_device_info *di_b = b;
+
+ /* data -> non rotational last */
+ if (!di_a->rotational && di_b->rotational)
+ return 1;
+ if (di_a->rotational && !di_b->rotational)
+ return -1;
+ if (di_a->max_avail > di_b->max_avail)
+ return -1;
+ if (di_a->max_avail < di_b->max_avail)
+ return 1;
+ if (di_a->total_avail > di_b->total_avail)
+ return -1;
+ if (di_a->total_avail < di_b->total_avail)
+ return 1;
+ return 0;
+}
+
+
+
static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
{
if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
@@ -4808,6 +4860,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
int i;
int j;
int index;
+ int nr_rotational;
BUG_ON(!alloc_profile_is_valid(type, 0));
@@ -4863,6 +4916,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
* about the available holes on each device.
*/
ndevs = 0;
+ nr_rotational = 0;
list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
u64 max_avail;
u64 dev_offset;
@@ -4914,14 +4968,45 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
devices_info[ndevs].max_avail = max_avail;
devices_info[ndevs].total_avail = total_avail;
devices_info[ndevs].dev = device;
+ devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_NONROT,
+ &(bdev_get_queue(device->bdev)->queue_flags));
+ if (devices_info[ndevs].rotational)
+ nr_rotational++;
++ndevs;
}
+ BUG_ON(nr_rotational > ndevs);
/*
* now sort the devices by hole size / available space
*/
- sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
- btrfs_cmp_device_info, NULL);
+ if (((type & BTRFS_BLOCK_GROUP_DATA) &&
+ (type & BTRFS_BLOCK_GROUP_METADATA)) ||
+ !btrfs_test_opt(info, SSD_METADATA)) {
+ /* mixed bg or SSD_METADATA not set */
+ sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+ btrfs_cmp_device_info, NULL);
+ } else {
+ /*
+ * if SSD_METADATA is set, sort the devices considering also their
+ * kind (ssd or not). Limit the available devices to the ones
+ * of the same kind, so that a striped profile like raid5
+ * does not spread across both kinds of devices (ssd and
+ * rotational). Different kinds of devices are used together
+ * only if the ones of the same kind are not enough on their own.
+ */
+ if (type & BTRFS_BLOCK_GROUP_DATA) {
+ sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+ btrfs_cmp_device_info_data, NULL);
+ if (nr_rotational >= devs_min)
+ ndevs = nr_rotational;
+ } else {
+ int nr_norot = ndevs - nr_rotational;
+ sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+ btrfs_cmp_device_info_metadata, NULL);
+ if (nr_norot >= devs_min)
+ ndevs = nr_norot;
+ }
+ }
/*
* Round down to number of usable stripes, devs_increment can be any
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index f01552a0785e..285d71d54a03 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -343,6 +343,7 @@ struct btrfs_device_info {
u64 dev_offset;
u64 max_avail;
u64 total_avail;
+ int rotational:1;
};
struct btrfs_raid_attr {
--
2.26.0
^ permalink raw reply related [flat|nested] 23+ messages in thread
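One portability nit in the patch above: the new field is declared as `int rotational:1;`, a plain-int bitfield whose signedness is implementation-defined in C. On compilers where it is signed (e.g. GCC on Linux), assigning 1 stores -1. The patch only ever tests the field for truthiness, so it still behaves correctly, but `unsigned int rotational:1;` avoids the surprise. The sketch below is a standalone illustration of the safer declaration, not kernel code:

```c
#include <assert.h>

/*
 * An unsigned one-bit field stores exactly 0 or 1. A plain `int r:1;`
 * may store -1 when assigned 1, because the signedness of plain-int
 * bitfields is implementation-defined (C11 6.7.2.1).
 */
struct flags {
	unsigned int rotational:1;
};

/* Always returns exactly 0 or 1 with the unsigned field. */
static int is_rotational(const struct flags *f)
{
	return f->rotational;
}
```

With the signed variant, a comparison like `f->rotational == 1` could silently be false even after `f->rotational = 1;`, which is why kernel code generally prefers `unsigned` (or `bool`) for flag bitfields.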
* RE: [PATCH] btrfs: add ssd_metadata mode
2020-04-05 8:26 ` [PATCH] btrfs: add ssd_metadata mode Goffredo Baroncelli
@ 2020-04-14 5:24 ` Paul Jones
2020-10-23 7:23 ` Wang Yugui
1 sibling, 0 replies; 23+ messages in thread
From: Paul Jones @ 2020-04-14 5:24 UTC (permalink / raw)
To: Goffredo Baroncelli, linux-btrfs@vger.kernel.org
> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
> owner@vger.kernel.org> On Behalf Of Goffredo Baroncelli
> Sent: Sunday, 5 April 2020 6:27 PM
> To: linux-btrfs@vger.kernel.org
> Cc: Michael <mclaud@roznica.com.ua>; Hugo Mills <hugo@carfax.org.uk>;
> Martin Svec <martin.svec@zoner.cz>; Wang Yugui <wangyugui@e16-
> tech.com>; Goffredo Baroncelli <kreijack@inwind.it>
> Subject: [PATCH] btrfs: add ssd_metadata mode
>
> From: Goffredo Baroncelli <kreijack@inwind.it>
>
> When this mode is enabled, the allocation policy of the chunk is so modified:
> - allocation of metadata chunk: priority is given to ssd disk.
> - allocation of data chunk: priority is given to a rotational disk.
>
> When a striped profile is involved (like RAID0,5,6), the logic is a bit more
> complex. If there are enough disks, the data profiles are stored on the
> rotational disks only; instead the metadata profiles are stored on the non
> rotational disk only.
> If the disks are not enough, then the profiles is stored on all the disks.
>
> Example: assuming that sda, sdb, sdc are ssd disks, and sde, sdf are rotational
> ones.
> A data profile raid6 will be stored on sda, sdb, sdc, sde, sdf (sde and sdf are
> not enough to host a raid6 profile).
> A metadata profile raid6, will be stored on sda, sdb, sdc (these are enough to
> host a raid6 profile).
>
> To enable this mode pass -o ssd_metadata at mount time.
>
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
Tested-By: Paul Jones <paul@pauljones.id.au>
Tested using RAID1. It makes a surprising difference in speed, nearly as fast as when I was using dm-cache.
Paul.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] btrfs: add ssd_metadata mode
2020-04-05 8:26 ` [PATCH] btrfs: add ssd_metadata mode Goffredo Baroncelli
2020-04-14 5:24 ` Paul Jones
@ 2020-10-23 7:23 ` Wang Yugui
2020-10-23 10:11 ` Adam Borowski
1 sibling, 1 reply; 23+ messages in thread
From: Wang Yugui @ 2020-10-23 7:23 UTC (permalink / raw)
To: Goffredo Baroncelli
Cc: linux-btrfs, Michael, Hugo Mills, Martin Svec,
Goffredo Baroncelli
Hi, Goffredo Baroncelli
We could move 'rotational' of struct btrfs_device_info into 'bool rotating'
of struct btrfs_device:
1. It would be closer to the existing 'bool rotating' of struct btrfs_fs_devices.
2. It could also be used to enhance '[PATCH] btrfs: balance RAID1/RAID10 mirror selection':
https://lore.kernel.org/linux-btrfs/3bddd73e-cb60-b716-4e98-61ff24beb570@oracle.com/T/#t
Best Regards
Wang Yugui
2020/10/23
> From: Goffredo Baroncelli <kreijack@inwind.it>
>
> When this mode is enabled, the allocation policy of the chunk
> is so modified:
> - allocation of metadata chunk: priority is given to ssd disk.
> - allocation of data chunk: priority is given to a rotational disk.
>
> When a striped profile is involved (like RAID0,5,6), the logic
> is a bit more complex. If there are enough disks, the data profiles
> are stored on the rotational disks only; instead the metadata profiles
> are stored on the non rotational disk only.
> If the disks are not enough, then the profiles is stored on all
> the disks.
>
> Example: assuming that sda, sdb, sdc are ssd disks, and sde, sdf are
> rotational ones.
> A data profile raid6 will be stored on sda, sdb, sdc, sde, sdf (sde
> and sdf are not enough to host a raid6 profile).
> A metadata profile raid6, will be stored on sda, sdb, sdc (these
> are enough to host a raid6 profile).
>
> To enable this mode pass -o ssd_metadata at mount time.
>
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> ---
> fs/btrfs/ctree.h | 1 +
> fs/btrfs/super.c | 8 +++++
> fs/btrfs/volumes.c | 89 ++++++++++++++++++++++++++++++++++++++++++++--
> fs/btrfs/volumes.h | 1 +
> 4 files changed, 97 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 36df977b64d9..773c7f8b0b0d 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1236,6 +1236,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
> #define BTRFS_MOUNT_NOLOGREPLAY (1 << 27)
> #define BTRFS_MOUNT_REF_VERIFY (1 << 28)
> #define BTRFS_MOUNT_DISCARD_ASYNC (1 << 29)
> +#define BTRFS_MOUNT_SSD_METADATA (1 << 30)
>
> #define BTRFS_DEFAULT_COMMIT_INTERVAL (30)
> #define BTRFS_DEFAULT_MAX_INLINE (2048)
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 67c63858812a..4ad14b0a57b3 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -350,6 +350,7 @@ enum {
> #ifdef CONFIG_BTRFS_FS_REF_VERIFY
> Opt_ref_verify,
> #endif
> + Opt_ssd_metadata,
> Opt_err,
> };
>
> @@ -421,6 +422,7 @@ static const match_table_t tokens = {
> #ifdef CONFIG_BTRFS_FS_REF_VERIFY
> {Opt_ref_verify, "ref_verify"},
> #endif
> + {Opt_ssd_metadata, "ssd_metadata"},
> {Opt_err, NULL},
> };
>
> @@ -872,6 +874,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
> btrfs_set_opt(info->mount_opt, REF_VERIFY);
> break;
> #endif
> + case Opt_ssd_metadata:
> + btrfs_set_and_info(info, SSD_METADATA,
> + "enabling ssd_metadata");
> + break;
> case Opt_err:
> btrfs_info(info, "unrecognized mount option '%s'", p);
> ret = -EINVAL;
> @@ -1390,6 +1396,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
> #endif
> if (btrfs_test_opt(info, REF_VERIFY))
> seq_puts(seq, ",ref_verify");
> + if (btrfs_test_opt(info, SSD_METADATA))
> + seq_puts(seq, ",ssd_metadata");
> seq_printf(seq, ",subvolid=%llu",
> BTRFS_I(d_inode(dentry))->root->root_key.objectid);
> seq_puts(seq, ",subvol=");
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9cfc668f91f4..ffb2bc912c43 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4761,6 +4761,58 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
> return 0;
> }
>
> +/*
> + * sort the devices in descending order by !rotational (ssd first),
> + * then max_avail, total_avail
> + */
> +static int btrfs_cmp_device_info_metadata(const void *a, const void *b)
> +{
> + const struct btrfs_device_info *di_a = a;
> + const struct btrfs_device_info *di_b = b;
> +
> + /* metadata -> non rotational first */
> + if (!di_a->rotational && di_b->rotational)
> + return -1;
> + if (di_a->rotational && !di_b->rotational)
> + return 1;
> + if (di_a->max_avail > di_b->max_avail)
> + return -1;
> + if (di_a->max_avail < di_b->max_avail)
> + return 1;
> + if (di_a->total_avail > di_b->total_avail)
> + return -1;
> + if (di_a->total_avail < di_b->total_avail)
> + return 1;
> + return 0;
> +}
> +
> +/*
> + * sort the devices in descending order by rotational (ssd last),
> + * then max_avail, total_avail
> + */
> +static int btrfs_cmp_device_info_data(const void *a, const void *b)
> +{
> + const struct btrfs_device_info *di_a = a;
> + const struct btrfs_device_info *di_b = b;
> +
> + /* data -> non rotational last */
> + if (!di_a->rotational && di_b->rotational)
> + return 1;
> + if (di_a->rotational && !di_b->rotational)
> + return -1;
> + if (di_a->max_avail > di_b->max_avail)
> + return -1;
> + if (di_a->max_avail < di_b->max_avail)
> + return 1;
> + if (di_a->total_avail > di_b->total_avail)
> + return -1;
> + if (di_a->total_avail < di_b->total_avail)
> + return 1;
> + return 0;
> +}
> +
> +
> +
> static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
> {
> if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
> @@ -4808,6 +4860,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> int i;
> int j;
> int index;
> + int nr_rotational;
>
> BUG_ON(!alloc_profile_is_valid(type, 0));
>
> @@ -4863,6 +4916,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> * about the available holes on each device.
> */
> ndevs = 0;
> + nr_rotational = 0;
> list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
> u64 max_avail;
> u64 dev_offset;
> @@ -4914,14 +4968,45 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
> devices_info[ndevs].max_avail = max_avail;
> devices_info[ndevs].total_avail = total_avail;
> devices_info[ndevs].dev = device;
> + devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_NONROT,
> + &(bdev_get_queue(device->bdev)->queue_flags));
> + if (devices_info[ndevs].rotational)
> + nr_rotational++;
> ++ndevs;
> }
>
> + BUG_ON(nr_rotational > ndevs);
> /*
> * now sort the devices by hole size / available space
> */
> - sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> - btrfs_cmp_device_info, NULL);
> + if (((type & BTRFS_BLOCK_GROUP_DATA) &&
> + (type & BTRFS_BLOCK_GROUP_METADATA)) ||
> + !btrfs_test_opt(info, SSD_METADATA)) {
> + /* mixed bg or SSD_METADATA not set */
> + sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> + btrfs_cmp_device_info, NULL);
> + } else {
> + /*
> + * If SSD_METADATA is set, sort the devices considering also their
> + * kind (ssd or not). Limit the available devices to the ones of
> + * the same kind, to avoid that a striped profile like raid5
> + * spreads over both kinds of devices (ssd and rotational).
> + * Different kinds of devices may still be mixed if the ones of
> + * the same kind alone are not enough.
> + */
> + if (type & BTRFS_BLOCK_GROUP_DATA) {
> + sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> + btrfs_cmp_device_info_data, NULL);
> + if (nr_rotational >= devs_min)
> + ndevs = nr_rotational;
> + } else {
> + int nr_norot = ndevs - nr_rotational;
> + sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> + btrfs_cmp_device_info_metadata, NULL);
> + if (nr_norot >= devs_min)
> + ndevs = nr_norot;
> + }
> + }
>
> /*
> * Round down to number of usable stripes, devs_increment can be any
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index f01552a0785e..285d71d54a03 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -343,6 +343,7 @@ struct btrfs_device_info {
> u64 dev_offset;
> u64 max_avail;
> u64 total_avail;
> + int rotational:1;
> };
>
> struct btrfs_raid_attr {
> --
> 2.26.0
--------------------------------------
Beijing Jinggai Technology Co., Ltd.
Wang Yugui wangyugui@e16-tech.com
Tel: +86-136-71123776
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] btrfs: add ssd_metadata mode
2020-10-23 7:23 ` Wang Yugui
@ 2020-10-23 10:11 ` Adam Borowski
2020-10-23 11:25 ` Qu Wenruo
0 siblings, 1 reply; 23+ messages in thread
From: Adam Borowski @ 2020-10-23 10:11 UTC (permalink / raw)
To: Wang Yugui
Cc: Goffredo Baroncelli, linux-btrfs, Michael, Hugo Mills,
Martin Svec, Goffredo Baroncelli
On Fri, Oct 23, 2020 at 03:23:30PM +0800, Wang Yugui wrote:
> Hi, Goffredo Baroncelli
>
> We can move 'rotational' of struct btrfs_device_info to 'bool rotating'
> of struct btrfs_device.
>
> 1, it will be closer to 'bool rotating' of struct btrfs_fs_devices.
>
> 2, it may be used to enhance the patch '[PATCH] btrfs: balance RAID1/RAID10 mirror selection'.
> https://lore.kernel.org/linux-btrfs/3bddd73e-cb60-b716-4e98-61ff24beb570@oracle.com/T/#t
I don't think it should be a bool -- or at least, it should be turned
into a bool only late in the processing.
There are many storage tiers; rotational applies only to one of the
coldest. In my use case, at least, I've added the following patchlet:
- devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_NONROT,
+ devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_DAX,
Or, you may want Optane NVMe vs legacy (ie, NAND) NVMe.
The tiers look like:
* DIMM-connected Optane (dax=1)
* NVMe-connected Optane
* NVMe-connected flash
* SATA-connected flash
* SATA-connected spinning rust (rotational=1)
* IDE-connected spinning rust (rotational=1)
* SD cards
* floppies?
And even that is just for local storage only.
Thus, please don't hardcode the notion of "rotational"; what we want is
"faster but smaller" vs "slower but bigger".
> > From: Goffredo Baroncelli <kreijack@inwind.it>
> >
> > When this mode is enabled, the chunk allocation policy is modified as
> > follows:
> > - allocation of a metadata chunk: priority is given to ssd disks.
> > - allocation of a data chunk: priority is given to rotational disks.
> >
> > When a striped profile is involved (like RAID0/5/6), the logic is a
> > bit more complex. If there are enough disks, the data chunks are
> > stored on the rotational disks only; the metadata chunks are instead
> > stored on the non-rotational disks only.
> > If there are not enough disks of one kind, the chunk is stored on
> > all the disks.
And, a newer version of Goffredo's patchset already had
"preferred_metadata". It did not assign the preference automatically,
but if we want good defaults, they should be smarter than just rotationality.
Meow!
--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Imagine there are bandits in your house, your kid is bleeding out,
⢿⡄⠘⠷⠚⠋⠀ the house is on fire, and seven giant trumpets are playing in the
⠈⠳⣄⠀⠀⠀⠀ sky. Your cat demands food. The priority should be obvious...
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] btrfs: add ssd_metadata mode
2020-10-23 10:11 ` Adam Borowski
@ 2020-10-23 11:25 ` Qu Wenruo
2020-10-23 12:37 ` Wang Yugui
0 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2020-10-23 11:25 UTC (permalink / raw)
To: Adam Borowski, Wang Yugui
Cc: Goffredo Baroncelli, linux-btrfs, Michael, Hugo Mills,
Martin Svec, Goffredo Baroncelli
On 2020/10/23 6:11 PM, Adam Borowski wrote:
> On Fri, Oct 23, 2020 at 03:23:30PM +0800, Wang Yugui wrote:
>> Hi, Goffredo Baroncelli
>>
>> We can move 'rotational' of struct btrfs_device_info to 'bool rotating'
>> of struct btrfs_device.
>>
>> 1, it will be closer to 'bool rotating' of struct btrfs_fs_devices.
>>
>> 2, it may be used to enhance the patch '[PATCH] btrfs: balance RAID1/RAID10 mirror selection'.
>> https://lore.kernel.org/linux-btrfs/3bddd73e-cb60-b716-4e98-61ff24beb570@oracle.com/T/#t
>
> I don't think it should be a bool -- or at least, turned into a bool
> late in the processing.
>
> There are many storage tiers; rotational applies only to one of the
> coldest. In my use case, at least, I've added the following patchlet:
>
> - devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_NONROT,
> + devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_DAX,
>
> Or, you may want Optane NVMe vs legacy (ie, NAND) NVMe.
A little off topic here: btrfs in fact has better ways to model
storage, and they are definitely not simply rotational or not.
In btrfs_dev_item, we have bandwidth and seek_speed to describe the
device characteristics, although they're never really utilized.
So if we're really going to dig down the rabbit hole, we need more
characteristics to describe a device:
from basic bandwidth for large-block-size IO, to things like IOPS for
small random IO, and even possibly multi-level performance
characteristics for cases like the multi-level caches used in current
NVMe ssds, and future SMR + CMR mixed devices.
Although computers are binary, performance characteristics never
are. :)
Thanks,
Qu
>
> The tiers look like:
> * DIMM-connected Optane (dax=1)
> * NVMe-connected Optane
> * NVMe-connected flash
> * SATA-connected flash
> * SATA-connected spinning rust (rotational=1)
> * IDE-connected spinning rust (rotational=1)
> * SD cards
> * floppies?
>
> And even that is just for local storage only.
>
> Thus, please don't hardcode the notion of "rotational", what we want is
> "faster but smaller" vs "slower but bigger".
>
>>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>>
>>> When this mode is enabled, the chunk allocation policy is modified as
>>> follows:
>>> - allocation of a metadata chunk: priority is given to ssd disks.
>>> - allocation of a data chunk: priority is given to rotational disks.
>>>
>>> When a striped profile is involved (like RAID0/5/6), the logic is a
>>> bit more complex. If there are enough disks, the data chunks are
>>> stored on the rotational disks only; the metadata chunks are instead
>>> stored on the non-rotational disks only.
>>> If there are not enough disks of one kind, the chunk is stored on
>>> all the disks.
>
> And, a newer version of Goffredo's patchset already had
> "preferred_metadata". It did not assign the preference automatically,
> but if we want good defaults, they should be smarter than just rotationality.
>
>
> Meow!
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] btrfs: add ssd_metadata mode
2020-10-23 11:25 ` Qu Wenruo
@ 2020-10-23 12:37 ` Wang Yugui
2020-10-23 12:45 ` Qu Wenruo
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Wang Yugui @ 2020-10-23 12:37 UTC (permalink / raw)
To: Qu Wenruo
Cc: Adam Borowski, Goffredo Baroncelli, linux-btrfs, Michael,
Hugo Mills, Martin Svec, Goffredo Baroncelli
Hi,
Can we add the feature of 'Storage Tiering' to btrfs for these use cases?
1) use faster tier firstly for metadata
2) only the subvol with higher tier can save data to
the higher tier disk?
3) use faster tier firstly for mirror selection of RAID1/RAID10
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2020/10/23
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] btrfs: add ssd_metadata mode
2020-10-23 12:37 ` Wang Yugui
@ 2020-10-23 12:45 ` Qu Wenruo
2020-10-23 13:10 ` Steven Davies
2020-10-23 18:03 ` Goffredo Baroncelli
2 siblings, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2020-10-23 12:45 UTC (permalink / raw)
To: Wang Yugui
Cc: Adam Borowski, Goffredo Baroncelli, linux-btrfs, Michael,
Hugo Mills, Martin Svec, Goffredo Baroncelli
On 2020/10/23 8:37 PM, Wang Yugui wrote:
> Hi,
>
> Can we add the feature of 'Storage Tiering' to btrfs for these use cases?
Feel free to contribute.
It seems pretty hard already, though: too many factors are involved,
from the extent allocator to the chunk allocator.
Just consider how complex things are for bcache; this won't be a simple
feature at all.
Thanks,
Qu
>
> 1) use faster tier firstly for metadata
>
> 2) only the subvol with higher tier can save data to
> the higher tier disk?
>
> 3) use faster tier firstly for mirror selection of RAID1/RAID10
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2020/10/23
>
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] btrfs: add ssd_metadata mode
2020-10-23 12:37 ` Wang Yugui
2020-10-23 12:45 ` Qu Wenruo
@ 2020-10-23 13:10 ` Steven Davies
2020-10-23 13:49 ` Wang Yugui
2020-10-23 18:03 ` Goffredo Baroncelli
2 siblings, 1 reply; 23+ messages in thread
From: Steven Davies @ 2020-10-23 13:10 UTC (permalink / raw)
To: Wang Yugui
Cc: Qu Wenruo, Adam Borowski, Goffredo Baroncelli, linux-btrfs,
Michael, Hugo Mills, Martin Svec, Goffredo Baroncelli
On 2020-10-23 13:37, Wang Yugui wrote:
> Hi,
>
> Can we add the feature of 'Storage Tiering' to btrfs for these use
> cases?
>
> 1) use faster tier firstly for metadata
>
> 2) only the subvol with higher tier can save data to
> the higher tier disk?
>
> 3) use faster tier firstly for mirror selection of RAID1/RAID10
I'd support user-configurable tiering by specifying which device IDs are
allowed to be used for
a) storing metadata
b) reading data from RAID1/RAID10
That would fit into both this patch and Anand's read policy patchset. It
could be a mount option, sysfs tunable and/or btrfs-device command.
e.g. for sysfs
/sys/fs/btrfs/6e2797f3-d0ab-4aa1-b262-c2395fd0626e/devices/sdb2/prio_metadata_store
[0..15]
/sys/fs/btrfs/6e2797f3-d0ab-4aa1-b262-c2395fd0626e/devices/sdb2/prio_metadata_read
[0..15]
/sys/fs/btrfs/6e2797f3-d0ab-4aa1-b262-c2395fd0626e/devices/sdb2/prio_data_read
[0..15]
Getting the user to specify the devices' priorities would be more
reliable than looking at the rotational attribute.
--
Steven Davies
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] btrfs: add ssd_metadata mode
2020-10-23 13:10 ` Steven Davies
@ 2020-10-23 13:49 ` Wang Yugui
0 siblings, 0 replies; 23+ messages in thread
From: Wang Yugui @ 2020-10-23 13:49 UTC (permalink / raw)
To: Steven Davies
Cc: Qu Wenruo, Adam Borowski, Goffredo Baroncelli, linux-btrfs,
Michael, Hugo Mills, Martin Svec, Goffredo Baroncelli
Hi,
We could define an 'int tier' for the different device classes, such as
* dax=1
* NVMe
* SSD
* rotational=1
but in most cases just two of these tiers are used at the same time,
so we can treat them as just two tiers (faster tier, slower tier).
In phase 1, we support:
1) use the faster tier first for metadata
3) use the faster tier first for mirror selection of RAID1/RAID10
and in phase 2 (TODO), we support:
2) let some subvolumes save data to the faster-tier disks
For metadata, the policy is:
1) faster-tier-first (default)
2) faster-tier-only
For data, the policy is:
4) slower-tier-first (default)
3) slower-tier-only
1) faster-tier-first (phase 2, TODO)
2) faster-tier-only (phase 2, TODO)
Now we are using a metadata profile (RAID type) and a data profile
(RAID type); in phase 2, we rename them to 'faster tier profile' and
'slower tier profile'.
The key point is that in most cases just two tiers are used at the
same time, so we can support the following cases in phase 1 without
user configuration:
1) use the faster tier first for metadata
3) use the faster tier first for mirror selection of RAID1/RAID10
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2020/10/23
> On 2020-10-23 13:37, Wang Yugui wrote:
> > Hi,
> >
> > Can we add the feature of 'Storage Tiering' to btrfs for these use
> > cases?
> >
> > 1) use faster tier firstly for metadata
> >
> > 2) only the subvol with higher tier can save data to
> > the higher tier disk?
> >
> > 3) use faster tier firstly for mirror selection of RAID1/RAID10
>
> I'd support user-configurable tiering by specifying which device IDs are allowed to be used for
>
> a) storing metadata
> b) reading data from RAID1/RAID10
>
> that would fit into both this patch and Anand's read policy patchset. It could be a mount option, sysfs tunable and/or btrfs-device command.
>
> e.g. for sysfs
> /sys/fs/btrfs/6e2797f3-d0ab-4aa1-b262-c2395fd0626e/devices/sdb2/prio_metadata_store [0..15]
> /sys/fs/btrfs/6e2797f3-d0ab-4aa1-b262-c2395fd0626e/devices/sdb2/prio_metadata_read [0..15]
> /sys/fs/btrfs/6e2797f3-d0ab-4aa1-b262-c2395fd0626e/devices/sdb2/prio_data_read [0..15]
>
> Getting the user to specify the devices' priorities would be more reliable than looking at the rotational attribute.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] btrfs: add ssd_metadata mode
2020-10-23 12:37 ` Wang Yugui
2020-10-23 12:45 ` Qu Wenruo
2020-10-23 13:10 ` Steven Davies
@ 2020-10-23 18:03 ` Goffredo Baroncelli
2020-10-24 3:26 ` Paul Jones
2 siblings, 1 reply; 23+ messages in thread
From: Goffredo Baroncelli @ 2020-10-23 18:03 UTC (permalink / raw)
To: Wang Yugui, Qu Wenruo
Cc: Adam Borowski, linux-btrfs, Michael, Hugo Mills, Martin Svec,
Goffredo Baroncelli
On 10/23/20 2:37 PM, Wang Yugui wrote:
> Hi,
>
> Can we add the feature of 'Storage Tiering' to btrfs for these use cases?
>
> 1) use faster tier firstly for metadata
My tests revealed that a BTRFS filesystem stacked over bcache has better performance, so I am not sure that putting the metadata on dedicated storage is a good thing.
>
> 2) only the subvol with higher tier can save data to
> the higher tier disk?
>
> 3) use faster tier firstly for mirror selection of RAID1/RAID10
If you want to put a subvolume in a "higher tier", it is simpler to use two filesystems: one in the "higher tier" and one in the "slower" one.
Also, what should be the semantics of cp --reflink between subvolumes of different tiers? From a technical point of view it is well defined, but users' expectations can vary...
The same is true about assigning different raid profiles to different subvolumes...
Let's keep things simple.
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2020/10/23
>
BR
GB
>
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: [PATCH] btrfs: add ssd_metadata mode
2020-10-23 18:03 ` Goffredo Baroncelli
@ 2020-10-24 3:26 ` Paul Jones
0 siblings, 0 replies; 23+ messages in thread
From: Paul Jones @ 2020-10-24 3:26 UTC (permalink / raw)
To: kreijack@inwind.it, Wang Yugui, Qu Wenruo
Cc: Adam Borowski, linux-btrfs@vger.kernel.org, Michael, Hugo Mills,
Martin Svec
> -----Original Message-----
> From: Goffredo Baroncelli <kreijack@libero.it>
> Sent: Saturday, 24 October 2020 5:04 AM
> To: Wang Yugui <wangyugui@e16-tech.com>; Qu Wenruo
> <quwenruo.btrfs@gmx.com>
> Cc: Adam Borowski <kilobyte@angband.pl>; linux-btrfs@vger.kernel.org;
> Michael <mclaud@roznica.com.ua>; Hugo Mills <hugo@carfax.org.uk>;
> Martin Svec <martin.svec@zoner.cz>; Goffredo Baroncelli
> <kreijack@inwind.it>
> Subject: Re: [PATCH] btrfs: add ssd_metadata mode
>
> On 10/23/20 2:37 PM, Wang Yugui wrote:
> > Hi,
> >
> > Can we add the feature of 'Storage Tiering' to btrfs for these use cases?
> >
> > 1) use faster tier firstly for metadata
>
> My tests revealed that a BTRFS filesystem stacked over bcache has better
> performance, so I am not sure that putting the metadata on dedicated
> storage is a good thing.
There is a balance between ultimate speed and simplicity. I used to use dm-cache under btrfs, which worked very well but was complicated and error-prone to set up, fragile, and slow to restore after an error. Now I'm using your ssd_metadata patch, which is almost as fast but far more robust and quick/easy to restore after errors. It's night-and-day better imho, especially for a production system.
Paul.
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2020-10-24 3:38 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-01 20:03 [RFC] btrfs: ssd_metadata: storing metadata on SSD Goffredo Baroncelli
2020-04-01 20:03 ` [PATCH] btrfs: add ssd_metadata mode Goffredo Baroncelli
2020-04-01 20:53 ` Goffredo Baroncelli
2020-04-02 9:33 ` Steven Davies
2020-04-02 16:39 ` Goffredo Baroncelli
2020-04-03 8:43 ` Michael
2020-04-03 10:08 ` Steven Davies
2020-04-03 16:19 ` Goffredo Baroncelli
2020-04-03 16:28 ` Hugo Mills
2020-04-03 16:36 ` Hans van Kranenburg
2020-04-02 18:01 ` Martin Svec
-- strict thread matches above, loose matches on Subject: below --
2020-04-05 7:19 [RFC][PATCH v2] btrfs: ssd_metadata: storing metadata on SSD Goffredo Baroncelli
2020-04-05 7:19 ` [PATCH] btrfs: add ssd_metadata mode Goffredo Baroncelli
2020-04-05 8:26 [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD Goffredo Baroncelli
2020-04-05 8:26 ` [PATCH] btrfs: add ssd_metadata mode Goffredo Baroncelli
2020-04-14 5:24 ` Paul Jones
2020-10-23 7:23 ` Wang Yugui
2020-10-23 10:11 ` Adam Borowski
2020-10-23 11:25 ` Qu Wenruo
2020-10-23 12:37 ` Wang Yugui
2020-10-23 12:45 ` Qu Wenruo
2020-10-23 13:10 ` Steven Davies
2020-10-23 13:49 ` Wang Yugui
2020-10-23 18:03 ` Goffredo Baroncelli
2020-10-24 3:26 ` Paul Jones