Using Intel Optane to accelerate a BTRFS array? (equivalent of ZLOG/SIL for ZFS?)

* Using Intel Optane to accelerate a BTRFS array? (equivalent of ZLOG/SIL for ZFS?)
@ 2020-03-29 22:30 Victor Hooi
  2020-03-30  5:46 ` Andrei Borzenkov
  0 siblings, 1 reply; 11+ messages in thread
From: Victor Hooi @ 2020-03-29 22:30 UTC (permalink / raw)
  To: linux-btrfs

Hi,

I have a small 12-bay SuperMicro server I'm using as a local NAS, with
FreeNAS/ZFS.

Each drive is a 12TB HDD.

I'm in the process of moving it to Linux - and I thought this might be
a good chance to try out BTRFS again =).

(I'd previously tried BTRFS many years ago, and hit some issues -
it's possible this may have been made worse by my inexperience with
BTRFS at the time - e.g.
https://www.spinics.net/lists/linux-btrfs/msg04240.html)

Anyhow - currently the server has a 750GB Intel Optane drive, that
we're using as a ZLOG/SIL drive:

https://www.ixsystems.com/community/threads/how-best-to-use-960gb-optane-in-freenas-build.75798/#post-527264

My question is - what's the equivalent in BTRFS-land?

Or what is the best way to use an ultra-fast Intel Optane drive to
accelerate reads/writes on a BTRFS array?

Thanks,
Victor

* Re: Using Intel Optane to accelerate a BTRFS array? (equivalent of ZLOG/SIL for ZFS?)
  2020-03-29 22:30 Using Intel Optane to accelerate a BTRFS array? (equivalent of ZLOG/SIL for ZFS?) Victor Hooi
@ 2020-03-30  5:46 ` Andrei Borzenkov
  2020-03-30  6:00   ` Paul Jones
  0 siblings, 1 reply; 11+ messages in thread
From: Andrei Borzenkov @ 2020-03-30  5:46 UTC (permalink / raw)
  To: Victor Hooi, linux-btrfs

30.03.2020 01:30, Victor Hooi wrote:
> Hi,
> 
> I have a small 12-bay SuperMicro server I'm using as a local NAS, with
> FreeNAS/ZFS.
> 
> Each drive is a 12TB HDD.
> 
> I'm in the process of moving it to Linux - and I thought this might be
> a good chance to try out BTRFS again =).
> 
> (I'd previously tried BTRFS many years ago, and hit some issues -
> it's possible this may have been made worse by my inexperience with
> BTRFS at the time - e.g.
> https://www.spinics.net/lists/linux-btrfs/msg04240.html)
> 
> Anyhow - currently the server has a 750GB Intel Optane drive, that
> we're using as a ZLOG/SIL drive:
> 

Do you mean ZIL/SLOG? ZIL == ZFS Intent Log, SLOG == Separate Log (a
dedicated log device).

> https://www.ixsystems.com/community/threads/how-best-to-use-960gb-optane-in-freenas-build.75798/#post-527264
> 
> My question is - what's the equivalent in BTRFS-land?
> 

There is no equivalent at the btrfs level. I guess running btrfs on top of
bcache may achieve a similar effect.

> Or what is the best way to use an ultra-fast Intel Optane drive to
> accelerate reads/writes on a BTRFS array?
> 

ZIL is a *write* intent log; it does not directly accelerate reads. ZFS
supports an SSD as a second-level read cache (L2ARC), but as far as I
remember it is physically separate from the ZIL.

* RE: Using Intel Optane to accelerate a BTRFS array? (equivalent of ZLOG/SIL for ZFS?)
  2020-03-30  5:46 ` Andrei Borzenkov
@ 2020-03-30  6:00   ` Paul Jones
  2020-03-31 17:01     ` Eli V
  0 siblings, 1 reply; 11+ messages in thread
From: Paul Jones @ 2020-03-30  6:00 UTC (permalink / raw)
  To: Andrei Borzenkov, Victor Hooi, linux-btrfs

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
> owner@vger.kernel.org> On Behalf Of Andrei Borzenkov
> Sent: Monday, 30 March 2020 4:46 PM
> To: Victor Hooi <victorhooi@gmail.com>; linux-btrfs <linux-
> btrfs@vger.kernel.org>
> Subject: Re: Using Intel Optane to accelerate a BTRFS array? (equivalent of
> ZLOG/SIL for ZFS?)
> 
> 30.03.2020 01:30, Victor Hooi wrote:
> > Hi,
> >
> > I have a small 12-bay SuperMicro server I'm using as a local NAS, with
> > FreeNAS/ZFS.
> >
> > Each drive is a 12TB HDD.
> >
> > I'm in the process of moving it to Linux - and I thought this might be
> > a good chance to try out BTRFS again =).
> >
> > (I'd previously tried BTRFS many years ago, and hit some issues -
> > it's possible this may have been made worse by my inexperience with
> > BTRFS at the time - e.g.
> > https://www.spinics.net/lists/linux-btrfs/msg04240.html)
> >
> > Anyhow - currently the server has a 750GB Intel Optane drive, that
> > we're using as a ZLOG/SIL drive:
> >
> 
> Do you mean ZIL/SLOG? ZIL == ZFS Intent Log, SLOG == Separate Log (a
> dedicated log device).
> 
> > https://www.ixsystems.com/community/threads/how-best-to-use-960gb-optane-in-freenas-build.75798/#post-527264
> >
> > My question is - what's the equivalent in BTRFS-land?
> >
> 
> There is no equivalent at the btrfs level. I guess running btrfs on top of
> bcache may achieve a similar effect.
> 
> > Or what is the best way to use an ultra-fast Intel Optane drive to
> > accelerate reads/writes on a BTRFS array?
> >
> 
> 
> ZIL is a *write* intent log; it does not directly accelerate reads. ZFS
> supports an SSD as a second-level read cache (L2ARC), but as far as I
> remember it is physically separate from the ZIL.

I have used LVM caching underneath btrfs. It's a pain to set up correctly for
a btrfs raid1 setup (you need separate volume groups with separate logical
volumes, so that two raid1 copies cannot silently end up on the same physical
disk), but it worked quite well and I never had any strange problems with it.

Paul.

* Re: Using Intel Optane to accelerate a BTRFS array? (equivalent of ZLOG/SIL for ZFS?)
  2020-03-30  6:00   ` Paul Jones
@ 2020-03-31 17:01     ` Eli V
  2020-03-31 17:09       ` Andrei Borzenkov
  2020-03-31 17:17       ` Roman Mamedov
  0 siblings, 2 replies; 11+ messages in thread
From: Eli V @ 2020-03-31 17:01 UTC (permalink / raw)
  To: Paul Jones; +Cc: Andrei Borzenkov, Victor Hooi, linux-btrfs

On Mon, Mar 30, 2020 at 2:02 AM Paul Jones <paul@pauljones.id.au> wrote:
>
> > -----Original Message-----
> > From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
> > owner@vger.kernel.org> On Behalf Of Andrei Borzenkov
> > Sent: Monday, 30 March 2020 4:46 PM
> > To: Victor Hooi <victorhooi@gmail.com>; linux-btrfs <linux-
> > btrfs@vger.kernel.org>
> > Subject: Re: Using Intel Optane to accelerate a BTRFS array? (equivalent of
> > ZLOG/SIL for ZFS?)
> >
> > 30.03.2020 01:30, Victor Hooi wrote:
> > > Hi,
> > >
> > > I have a small 12-bay SuperMicro server I'm using as a local NAS, with
> > > FreeNAS/ZFS.
> > >
> > > Each drive is a 12TB HDD.
> > >
> > > I'm in the process of moving it to Linux - and I thought this might be
> > > a good chance to try out BTRFS again =).
> > >
> > > (I'd previously tried BTRFS many years ago, and hit some issues -
> > > it's possible this may have been made worse by my inexperience with
> > > BTRFS at the time - e.g.
> > > https://www.spinics.net/lists/linux-btrfs/msg04240.html)
> > >
> > > Anyhow - currently the server has a 750GB Intel Optane drive, that
> > > we're using as a ZLOG/SIL drive:
> > >
> >
> > Do you mean ZIL/SLOG? ZIL == ZFS Intent Log, SLOG == Separate Log (a
> > dedicated log device).
> >
> > > https://www.ixsystems.com/community/threads/how-best-to-use-960gb-optane-in-freenas-build.75798/#post-527264
> > >
> > > My question is - what's the equivalent in BTRFS-land?
> > >
> >
> > There is no equivalent at the btrfs level. I guess running btrfs on top of
> > bcache may achieve a similar effect.
> >
> > > Or what is the best way to use an ultra-fast Intel Optane drive to
> > > accelerate reads/writes on a BTRFS array?
> > >
> >
> >
> > ZIL is a *write* intent log; it does not directly accelerate reads. ZFS
> > supports an SSD as a second-level read cache (L2ARC), but as far as I
> > remember it is physically separate from the ZIL.
>
> I have used LVM caching underneath btrfs. It's a pain to set up correctly for
> a btrfs raid1 setup (you need separate volume groups with separate logical
> volumes, so that two raid1 copies cannot silently end up on the same physical
> disk), but it worked quite well and I never had any strange problems with it.
>
> Paul.

Another option is to put the 12TB drives in an mdadm RAID, and then
use the mdadm raid & the ssd for btrfs RAID1 metadata, with SINGLE
data on the array. Currently, this will make roughly half of the
metadata lookups run at SSD speed, but there is a pending patch to
allow all the metadata reads to go to the SSD. This option is, of
course, only useful for speeding up metadata operations. It can make
large btrfs filesystems feel much more responsive in interactive use
however.
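
A minimal userspace sketch of why this layout tends to keep data off the SSD,
assuming the allocator simply prefers the devices with the most free space.
struct dev, alloc_chunk() and the sizes in GiB are hypothetical names and
numbers chosen for illustration; this is not btrfs code.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_COPIES 2

/* Hypothetical device model: only the free space (in GiB) matters here. */
struct dev {
	const char *name;
	uint64_t free_gib;
};

/*
 * Allocate one chunk of chunk_gib GiB with the given number of copies
 * (1 = SINGLE, 2 = RAID1). Every copy goes to a different device, and the
 * devices with the most free space are taken first.
 * Returns 0 on success, -1 if no suitable set of devices exists.
 */
int alloc_chunk(struct dev *devs, int ndevs, int copies, uint64_t chunk_gib)
{
	int picked[MAX_COPIES];

	assert(copies >= 1 && copies <= MAX_COPIES);
	for (int c = 0; c < copies; c++) {
		int best = -1;

		for (int i = 0; i < ndevs; i++) {
			int used = 0;

			/* A device may hold at most one copy of a chunk. */
			for (int j = 0; j < c; j++)
				if (picked[j] == i)
					used = 1;
			if (used || devs[i].free_gib < chunk_gib)
				continue;
			if (best < 0 || devs[i].free_gib > devs[best].free_gib)
				best = i;
		}
		if (best < 0)
			return -1;
		picked[c] = best;
	}
	for (int c = 0; c < copies; c++)
		devs[picked[c]].free_gib -= chunk_gib;
	return 0;
}
```

Starting from a 750 GiB SSD and a (hypothetical) 100000 GiB array, SINGLE
chunks (copies = 1) are taken from the array only, while every RAID1 chunk
(copies = 2) takes space on both devices.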

* Re: Using Intel Optane to accelerate a BTRFS array? (equivalent of ZLOG/SIL for ZFS?)
  2020-03-31 17:01     ` Eli V
@ 2020-03-31 17:09       ` Andrei Borzenkov
  2020-03-31 20:08         ` Goffredo Baroncelli
  2020-03-31 17:17       ` Roman Mamedov
  1 sibling, 1 reply; 11+ messages in thread
From: Andrei Borzenkov @ 2020-03-31 17:09 UTC (permalink / raw)
  To: Eli V, Paul Jones; +Cc: Victor Hooi, linux-btrfs

31.03.2020 20:01, Eli V wrote:
> 
> Another option is to put the 12TB drives in an mdadm RAID, and then
> use the mdadm raid & the ssd for btrfs RAID1 metadata, with SINGLE
> data on the array.

How do you restrict a specific device to metadata only?

> Currently, this will make roughly half of the
> metadata lookups run at SSD speed, but there is a pending patch to
> allow all the metadata reads to go to the SSD. This option is, of
> course, only useful for speeding up metadata operations. It can make
> large btrfs filesystems feel much more responsive in interactive use
> however.
> 


* Re: Using Intel Optane to accelerate a BTRFS array? (equivalent of ZLOG/SIL for ZFS?)
  2020-03-31 17:01     ` Eli V
  2020-03-31 17:09       ` Andrei Borzenkov
@ 2020-03-31 17:17       ` Roman Mamedov
  2020-03-31 17:31         ` Eli V
  1 sibling, 1 reply; 11+ messages in thread
From: Roman Mamedov @ 2020-03-31 17:17 UTC (permalink / raw)
  To: Eli V; +Cc: Paul Jones, Andrei Borzenkov, Victor Hooi, linux-btrfs

On Tue, 31 Mar 2020 13:01:09 -0400
Eli V <eliventer@gmail.com> wrote:

> Another option is to put the 12TB drives in an mdadm RAID, and then
> use the mdadm raid & the ssd for btrfs RAID1 metadata, with SINGLE
> data on the array. Currently, this will make roughly half of the
> metadata lookups run at SSD speed, but there is a pending patch to
> allow all the metadata reads to go to the SSD. This option is, of
> course, only useful for speeding up metadata operations. It can make
> large btrfs filesystems feel much more responsive in interactive use
> however.

If you're not taking advantage of Btrfs-side features for RAID, then you
might as well run LVM Cache on top of mdadm, and then Btrfs on top of the
cached LV.
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/logical_volume_manager_administration/lvm_cache_volume_creation
https://lukas.zapletalovi.com/2019/05/lvm-cache-in-six-easy-steps.html

Or Bcache, which is the same concept, but I do not suggest it over LVM cache
due to perceived lower code quality, i.e. many data loss bugs, at least in the
past. And as the 2nd article mentions, you can't un-bcache a block device:
even if the cache device is disabled, the bcache metadata cannot be removed,
unlike LVM, where it is easy to switch an LV back to a plain uncached one.

-- 
With respect,
Roman

* Re: Using Intel Optane to accelerate a BTRFS array? (equivalent of ZLOG/SIL for ZFS?)
  2020-03-31 17:17       ` Roman Mamedov
@ 2020-03-31 17:31         ` Eli V
  2020-03-31 17:42           ` Roman Mamedov
  0 siblings, 1 reply; 11+ messages in thread
From: Eli V @ 2020-03-31 17:31 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Paul Jones, Andrei Borzenkov, Victor Hooi, linux-btrfs

On Tue, Mar 31, 2020 at 1:17 PM Roman Mamedov <rm@romanrm.net> wrote:
>
> On Tue, 31 Mar 2020 13:01:09 -0400
> Eli V <eliventer@gmail.com> wrote:
>
> > Another option is to put the 12TB drives in an mdadm RAID, and then
> > use the mdadm raid & the ssd for btrfs RAID1 metadata, with SINGLE
> > data on the array. Currently, this will make roughly half of the
> > metadata lookups run at SSD speed, but there is a pending patch to
> > allow all the metadata reads to go to the SSD. This option is, of
> > course, only useful for speeding up metadata operations. It can make
> > large btrfs filesystems feel much more responsive in interactive use
> > however.
>
> If you're not taking advantage of Btrfs-side features for RAID, then you
> might as well run LVM Cache on top of mdadm, and then Btrfs on top of the
> cached LV.
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/logical_volume_manager_administration/lvm_cache_volume_creation
> https://lukas.zapletalovi.com/2019/05/lvm-cache-in-six-easy-steps.html
>
> Or Bcache, which is the same concept, but I do not suggest it over LVM cache
> due to perceived lower code quality, i.e. many data loss bugs, at least in the
> past. And as the 2nd article mentions, you can't un-bcache a block device:
> even if the cache device is disabled, the bcache metadata cannot be removed,
> unlike LVM, where it is easy to switch an LV back to a plain uncached one.
>
> --
> With respect,
> Roman

Yes, using lvm cache is an option, and will give you actual caching of
the data files as well. However, in my experience it doesn't do much
caching of metadata, so using it on large filesystems doesn't seem to
improve interactive usage much at all, e.g. ls -l, or btrfs filesystem
usage etc.

As to the question of "How do you restrict a specific device to
metadata only?": with btrfs metadata as RAID1 and data as SINGLE, and
the mdadm array being much larger than the SSD, all data allocations
will naturally go to the mdadm array, and all metadata writes will go
to both the SSD and the array. Currently, the metadata reads will be
balanced across the 2 devices based on PID. Once the btrfs readmirror
patches are merged, you'll be able to have all the metadata reads
go to just the SSD.
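
The PID-based balancing can be sketched in userspace as follows. This assumes
the copy is chosen as the reader's PID modulo the number of copies;
pick_mirror(), count_ssd_readers() and the device indices are hypothetical
names for this sketch, not btrfs identifiers.

```c
#include <assert.h>

/* Hypothetical device indices for this sketch. */
enum { DEV_SSD = 0, DEV_MD_ARRAY = 1, NUM_COPIES = 2 };

/*
 * Pick which RAID1 copy serves a read, keyed on the reading process's PID.
 * Assumption: the balancing is simply "PID modulo number of copies".
 */
int pick_mirror(int pid, int num_copies)
{
	assert(pid >= 0 && num_copies > 0);
	return pid % num_copies;
}

/* Count how many PIDs in [first_pid, last_pid] would read from the SSD. */
int count_ssd_readers(int first_pid, int last_pid)
{
	int hits = 0;

	for (int pid = first_pid; pid <= last_pid; pid++)
		if (pick_mirror(pid, NUM_COPIES) == DEV_SSD)
			hits++;
	return hits;
}
```

Under that assumption, a given process always reads metadata from the same
copy, and across many processes about half of them end up on the SSD, which
matches the "roughly half" estimate.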

* Re: Using Intel Optane to accelerate a BTRFS array? (equivalent of ZLOG/SIL for ZFS?)
  2020-03-31 17:31         ` Eli V
@ 2020-03-31 17:42           ` Roman Mamedov
  2020-03-31 19:46             ` Eli V
  0 siblings, 1 reply; 11+ messages in thread
From: Roman Mamedov @ 2020-03-31 17:42 UTC (permalink / raw)
  To: Eli V; +Cc: Paul Jones, Andrei Borzenkov, Victor Hooi, linux-btrfs

On Tue, 31 Mar 2020 13:31:19 -0400
Eli V <eliventer@gmail.com> wrote:

> Yes, using lvm cache is an option, and will give you actual caching of
> the data files as well. However, in my experience it doesn't do much
> caching of metadata, so using it on large filesystems doesn't seem to
> improve interactive usage much at all, e.g. ls -l, or btrfs filesystem
> usage etc.

Forgot to mention that in my case (on a large media server) I had great
results with the described setup, especially noticeable in the mount time.
Walking large directories in a GUI file manager was more responsive too. Not
to mention mass deletion of snapshots. LVM cache seemed to know well to avoid
polluting itself with infrequently accessed sequential-pattern bulk operations
(i.e. copying or reading back the actual file data) and appeared to cache
mostly the metadata as it should. For anyone considering this, give it a try,
and give it at least a few days of normal usage to properly warm up.

-- 
With respect,
Roman

* Re: Using Intel Optane to accelerate a BTRFS array? (equivalent of ZLOG/SIL for ZFS?)
  2020-03-31 17:42           ` Roman Mamedov
@ 2020-03-31 19:46             ` Eli V
  0 siblings, 0 replies; 11+ messages in thread
From: Eli V @ 2020-03-31 19:46 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Paul Jones, Andrei Borzenkov, Victor Hooi, linux-btrfs

On Tue, Mar 31, 2020 at 1:42 PM Roman Mamedov <rm@romanrm.net> wrote:
>
> On Tue, 31 Mar 2020 13:31:19 -0400
> Eli V <eliventer@gmail.com> wrote:
>
> > Yes, using lvm cache is an option, and will give you actual caching of
> > the data files as well. However, in my experience it doesn't do much
> > caching of metadata, so using it on large filesystems doesn't seem to
> > improve interactive usage much at all, e.g. ls -l, or btrfs filesystem
> > usage etc.
>
> Forgot to mention that in my case (on a large media server) I had great
> results with the described setup, especially noticeable in the mount time.
> Walking large directories in a GUI file manager was more responsive too. Not
> to mention mass deletion of snapshots. LVM cache seemed to know well to avoid
> polluting itself with infrequently accessed sequential-pattern bulk operations
> (i.e. copying or reading back the actual file data) and appeared to cache
> mostly the metadata as it should. For anyone considering this, give it a try,
> and give it at least a few days of normal usage to properly warm up.
>
> --
> With respect,
> Roman

Yes, certainly test it out for yourself. My use case is quite
different: large (>300TB) btrfs filesystems used for rsync & snapshot
backups of proprietary NAS. The coolest thing is, through the wonders
of btrfs and lvm, you can dynamically convert from one configuration
to the other. I don't think even a umount is needed.

* Re: Using Intel Optane to accelerate a BTRFS array? (equivalent of ZLOG/SIL for ZFS?)
  2020-03-31 17:09       ` Andrei Borzenkov
@ 2020-03-31 20:08         ` Goffredo Baroncelli
  2020-03-31 21:44           ` Goffredo Baroncelli
  0 siblings, 1 reply; 11+ messages in thread
From: Goffredo Baroncelli @ 2020-03-31 20:08 UTC (permalink / raw)
  To: Andrei Borzenkov, Eli V, Paul Jones; +Cc: Victor Hooi, linux-btrfs

On 3/31/20 7:09 PM, Andrei Borzenkov wrote:
> 31.03.2020 20:01, Eli V wrote:
>>
>> Another option is to put the 12TB drives in an mdadm RAID, and then
>> use the mdadm raid & the ssd for btrfs RAID1 metadata, with SINGLE
>> data on the array.
> 
> How do you restrict a specific device to metadata only?

I never tried, but I don't think that it would be so complicated.

When BTRFS has to allocate a new chunk, it collects all the available
free space on the disks, sorts the candidates by criteria such as the
largest contiguous area and how full the disk is, and picks the top one.

It could be sufficient to add another criterion to the sorting algorithm,
something like this:
- if the chunk is a metadata one, an SSD has a higher priority
- if the chunk is a data one, an SSD has a lower priority

So the metadata will have a higher likelihood of being on the SSD,
while the data will have a higher likelihood of being on a non-SSD disk.

Of course this is a soft constraint: when one kind of disk is full, it will
still be possible to use the other kind, only with a lower priority.
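
As a rough userspace illustration of the extra sorting criterion (not kernel
code), the comparators below put SSDs first for metadata chunks and last for
data chunks, then fall back to the largest available space. struct dev_info,
is_ssd and sort_candidates() are hypothetical names chosen for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-device record used only by this sketch. */
struct dev_info {
	const char *name;
	bool is_ssd;
	uint64_t max_avail;
};

/* Existing criterion: larger available space sorts first. */
static int cmp_avail(const struct dev_info *a, const struct dev_info *b)
{
	if (a->max_avail > b->max_avail)
		return -1;
	if (a->max_avail < b->max_avail)
		return 1;
	return 0;
}

/* Metadata chunks: SSDs sort before rotational disks. */
static int cmp_metadata(const void *a, const void *b)
{
	const struct dev_info *da = a, *db = b;

	if (da->is_ssd != db->is_ssd)
		return da->is_ssd ? -1 : 1;
	return cmp_avail(da, db);
}

/* Data chunks: SSDs sort after rotational disks. */
static int cmp_data(const void *a, const void *b)
{
	const struct dev_info *da = a, *db = b;

	if (da->is_ssd != db->is_ssd)
		return da->is_ssd ? 1 : -1;
	return cmp_avail(da, db);
}

/* Order the candidate devices for one chunk allocation. */
void sort_candidates(struct dev_info *devs, size_t n, bool metadata_chunk)
{
	qsort(devs, n, sizeof(*devs), metadata_chunk ? cmp_metadata : cmp_data);
}
```

Because the device type only changes the order and never removes a candidate,
the other kind of disk is still used once the preferred kind has no room left,
which is the soft constraint described above.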

> 
>> Currently, this will make roughly half of the
>> metadata lookups run at SSD speed, but there is a pending patch to
>> allow all the metadata reads to go to the SSD. This option is, of
>> course, only useful for speeding up metadata operations. It can make
>> large btrfs filesystems feel much more responsive in interactive use
>> however.
>>
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

* Re: Using Intel Optane to accelerate a BTRFS array? (equivalent of ZLOG/SIL for ZFS?)
  2020-03-31 20:08         ` Goffredo Baroncelli
@ 2020-03-31 21:44           ` Goffredo Baroncelli
  0 siblings, 0 replies; 11+ messages in thread
From: Goffredo Baroncelli @ 2020-03-31 21:44 UTC (permalink / raw)
  To: Andrei Borzenkov, Eli V, Paul Jones; +Cc: Victor Hooi, linux-btrfs

On 3/31/20 10:08 PM, Goffredo Baroncelli wrote:
> On 3/31/20 7:09 PM, Andrei Borzenkov wrote:
>> 31.03.2020 20:01, Eli V wrote:
>>>
>>> Another option is to put the 12TB drives in an mdadm RAID, and then
>>> use the mdadm raid & the ssd for btrfs RAID1 metadata, with SINGLE
>>> data on the array.
>>
>> How do you restrict a specific device to metadata only?
> 
> I never tried, but I don't think that it would be so complicated.
> 
> When BTRFS has to allocate a new chunk, it collects all the available
> free space on the disks, sorts the candidates by criteria such as the
> largest contiguous area and how full the disk is, and picks the top one.
> 
> It could be sufficient to add another criterion to the sorting algorithm,
> something like this:
> - if the chunk is a metadata one, an SSD has a higher priority
> - if the chunk is a data one, an SSD has a lower priority
> 
> So the metadata will have a higher likelihood of being on the SSD,
> while the data will have a higher likelihood of being on a non-SSD disk.
> 
> Of course this is a soft constraint: when one kind of disk is full, it will
> still be possible to use the other kind, only with a lower priority.
> 

The patch below is only meant to give an idea. To enable the feature, the
filesystem must be mounted with the ssd_metadata option:

# mount -o ssd_metadata /dev/sdX /mnt/test

(don't try this at home!)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2e9f938508e9..0f3c09cc4863 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1187,6 +1187,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
  #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
  #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
  #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
+#define BTRFS_MOUNT_SSD_METADATA	(1 << 29)
  
  #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
  #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index c6557d44907a..d0a5cf496f90 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -346,6 +346,7 @@ enum {
  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
  	Opt_ref_verify,
  #endif
+	Opt_ssd_metadata,
  	Opt_err,
  };
  
@@ -416,6 +417,7 @@ static const match_table_t tokens = {
  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
  	{Opt_ref_verify, "ref_verify"},
  #endif
+	{Opt_ssd_metadata, "ssd_metadata"},
  	{Opt_err, NULL},
  };
  
@@ -853,6 +855,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
  			btrfs_set_opt(info->mount_opt, REF_VERIFY);
  			break;
  #endif
+		case Opt_ssd_metadata:
+			btrfs_set_and_info(info, SSD_METADATA,
+					"enabling ssd_metadata");
+			break;
  		case Opt_err:
  			btrfs_info(info, "unrecognized mount option '%s'", p);
  			ret = -EINVAL;
@@ -1369,6 +1375,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
  #endif
  	if (btrfs_test_opt(info, REF_VERIFY))
  		seq_puts(seq, ",ref_verify");
+	if (btrfs_test_opt(info, SSD_METADATA))
+		seq_puts(seq, ",ssd_metadata");
  	seq_printf(seq, ",subvolid=%llu",
  		  BTRFS_I(d_inode(dentry))->root->root_key.objectid);
  	seq_puts(seq, ",subvol=");
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a8b71ded4d21..43bb5d98a8cb 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4758,6 +4758,67 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
  	return 0;
  }
  
+/*
+ * sort the devices with non-rotational ones first, then in descending
+ * order by max_avail, total_avail
+ */
+static int btrfs_cmp_device_info_metadata(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+	const int nrot_a = test_bit(QUEUE_FLAG_NONROT,
+			&(bdev_get_queue(di_a->dev->bdev)->queue_flags));
+
+	const int nrot_b = test_bit(QUEUE_FLAG_NONROT,
+			&(bdev_get_queue(di_b->dev->bdev)->queue_flags));
+
+	/* metadata -> non rotational first */
+	if (nrot_a && !nrot_b)
+		return -1;
+	if (!nrot_a && nrot_b)
+		return 1;
+	if (di_a->max_avail > di_b->max_avail)
+		return -1;
+	if (di_a->max_avail < di_b->max_avail)
+		return 1;
+	if (di_a->total_avail > di_b->total_avail)
+		return -1;
+	if (di_a->total_avail < di_b->total_avail)
+		return 1;
+	return 0;
+}
+
+/*
+ * sort the devices with rotational ones first, then in descending
+ * order by max_avail, total_avail
+ */
+static int btrfs_cmp_device_info_data(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+	const int nrot_a = test_bit(QUEUE_FLAG_NONROT,
+			&(bdev_get_queue(di_a->dev->bdev)->queue_flags));
+	const int nrot_b = test_bit(QUEUE_FLAG_NONROT,
+			&(bdev_get_queue(di_b->dev->bdev)->queue_flags));
+
+	/* data -> non rotational last */
+	if (nrot_a && !nrot_b)
+		return 1;
+	if (!nrot_a && nrot_b)
+		return -1;
+	if (di_a->max_avail > di_b->max_avail)
+		return -1;
+	if (di_a->max_avail < di_b->max_avail)
+		return 1;
+	if (di_a->total_avail > di_b->total_avail)
+		return -1;
+	if (di_a->total_avail < di_b->total_avail)
+		return 1;
+	return 0;
+}
+
+
+
  static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
  {
  	if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
@@ -4917,9 +4978,17 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
  	/*
  	 * now sort the devices by hole size / available space
  	 */
-	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
-	     btrfs_cmp_device_info, NULL);
-
+	if (((type & BTRFS_BLOCK_GROUP_DATA) &&
+	     (type & BTRFS_BLOCK_GROUP_METADATA)) ||
+	    !btrfs_test_opt(info, SSD_METADATA))
+		sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+			     btrfs_cmp_device_info, NULL);
+	else if (type & BTRFS_BLOCK_GROUP_DATA)
+		sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+			     btrfs_cmp_device_info_data, NULL);
+	else
+		sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+			     btrfs_cmp_device_info_metadata, NULL);
  	/*
  	 * Round down to number of usable stripes, devs_increment can be any
  	 * number so we can't use round_down()
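
To check which devices the QUEUE_FLAG_NONROT test in the patch would treat as
SSDs, one can read /sys/block/<dev>/queue/rotational, which is assumed here to
report the inverse of that flag (0 = non-rotational, 1 = rotational). A small
userspace sketch; parse_nonrot() and dev_is_nonrot() are hypothetical helper
names, not part of the patch:

```c
#include <stdio.h>
#include <string.h>

/*
 * Interpret the contents of /sys/block/<dev>/queue/rotational.
 * Returns 1 for non-rotational ("0"), 0 for rotational ("1"),
 * and -1 for anything else.
 */
int parse_nonrot(const char *text)
{
	if (strcmp(text, "0\n") == 0 || strcmp(text, "0") == 0)
		return 1;
	if (strcmp(text, "1\n") == 0 || strcmp(text, "1") == 0)
		return 0;
	return -1;
}

/*
 * Look up the flag for a block device name such as "sda" or "md0".
 * Returns -1 if the sysfs file cannot be read.
 */
int dev_is_nonrot(const char *dev)
{
	char path[256];
	char buf[8];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/rotational", dev);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f))
		buf[0] = '\0';
	fclose(f);
	return parse_nonrot(buf);
}
```

This is worth checking for stacked devices (for example an md array or a
cached LV), since the flag they report decides on which side of the sort they
land.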


>>
>>> Currently, this will make roughly half of the
>>> metadata lookups run at SSD speed, but there is a pending patch to
>>> allow all the metadata reads to go to the SSD. This option is, of
>>> course, only useful for speeding up metadata operations. It can make
>>> large btrfs filesystems feel much more responsive in interactive use
>>> however.
>>>
>>
> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
