linux-btrfs.vger.kernel.org archive mirror
* Tiered storage?
@ 2017-11-15  1:01 Roy Sigurd Karlsbakk
  2017-11-15  7:11 ` waxhead
  0 siblings, 1 reply; 8+ messages in thread
From: Roy Sigurd Karlsbakk @ 2017-11-15  1:01 UTC (permalink / raw)
  To: linux-btrfs

Hi all

I've been following this project on and off for quite a few years, and I wonder if anyone has looked into tiered storage on it. By tiered storage I mean hot data lying on fast storage and cold data on slow storage. I'm not talking about caching (where you just keep a copy of the hot data on the fast storage).

And btw, how far are raid[56] and block-level dedup from something useful in production?

Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita.


* Re: Tiered storage?
  2017-11-15  1:01 Tiered storage? Roy Sigurd Karlsbakk
@ 2017-11-15  7:11 ` waxhead
  2017-11-15  9:26   ` Marat Khalili
                     ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: waxhead @ 2017-11-15  7:11 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk, linux-btrfs

As a regular BTRFS user I can tell you that there is no such thing as 
hot data tracking yet. Some people seem to use bcache together with 
btrfs and come asking for help on the mailing list.

Raid5/6 have received a few fixes recently, and it *may* soon be worth 
trying out raid5/6 for data while keeping metadata in raid1/10 (I would 
rather lose a file or two than the entire filesystem).
I had plans to run some tests on this a while ago, but forgot about it.
Like all good citizens, remember to have good backups. The last time I 
tested raid5/6 I ran into issues quickly. For what it's worth, raid1/10 
seems pretty rock solid as long as you have sufficient disks (hint: you 
need more than two for raid1 if you want to stay safe).

As for dedupe, there is (to my knowledge) nothing fully automatic yet. 
You have to run a program to scan your filesystem, but all the 
deduplication is done in the kernel.
duperemove seemed to work quite well when I tested it, but there may 
be some performance implications.

Roy Sigurd Karlsbakk wrote:
> Hi all
>
> I've been following this project on and off for quite a few years, and I wonder if anyone has looked into tiered storage on it. By tiered storage I mean hot data lying on fast storage and cold data on slow storage. I'm not talking about caching (where you just keep a copy of the hot data on the fast storage).
>
> And btw, how far are raid[56] and block-level dedup from something useful in production?
>
> Best regards
>
> roy
> --
> Roy Sigurd Karlsbakk
> (+47) 98013356
> http://blogg.karlsbakk.net/
> GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
> --
> Hið góða skaltu í stein höggva, hið illa í snjó rita.


* Re: Tiered storage?
  2017-11-15  7:11 ` waxhead
@ 2017-11-15  9:26   ` Marat Khalili
  2017-11-15 12:43     ` Austin S. Hemmelgarn
  2017-11-15 12:52   ` Austin S. Hemmelgarn
  2017-11-16 16:42   ` Kai Krakow
  2 siblings, 1 reply; 8+ messages in thread
From: Marat Khalili @ 2017-11-15  9:26 UTC (permalink / raw)
  To: waxhead, linux-btrfs; +Cc: Roy Sigurd Karlsbakk


On 15/11/17 10:11, waxhead wrote:
> hint: you need more than two for raid1 if you want to stay safe
Huh? Two is not enough? Having three or more makes a difference? (Or, 
you mean hot spare?)

--

With Best Regards,
Marat Khalili


* Re: Tiered storage?
  2017-11-15  9:26   ` Marat Khalili
@ 2017-11-15 12:43     ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 8+ messages in thread
From: Austin S. Hemmelgarn @ 2017-11-15 12:43 UTC (permalink / raw)
  To: Marat Khalili, waxhead, linux-btrfs; +Cc: Roy Sigurd Karlsbakk

On 2017-11-15 04:26, Marat Khalili wrote:
> 
> On 15/11/17 10:11, waxhead wrote:
>> hint: you need more than two for raid1 if you want to stay safe
> Huh? Two is not enough? Having three or more makes a difference? (Or, 
> you mean hot spare?)
They're probably referring to an issue where a two-device array 
configured for raid1 which had lost a device and was mounted degraded 
and writable would generate single-profile chunks on the remaining 
device instead of a half-complete raid1 chunk.  Combined with the fact 
that older kernels checked the filesystem as a whole (rather than 
individual chunks) for normal/degraded/irreparable status, and would 
therefore refuse to mount the resulting filesystem, this meant that you 
only had one chance to fix such an array.

If instead you have more than two devices, regular complete raid1 
profile chunks are generated, and it becomes a non-issue.

The second issue has been fixed in the most recent kernels, which now 
check degraded status at the chunk level instead of the volume level.

The first issue has not been fixed yet, but I'm pretty sure there are 
patches pending.
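
For the curious, here is a rough, hypothetical sketch of how one could 
check from userspace whether single-profile data chunks have crept onto 
a filesystem that is supposed to be raid1 -- the situation described 
above.  It uses the same BTRFS_IOC_SPACE_INFO ioctl that "btrfs fi df" 
relies on; it assumes reasonably recent uapi headers, and the mount 
point is just a placeholder.

/* Sketch: list data chunks that carry no replication/parity profile. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>
#include <linux/btrfs_tree.h>

#define ANY_PROFILE (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1 | \
                     BTRFS_BLOCK_GROUP_RAID10 | BTRFS_BLOCK_GROUP_RAID5 | \
                     BTRFS_BLOCK_GROUP_RAID6 | BTRFS_BLOCK_GROUP_DUP)

int main(int argc, char **argv)
{
    int fd = open(argc > 1 ? argv[1] : "/mnt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* First call with space_slots == 0 just asks how many entries exist. */
    struct btrfs_ioctl_space_args probe = { 0 };
    if (ioctl(fd, BTRFS_IOC_SPACE_INFO, &probe) < 0) {
        perror("BTRFS_IOC_SPACE_INFO"); return 1;
    }

    struct btrfs_ioctl_space_args *args =
        calloc(1, sizeof(*args) +
                  probe.total_spaces * sizeof(struct btrfs_ioctl_space_info));
    args->space_slots = probe.total_spaces;
    if (ioctl(fd, BTRFS_IOC_SPACE_INFO, args) < 0) {
        perror("BTRFS_IOC_SPACE_INFO"); return 1;
    }

    for (unsigned i = 0; i < args->total_spaces; i++) {
        __u64 flags = args->spaces[i].flags;
        /* Data chunks with no replication/parity bits set are "single". */
        if ((flags & BTRFS_BLOCK_GROUP_DATA) && !(flags & ANY_PROFILE))
            printf("warning: %llu bytes of single-profile data chunks\n",
                   (unsigned long long)args->spaces[i].total_bytes);
    }
    free(args);
    return 0;
}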


* Re: Tiered storage?
  2017-11-15  7:11 ` waxhead
  2017-11-15  9:26   ` Marat Khalili
@ 2017-11-15 12:52   ` Austin S. Hemmelgarn
  2017-11-15 14:10     ` Roy Sigurd Karlsbakk
  2017-11-16 16:42   ` Kai Krakow
  2 siblings, 1 reply; 8+ messages in thread
From: Austin S. Hemmelgarn @ 2017-11-15 12:52 UTC (permalink / raw)
  To: waxhead, Roy Sigurd Karlsbakk, linux-btrfs

On 2017-11-15 02:11, waxhead wrote:
> As a regular BTRFS user I can tell you that there is no such thing as 
> hot data tracking yet. Some people seem to use bcache together with 
> btrfs and come asking for help on the mailing list.
Bcache works fine with recent versions; it was only older versions that 
had issues.  dm-cache similarly works fine on recent versions.  In both 
cases, though, you need to be sure you know what you're doing, otherwise 
you are liable to break things.
> 
> Raid5/6 have received a few fixes recently, and it *may* soon be worth 
> trying out raid5/6 for data while keeping metadata in raid1/10 (I would 
> rather lose a file or two than the entire filesystem).
> I had plans to run some tests on this a while ago, but forgot about it.
> Like all good citizens, remember to have good backups. The last time I 
> tested raid5/6 I ran into issues quickly. For what it's worth, raid1/10 
> seems pretty rock solid as long as you have sufficient disks (hint: you 
> need more than two for raid1 if you want to stay safe).
Parity profiles (raid5 and raid6) still have issues, although there are 
fewer than there were, with most of the remaining issues surrounding 
recovery.  I would still recommend against it for production usage.

Simple replication (raid1) is pretty much rock solid as long as you keep 
on top of replacing failing hardware and aren't stupid enough to run the 
array degraded for any extended period of time (converting to a single 
device volume instead of leaving things with half a volume is vastly 
preferred for multiple reasons).

Striped replication (raid10) is generally fine, but you can get much 
better performance by running BTRFS with a raid1 profile on top of two 
MD/LVM/Hardware RAID0 volumes (BTRFS still doesn't do a very good job of 
parallelizing things).
> 
> As for dedupe, there is (to my knowledge) nothing fully automatic yet. 
> You have to run a program to scan your filesystem, but all the 
> deduplication is done in the kernel.
> duperemove seemed to work quite well when I tested it, but there may 
> be some performance implications.
Correct, there is nothing automatic (and there are pretty significant 
arguments against doing automatic deduplication in most cases), but the 
off-line options (via the EXTENT_SAME ioctl) are reasonably reliable. 
Duperemove in particular does a good job, though it may take a long time 
for large data sets.

As far as performance goes, it's no worse than large numbers of 
snapshots.  The issues arise from using very large numbers of reflinks.
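
To make the off-line mechanism concrete, here is a minimal sketch of 
the dedupe call that tools like duperemove issue.  The btrfs-specific 
EXTENT_SAME ioctl has been exposed generically as FIDEDUPERANGE in 
<linux/fs.h> since kernel 4.5; the sketch assumes that interface, and 
the file paths are made up.

/* Sketch: ask the kernel to share the first 128 KiB of two files. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(void)
{
    int src = open("/data/copy-a", O_RDONLY);
    int dst = open("/data/copy-b", O_RDWR);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    size_t len = 128 * 1024;   /* dedupe the first 128 KiB of both files */
    struct file_dedupe_range *arg =
        calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
    arg->src_offset = 0;
    arg->src_length = len;
    arg->dest_count = 1;
    arg->info[0].dest_fd = dst;
    arg->info[0].dest_offset = 0;

    if (ioctl(src, FIDEDUPERANGE, arg) < 0) {
        perror("FIDEDUPERANGE");
        return 1;
    }
    if (arg->info[0].status == FILE_DEDUPE_RANGE_SAME)
        printf("deduped %llu bytes\n",
               (unsigned long long)arg->info[0].bytes_deduped);
    else
        printf("ranges differ, nothing shared\n");
    free(arg);
    return 0;
}

The kernel re-verifies that both ranges really contain identical data 
before sharing extents, which is why user-space dedupers are safe to 
run against live data.
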
> 
> Roy Sigurd Karlsbakk wrote:
>> Hi all
>>
>> I've been following this project on and off for quite a few years, and 
>> I wonder if anyone has looked into tiered storage on it. By tiered 
>> storage I mean hot data lying on fast storage and cold data on slow 
>> storage. I'm not talking about caching (where you just keep a copy of 
>> the hot data on the fast storage).
>>
>> And btw, how far are raid[56] and block-level dedup from something 
>> useful in production?
>>
>> Best regards
>>
>> roy
>> -- 
>> Roy Sigurd Karlsbakk
>> (+47) 98013356
>> http://blogg.karlsbakk.net/
>> GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
>> -- 
>> Hið góða skaltu í stein höggva, hið illa í snjó rita.



* Re: Tiered storage?
  2017-11-15 12:52   ` Austin S. Hemmelgarn
@ 2017-11-15 14:10     ` Roy Sigurd Karlsbakk
  2017-11-15 22:09       ` Duncan
  0 siblings, 1 reply; 8+ messages in thread
From: Roy Sigurd Karlsbakk @ 2017-11-15 14:10 UTC (permalink / raw)
  To: Austin S Hemmelgarn; +Cc: waxhead, linux-btrfs

>> As for dedupe, there is (to my knowledge) nothing fully automatic yet.
>> You have to run a program to scan your filesystem, but all the
>> deduplication is done in the kernel.
>> duperemove seemed to work quite well when I tested it, but there may
>> be some performance implications.
> Correct, there is nothing automatic (and there are pretty significant
> arguments against doing automatic deduplication in most cases), but the
> off-line options (via the EXTENT_SAME ioctl) are reasonably reliable.
> Duperemove in particular does a good job, though it may take a long time
> for large data sets.
> 
> As far as performance, it's no worse than large numbers of snapshots.
> The issues arise from using very large numbers of reflinks.

What is this "large" number of snapshots? Not that it's directly comparable, but I've worked with ZFS for a while and haven't seen those issues there.

Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita.


* Re: Tiered storage?
  2017-11-15 14:10     ` Roy Sigurd Karlsbakk
@ 2017-11-15 22:09       ` Duncan
  0 siblings, 0 replies; 8+ messages in thread
From: Duncan @ 2017-11-15 22:09 UTC (permalink / raw)
  To: linux-btrfs

Roy Sigurd Karlsbakk posted on Wed, 15 Nov 2017 15:10:08 +0100 as
excerpted:

>>> As for dedupe, there is (to my knowledge) nothing fully automatic yet.
>>> You have to run a program to scan your filesystem, but all the
>>> deduplication is done in the kernel.
>>> duperemove seemed to work quite well when I tested it, but there may
>>> be some performance implications.

>> Correct, there is nothing automatic (and there are pretty significant
>> arguments against doing automatic deduplication in most cases), but the
>> off-line options (via the EXTENT_SAME ioctl) are reasonably reliable.
>> Duperemove in particular does a good job, though it may take a long
>> time for large data sets.
>> 
>> As far as performance, it's no worse than large numbers of snapshots.
>> The issues arise from using very large numbers of reflinks.
> 
> What is this "large" number of snapshots? Not that it's directly
> comparible, but I've worked with ZFS a while, and haven't seen those
> issues there.

Btrfs has scaling issues with reflinks, not so much in normal operation, 
but when it comes to filesystem maintenance such as btrfs check and btrfs 
balance.

Numerically, low double-digits of reflinks per extent seems to be 
reasonably fine, high double-digits to low triple-digits begins to run 
into scaling issues, and high triple digits to over 1000... better be 
prepared to wait awhile (can be days or weeks!) for that balance or check 
to complete, and check requires LOTS more memory as well, particularly at 
TB+ scale.

Of course snapshots are the common instance of reflinking, and each 
snapshot is another reflink to each extent of the data in the subvolume 
it covers, so limiting snapshots to 10-50 of each subvolume is 
recommended, and limiting to under 250-ish is STRONGLY recommended.  
(Total number of snapshots per filesystem, where there's many subvolumes 
and snapshots per subvolume falls within the above limits, doesn't seem 
to be a problem.)
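
For anyone unsure what a "reflink" actually is at the file level, here 
is a minimal, hypothetical sketch using the generic FICLONE ioctl from 
<linux/fs.h> (the same mechanism "cp --reflink" uses); the paths are 
placeholders:

/* Sketch: make /data/reflink-copy share /data/original's extents. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(void)
{
    int src = open("/data/original", O_RDONLY);
    int dst = open("/data/reflink-copy", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    /* No data is copied; the destination just gains another reference
     * to the source's extents -- which is exactly the kind of reference
     * that check/balance later has to walk per extent. */
    if (ioctl(dst, FICLONE, src) < 0) {
        perror("FICLONE");
        return 1;
    }
    return 0;
}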

Dedupe uses reflinking too, but the effects can be much more variable 
depending on the use-case and how many actual reflinks are being created.

A single extent with 1000 deduping reflinks, as might be common in a 
commercial/hosting use-case, shouldn't be too bad, perhaps comparable to 
a single snapshot, but obviously, do that with a bunch of extents (as a 
hosting use-case might) and it quickly builds to the effect of 1000 
snapshots of the same subvolume, which as mentioned above puts 
maintenance-task time out of the realm of reasonable, for many.

Tho of course in a commercial/hosting case maintenance may well not be 
done as a simple swap-in of a fresh backup is more likely, so it may not 
matter for that scenario.

OTOH, a typical individual/personal use-case may dedup many files but 
only single-digit times each, so the effect would be the same as a single-
digit number of snapshots at worst.

Meanwhile, while btrfs quotas are finally maturing in terms of actually 
tracking the numbers correctly, their effect on scaling is pretty bad 
too.  The recommendation is to keep btrfs quotas off unless you actually 
need them.  If you do need quotas, temporarily disable them while doing 
balances and device-removes (which do implicit balances), then quota-
rescan after the balance is done, because precisely tracking quotas thru 
a balance ends up repeatedly recalculating the numbers again and again 
during the balance, and that just doesn't scale.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Tiered storage?
  2017-11-15  7:11 ` waxhead
  2017-11-15  9:26   ` Marat Khalili
  2017-11-15 12:52   ` Austin S. Hemmelgarn
@ 2017-11-16 16:42   ` Kai Krakow
  2 siblings, 0 replies; 8+ messages in thread
From: Kai Krakow @ 2017-11-16 16:42 UTC (permalink / raw)
  To: linux-btrfs

On Wed, 15 Nov 2017 08:11:04 +0100,
waxhead <waxhead@dirtcellar.net> wrote:

> As for dedupe, there is (to my knowledge) nothing fully automatic yet. 
> You have to run a program to scan your filesystem, but all the 
> deduplication is done in the kernel.
> duperemove seemed to work quite well when I tested it, but there
> may be some performance implications.

There's bees, a near-line deduplication tool: it watches for
generation changes in the filesystem and walks the inodes. It only
looks at extents, not at files. Deduplication itself is then delegated
to the kernel, which ensures all changes are data-safe. The process
runs as a daemon and processes your changes in realtime (delayed by
a few seconds to minutes of course, due to transaction commit and
hashing phases).

You need to dedicate part of your RAM to it; around 1 GB is usually
sufficient to work well enough. That RAM will be locked and cannot be
swapped out, so you should have a sufficiently equipped system.

Works very well here (2TB of data, 1GB hash table, 16GB RAM).
Newly duplicated files are picked up within seconds, scanned (hitting
the cache most of the time, thus not requiring physical IO), and then
submitted to the kernel for deduplication.

I'd call that fully automatic: Once set up, it just works, and works
well. Performance impact is very low once the initial scan is done.

https://github.com/Zygo/bees


-- 
Regards,
Kai

Replies to list-only preferred.



end of thread, other threads:[~2017-11-16 16:42 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-15  1:01 Tiered storage? Roy Sigurd Karlsbakk
2017-11-15  7:11 ` waxhead
2017-11-15  9:26   ` Marat Khalili
2017-11-15 12:43     ` Austin S. Hemmelgarn
2017-11-15 12:52   ` Austin S. Hemmelgarn
2017-11-15 14:10     ` Roy Sigurd Karlsbakk
2017-11-15 22:09       ` Duncan
2017-11-16 16:42   ` Kai Krakow
