* Tiered storage?
From: Roy Sigurd Karlsbakk @ 2017-11-15 1:01 UTC
To: linux-btrfs
Hi all
I've been following this project on and off for quite a few years, and I wonder if anyone has looked into tiered storage on it. By tiered storage, I mean hot data lying on fast storage and cold data on slow storage. I'm not talking about caching (where you just keep a copy of the hot data on the fast storage).
And btw, how far are raid[56] and block-level dedup from being usable in production?
Kind regards
roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita. (Carve the good in stone, write the bad in snow.)
* Re: Tiered storage?
From: waxhead @ 2017-11-15 7:11 UTC
To: Roy Sigurd Karlsbakk, linux-btrfs
As a regular BTRFS user I can tell you that there is no such thing as
hot data tracking yet. Some people seem to use bcache together with
btrfs and come asking for help on the mailing list.
Raid5/6 have received a few fixes recently, and it *may* soon be worth
trying out raid5/6 for data while keeping metadata in raid1/10 (I would
rather lose a file or two than the entire filesystem).
I had plans to run some tests on this a while ago, but forgot about it.
As all good citizens do, remember to have good backups. The last time I
tested raid5/6 I ran into issues quickly. For what it's worth, raid1/10
seems pretty rock solid as long as you have sufficient disks (hint: you
need more than two for raid1 if you want to stay safe).
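If you want to experiment with that split, a minimal sketch (assuming
four spare disks sdb through sde and a mount at /mnt; adjust device
names to your setup):

    # fresh filesystem: raid5 for data, raid1 for metadata
    mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd /dev/sde

    # or convert an existing, mounted filesystem in place
    btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt

    # check the resulting profiles
    btrfs filesystem usage /mnt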
As for dedupe, there is (to my knowledge) nothing fully automatic yet.
You have to run a program to scan your filesystem, but all the
deduplication is done in the kernel.
duperemove seemed to work quite well when I tested it, but there may
be some performance implications.
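A typical invocation looks something like this (the paths are just
examples):

    # -d actually submits duplicates to the kernel, -r recurses;
    # --hashfile keeps block hashes on disk so reruns are incremental
    duperemove -dr --hashfile=/var/tmp/dedupe.hash /mnt/data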
Roy Sigurd Karlsbakk wrote:
> Hi all
>
> I've been following this project on and off for quite a few years, and I wonder if anyone has looked into tiered storage on it. By tiered storage, I mean hot data lying on fast storage and cold data on slow storage. I'm not talking about caching (where you just keep a copy of the hot data on the fast storage).
>
> And btw, how far are raid[56] and block-level dedup from being usable in production?
>
> Kind regards
>
> roy
* Re: Tiered storage?
From: Marat Khalili @ 2017-11-15 9:26 UTC
To: waxhead, linux-btrfs; +Cc: Roy Sigurd Karlsbakk
On 15/11/17 10:11, waxhead wrote:
> hint: you need more than two for raid1 if you want to stay safe
Huh? Two is not enough? Does having three or more make a difference?
(Or do you mean a hot spare?)
--
With Best Regards,
Marat Khalili
* Re: Tiered storage?
From: Austin S. Hemmelgarn @ 2017-11-15 12:43 UTC
To: Marat Khalili, waxhead, linux-btrfs; +Cc: Roy Sigurd Karlsbakk
On 2017-11-15 04:26, Marat Khalili wrote:
>
> On 15/11/17 10:11, waxhead wrote:
>> hint: you need more than two for raid1 if you want to stay safe
> Huh? Two is not enough? Does having three or more make a difference?
> (Or do you mean a hot spare?)
They're probably referring to an issue where a two-device array
configured for raid1, which had lost a device and was mounted degraded
and writable, would generate single-profile chunks on the remaining
device instead of half-complete raid1 chunks. Combined with the fact
that older kernels checked normal/degraded/irreparable status for the
filesystem as a whole instead of per chunk, and would therefore refuse
to mount the resulting filesystem, this meant that you only had one
chance to fix such an array.
If instead you have more than two devices, regular complete raid1
chunks are generated, and it becomes a non-issue.
The second issue has been fixed in the most recent kernels, which now
check degraded status at the chunk level instead of the volume level.
The first issue has not been fixed yet, but I'm pretty sure there are
patches pending.
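For reference, the one-shot recovery on an affected kernel looks
roughly like this (assuming a two-device raid1 where /dev/sdb died and
/dev/sdc is the replacement; device names are illustrative):

    # on old kernels you get exactly one degraded, writable mount
    mount -o degraded /dev/sda /mnt

    # add a replacement and drop the missing device
    btrfs device add /dev/sdc /mnt
    btrfs device remove missing /mnt

    # convert any single chunks created while degraded back to raid1;
    # the 'soft' filter skips chunks already in the target profile
    btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt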
* Re: Tiered storage?
From: Austin S. Hemmelgarn @ 2017-11-15 12:52 UTC
To: waxhead, Roy Sigurd Karlsbakk, linux-btrfs
On 2017-11-15 02:11, waxhead wrote:
> As a regular BTRFS user I can tell you that there is no such thing as
> hot data tracking yet. Some people seem to use bcache together with
> btrfs and come asking for help on the mailing list.
Bcache works fine on recent kernels; it was only with older versions
that there were issues. dm-cache similarly works fine on recent
versions. In both cases though, you need to be sure you know what
you're doing, otherwise you are liable to break things.
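For anyone wanting to try it, a minimal bcache sketch (assuming
/dev/sdb is the big backing disk and /dev/sdc the SSD; device names
are illustrative):

    # create backing and cache device in one step, which also attaches
    # the cache to the backing device
    make-bcache -B /dev/sdb -C /dev/sdc

    # put btrfs on the resulting composite device
    mkfs.btrfs /dev/bcache0
    mount /dev/bcache0 /mnt

    # optional: cache writes too, at some added risk
    echo writeback > /sys/block/bcache0/bcache/cache_mode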
>
> Raid5/6 have received a few fixes recently, and it *may* soon be worth
> trying out raid5/6 for data while keeping metadata in raid1/10 (I would
> rather lose a file or two than the entire filesystem).
> I had plans to run some tests on this a while ago, but forgot about it.
> As all good citizens do, remember to have good backups. The last time I
> tested raid5/6 I ran into issues quickly. For what it's worth, raid1/10
> seems pretty rock solid as long as you have sufficient disks (hint: you
> need more than two for raid1 if you want to stay safe).
Parity profiles (raid5 and raid6) still have issues, although there are
fewer than there were, with most of the remaining issues surrounding
recovery. I would still recommend against them for production usage.
Simple replication (raid1) is pretty much rock solid as long as you keep
on top of replacing failing hardware and aren't stupid enough to run the
array degraded for any extended period of time (converting to a
single-device volume instead of leaving things with half a volume is
vastly preferred for multiple reasons).
Striped replication (raid10) is generally fine, but you can get much
better performance by running BTRFS with a raid1 profile on top of two
MD/LVM/Hardware RAID0 volumes (BTRFS still doesn't do a very good job of
parallelizing things).
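That layout looks roughly like this (assuming four disks sda through
sdd; names are illustrative):

    # two striped MD arrays...
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
    mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd

    # ...mirrored by btrfs, which keeps checksum-based self-healing
    mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1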
>
> As for dedupe, there is (to my knowledge) nothing fully automatic yet.
> You have to run a program to scan your filesystem, but all the
> deduplication is done in the kernel.
> duperemove seemed to work quite well when I tested it, but there may
> be some performance implications.
Correct, there is nothing automatic (and there are pretty significant
arguments against doing automatic deduplication in most cases), but the
off-line options (via the EXTENT_SAME ioctl) are reasonably reliable.
Duperemove in particular does a good job, though it may take a long time
for large data sets.
As far as performance goes, it's no worse than having large numbers of
snapshots; the issues arise from using very large numbers of reflinks.
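If you want to see what a dedup pass actually gained you, recent
btrfs-progs can break usage down into shared versus exclusive bytes
(the path is illustrative):

    # 'shared' counts data referenced by more than one file or snapshot
    btrfs filesystem du -s /mnt/data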
>
> Roy Sigurd Karlsbakk wrote:
>> Hi all
>>
>> I've been following this project on and off for quite a few years, and
>> I wonder if anyone has looked into tiered storage on it. By tiered
>> storage, I mean hot data lying on fast storage and cold data on slow
>> storage. I'm not talking about caching (where you just keep a copy of
>> the hot data on the fast storage).
>>
>> And btw, how far are raid[56] and block-level dedup from being usable
>> in production?
>>
>> Kind regards
>>
>> roy
* Re: Tiered storage?
From: Roy Sigurd Karlsbakk @ 2017-11-15 14:10 UTC
To: Austin S Hemmelgarn; +Cc: waxhead, linux-btrfs
>> As for dedupe, there is (to my knowledge) nothing fully automatic yet.
>> You have to run a program to scan your filesystem, but all the
>> deduplication is done in the kernel.
>> duperemove seemed to work quite well when I tested it, but there may
>> be some performance implications.
> Correct, there is nothing automatic (and there are pretty significant
> arguments against doing automatic deduplication in most cases), but the
> off-line options (via the EXTENT_SAME ioctl) are reasonably reliable.
> Duperemove in particular does a good job, though it may take a long time
> for large data sets.
>
> As far as performance goes, it's no worse than having large numbers of
> snapshots; the issues arise from using very large numbers of reflinks.
What is this "large" number of snapshots? Not that it's directly comparable, but I've worked with ZFS for a while and haven't seen those issues there.
Kind regards
roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita. (Carve the good in stone, write the bad in snow.)
* Re: Tiered storage?
From: Duncan @ 2017-11-15 22:09 UTC
To: linux-btrfs
Roy Sigurd Karlsbakk posted on Wed, 15 Nov 2017 15:10:08 +0100 as
excerpted:
>>> As for dedupe, there is (to my knowledge) nothing fully automatic yet.
>>> You have to run a program to scan your filesystem, but all the
>>> deduplication is done in the kernel.
>>> duperemove seemed to work quite well when I tested it, but there may
>>> be some performance implications.
>> Correct, there is nothing automatic (and there are pretty significant
>> arguments against doing automatic deduplication in most cases), but the
>> off-line options (via the EXTENT_SAME ioctl) are reasonably reliable.
>> Duperemove in particular does a good job, though it may take a long
>> time for large data sets.
>>
>> As far as performance, it's no worse than large numbers of snapshots.
>> The issues arise from using very large numbers of reflinks.
>
> What is this "large" number of snapshots? Not that it's directly
> comparable, but I've worked with ZFS for a while and haven't seen
> those issues there.
Btrfs has scaling issues with reflinks, not so much in normal operation,
but when it comes to filesystem maintenance such as btrfs check and btrfs
balance.
Numerically, low double-digits of reflinks per extent seem to be
reasonably fine, high double-digits to low triple-digits begin to run
into scaling issues, and at high triple-digits to over 1000... better be
prepared to wait a while (it can be days or weeks!) for that balance or
check to complete, and check requires LOTS more memory as well,
particularly at TB+ scale.
Of course snapshots are the common instance of reflinking, and each
snapshot is another reflink to each extent of the data in the subvolume
it covers, so limiting snapshots to 10-50 per subvolume is
recommended, and staying under 250-ish is STRONGLY recommended.
(The total number of snapshots per filesystem, where there are many
subvolumes and the snapshots per subvolume fall within the above limits,
doesn't seem to be a problem.)
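A minimal thinning sketch along those lines (assuming timestamped,
lexically sortable snapshot names under /mnt/snapshots; the layout is
hypothetical):

    # keep only the 30 newest snapshots of a subvolume, delete the rest
    ls -1d /mnt/snapshots/home-* | head -n -30 \
        | xargs -r -n1 btrfs subvolume delete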
Dedupe uses reflinking too, but the effects can be much more variable
depending on the use-case and how many actual reflinks are being created.
A single extent with 1000 deduping reflinks, as might be common in a
commercial/hosting use-case, shouldn't be too bad, perhaps comparable to
a single snapshot. But obviously, do that with a bunch of extents (as a
hosting use-case might) and it quickly builds to the effect of 1000
snapshots of the same subvolume, which as mentioned above puts
maintenance-task time out of the realm of the reasonable, for many.
Tho of course in a commercial/hosting case maintenance may well not be
done at all, since a simple swap-in of a fresh backup is more likely, so
it may not matter for that scenario.
OTOH, a typical individual/personal use-case may dedup many files but
only single-digit times each, so the effect would be the same as a single-
digit number of snapshots at worst.
Meanwhile, although btrfs quotas are finally maturing in terms of
actually tracking the numbers correctly, their effect on scaling is
pretty bad too. The recommendation is to keep btrfs quotas off unless
you actually need them. If you do need quotas, temporarily disable them
while doing balances and device-removes (which do implicit balances),
then quota-rescan after the balance is done, because precisely tracking
quotas thru a balance means repeatedly recalculating the numbers during
the balance, and that just doesn't scale.
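In practice that dance looks something like this (the mount point is
illustrative; note that disabling quotas drops the existing qgroup
configuration, so expect to re-apply any limits afterward):

    btrfs quota disable /mnt
    btrfs balance start -dusage=50 /mnt    # or a device remove
    btrfs quota enable /mnt
    btrfs quota rescan -w /mnt             # -w waits for the rescan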
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Tiered storage?
From: Kai Krakow @ 2017-11-16 16:42 UTC
To: linux-btrfs
On Wed, 15 Nov 2017 08:11:04 +0100, waxhead <waxhead@dirtcellar.net>
wrote:
> As for dedupe, there is (to my knowledge) nothing fully automatic yet.
> You have to run a program to scan your filesystem, but all the
> deduplication is done in the kernel.
> duperemove seemed to work quite well when I tested it, but there may
> be some performance implications.
There's bees, a near-line deduplication tool: it watches for
generation changes in the filesystem and walks the inodes. It only
looks at extents, not at files. Deduplication itself is delegated to
the kernel, which ensures all changes are data-safe. The process runs
as a daemon and processes your changes in near-realtime (delayed by
a few seconds to minutes, of course, due to the transaction commit
and hashing phases).
You need to dedicate part of your RAM to it; around 1 GB is
usually sufficient to work well enough. That RAM will be locked and
cannot be swapped out, so you should have a sufficiently equipped
system.
Works very well here (2TB of data, 1GB hash table, 16GB RAM).
Newly duplicated files are picked up within seconds, scanned (hitting
the cache most of the time, thus not requiring physical IO), and then
submitted to the kernel for deduplication.
I'd call that fully automatic: Once set up, it just works, and works
well. Performance impact is very low once the initial scan is done.
https://github.com/Zygo/bees
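For the curious, setup via the beesd wrapper shipped in that repo goes
roughly like this (the UUID is a placeholder and the config keys follow
the sample config; double-check the project README):

    # /etc/bees/<filesystem-uuid>.conf
    UUID=01234567-89ab-cdef-0123-456789abcdef
    DB_SIZE=$((1024*1024*1024))    # 1 GiB hash table, as used above

    # start it via the bundled systemd template unit
    systemctl enable --now beesd@01234567-89ab-cdef-0123-456789abcdef.service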
--
Regards,
Kai
Replies to list-only preferred.