From: Austin S Hemmelgarn <ahferroin7@gmail.com>
To: Duncan <1i5t5.duncan@cox.net>, linux-btrfs@vger.kernel.org
Subject: Re: Understanding BTRFS storage
Date: Fri, 28 Aug 2015 08:54:02 -0400 [thread overview]
Message-ID: <55E059EA.9040402@gmail.com> (raw)
In-Reply-To: <pan$9606c$9f08381a$d5df543d$2943921@cox.net>
On 2015-08-28 05:47, Duncan wrote:
> Austin S Hemmelgarn posted on Thu, 27 Aug 2015 08:01:58 -0400 as
> excerpted:
>
>>> Someone (IIRC it was Austin H) posted what I thought was an extremely
>>> good setup, a few weeks ago. Create two (or more) mdraid0s, and put
>>> btrfs raid1 (or raid5/6 when it's a bit more mature; I've been
>>> recommending waiting until 4.4 to see what the on-list reports for it
>>> look like then) on top. The btrfs raid on top lets you use btrfs' data
>>> integrity features, while the mdraid0s beneath help counteract the fact
>>> that btrfs isn't well optimized for speed yet, the way mdraid has been.
>>> And the btrfs raid on top means that a device going bad in one of the
>>> mdraid0s isn't the total loss it would normally be, since the other
>>> raid0(s), functioning as the remaining btrfs devices, let you rebuild
>>> the missing btrfs device by recreating the failed raid0.
>>>
>>> Normally, that sort of raid01 is discouraged in favor of raid10, with
>>> raid1 at the lower level and raid0 on top, for more efficient rebuilds,
>>> but btrfs' data integrity features change that story entirely. =:^)
>>>
>> Two additional things:
>> 1. If you use MD RAID1 instead of RAID0, it's just as fast for reads, no
>> slower than on top of single disks for writes, and gets you better data
>> safety guarantees than even raid6 (if you do 2 MD RAID1 devices with
>> BTRFS raid1 on top, you can lose all but one disk and still have all
>> your data).
>
> My hesitation for btrfs raid1 on top of mdraid1 is that a btrfs scrub
> doesn't scrub all the mdraid component devices.
>
> Of course if btrfs scrub finds an error, it will try to rewrite the bad
> copy from the (hopefully good) other btrfs raid1 copy, and that will
> trigger a rewrite of both/all copies on that underlying mdraid1, which
> should catch the bad one in the process no matter which one it was.
>
> But if one of the lower level mdraid1 component devices is bad while the
> other(s) are good, and mdraid happens to pick the good device, it won't
> even see and thus can't scrub the bad lower-level copy.
>
> To avoid that problem, one can of course do an mdraid-level scrub
> followed by a btrfs scrub. The mdraid-level scrub won't tell bad from
> good but will simply ensure the copies match. If it happens to pick the
> bad one at that level, the follow-on btrfs-level scrub will detect that
> and trigger a rewrite from its other copy, which again rewrites both/all
> the underlying mdraid1 component devices on that btrfs raid1 side. But
> that still wouldn't ensure the rewrite actually happened properly, so
> you're left redoing both levels yet again to verify it.
>
> Which in theory can work, but in practice, particularly on spinning rust,
> you pretty quickly reach a point when you're running 24/7 scrubs, which,
> again particularly on spinning rust, is going to kill throughput for
> pretty much any other IO going on at the same time.
Well yes, but only if you are working with large data sets. In my use
case, the usage amounts to write once, read at most twice, and the data
sets are both less than 32G, so scrubbing the lower-level RAID1 currently
takes about 10 minutes. In particular, the arrays get written to at most
once a day, and are only read when the primary data sources fail, so
performance isn't as important to me as uptime.
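
(For the curious, the two-level scrub sequence amounts to something
like the following sketch -- the md device names and the mount point
are just placeholders for my setup:)

  # Force each MD mirror back in sync; md doesn't know which copy is
  # good, it just rewrites mismatches so the members agree again.
  echo repair > /sys/block/md0/md/sync_action
  echo repair > /sys/block/md1/md/sync_action
  # Watch /proc/mdstat and wait for both passes to finish.
  cat /proc/mdstat
  # Then let btrfs verify checksums and repair any bad copy from the
  # other btrfs raid1 device (-B keeps the scrub in the foreground).
  btrfs scrub start -B /mnt/data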
>
> Which is one of the reasons I found btrfs raid1 on mdraid0 so appealing
> in comparison -- raid0 has only the single copy, which is either correct
> or incorrect. If the btrfs scrub turns up a problem, it does the
> rewrite, and a single second pass of that btrfs scrub can verify that
> the rewrite happened correctly, because there are no hidden copies being
> picked more or less randomly at the mdraid level -- only the single
> copy, either correct or incorrect. I like that determinism! =:^)
>
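
(And for anyone who wants to try that layout, the setup is roughly the
sketch below -- the disk names are obviously placeholders, adjust for
your own hardware:)

  # Two striped MD arrays, two disks each.
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
  mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd
  # btrfs raid1 across the two MD devices, for both data and metadata.
  mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1
  mount /dev/md0 /mnt/data

If one of the raid0s loses a disk, you recreate that array with mdadm
and then use btrfs replace (or btrfs device add/delete) to rebuild the
missing btrfs mirror from the surviving one.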