From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: Is metadata redundant over more than one drive with raid0 too?
Date: Sun, 4 May 2014 21:49:24 +0000 (UTC)
Message-ID: <pan$a3b93$98e8eca3$3f96ad92$ba13c4f@cox.net>
In-Reply-To: <20140503232702.GC9061@merlins.org>
Marc MERLIN posted on Sat, 03 May 2014 16:27:02 -0700 as excerpted:
> So, I was thinking. In the past, I've done this:
> mkfs.btrfs -d raid0 -m raid1 -L btrfs_raid0 /dev/mapper/raid0d*
>
> My rationale at the time was that if I lose a drive, I'll still have
> full metadata for the entire filesystem and only missing files.
> If I have raid1 with 2 drives, I should end up with 4 copies of each
> file's metadata, right?
Brendan has answered well, but sometimes a second way of putting things
helps, especially when there was originally some misconception to clear
up, as seems to be the case here. So let me try to be that rewording.
=:^)
No. Btrfs raid1 (the multi-device metadata default) is (still only) two
copies, as is btrfs dup (which is the single-device metadata default
except for SSDs). The distinction is that dup is designed for the single
device case and puts both copies on that single device, while raid1 is
designed for the multi-device case, and ensures that the two copies
always go to different devices, so loss of the single device won't kill
the metadata.
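To make the distinction concrete, here's a rough sketch of the two
cases (device names hypothetical, and assuming spinning rust, where dup
is already the single-device metadata default anyway):

  mkfs.btrfs -m dup -d single /dev/sdx            # one device, both metadata copies on it
  mkfs.btrfs -m raid1 -d raid0 /dev/sdx /dev/sdy  # two devices, the two copies on different devices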
Additional details:
I am not aware of any current possibility of having more than two copies,
no matter the mode, with a possible exception during mode conversion (say
between raid1 and raid6), altho even then, there should be only two /
active/ copies.
Dup mode being designed for single device usage only, it's normally not
available on multi-device filesystems. As Brendan mentions, the way
people sometimes get it is starting with a single-device filesystem in dup
mode and adding devices. If they then fail to balance-convert, old
metadata chunks will be dup mode on the original device, while new ones
should be created as raid1 by default. Of course a partial balance-
convert will be just that, partial, with whatever failed to convert still
dup mode on the original single device.
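For illustration, the sequence that produces that mixed state, and the
balance that cleans it up, might look something like this (hypothetical
device names and mountpoint, spinning rust assumed; verify the syntax
against your btrfs-progs version):

  mkfs.btrfs /dev/sdx                       # single device, metadata defaults to dup
  mount /dev/sdx /mnt
  btrfs device add /dev/sdy /mnt            # now multi-device; old dup metadata chunks stay on sdx
  btrfs balance start -mconvert=raid1 /mnt  # rewrite metadata chunks as raid1 across both devices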
As a result, it was originally (and I believe still is) impossible to
configure dup mode on a multi-device filesystem at all. However, someone
did post a request that dup on multi-device be added as a (normally
still heavily discouraged) option, to allow conversion back to a single
device without at any point dropping to a non-redundant, single-copy-only
configuration.
Using the two-device raid1 to single-device dup conversion as an example,
currently you can't btrfs device delete below two devices, as the result
would no longer be raid1. If both data and metadata are raid1, it's
possible to physically disconnect one device, leaving the other as the
only online copy while keeping the disconnected one in reserve, but
that's not possible when the data is single mode, and even if it were,
the disconnection would force the filesystem read-only (it's no longer a
complete raid1), making the balance-convert back to dup impossible. And
since you can't balance-convert to dup on a multi-device filesystem, the
only option is to balance-convert to single, losing the protection of
the second copy, and then do the btrfs device delete. Hence the request
to allow balance-convert to dup on a multi-device filesystem, for the
sole purpose of then allowing btrfs device delete of the second device,
converting back to a single-device filesystem without ever losing the
second-copy redundancy protection.
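As a concrete sketch of the current workaround (hypothetical device
names and mountpoint; double-check against your btrfs-progs version):

  btrfs balance start -mconvert=single /mnt  # second metadata copy lost at this point
  btrfs device delete /dev/sdy /mnt          # now a single-device filesystem
  btrfs balance start -mconvert=dup /mnt     # restore the second metadata copy

The request is essentially to make the first step -mconvert=dup instead,
so the window with only one copy never exists.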
Finally, for the single-device-filesystem case, dup mode is normally
only allowed for metadata (where it is again the default, except on
SSD), *NOT* for data. However, someone noticed and posted that
mixed-block-group mode has an interesting side effect. Mixed-bg mode,
used by default on filesystems under 1 GiB but normally discouraged
above 32-64 GiB for performance reasons, puts data and metadata in the
same shared chunks, so it actually allows (and defaults to, except on
SSD) dup for data as well as metadata. There was some discussion in
that thread as to whether that was a deliberate feature or simply an
accidental result of the sharing, and Chris Mason confirmed it was the
latter. The intention has been that dup mode is a special case for
rather critical metadata on a single device, in order to provide better
protection for it, and the fact that mixed-bg mode allows (indeed, even
defaults to) dup mode for data is entirely an accident of the mixed-bg
implementation -- albeit one that's pretty much impossible to remove.
But given that accident, and given that some users do appreciate being
able to get dup-mode data via mixed-bg mode on larger single-device
filesystems even though it reduces performance and effectively halves
storage space, I expect that at some point dup mode for data will be
added as an option in its own right. That would eliminate the
performance impact of mixed-bg mode while still offering single-device
duplicate-data redundancy on large filesystems, for those who value the
protection such duplication provides, particularly given btrfs' data
checksumming and integrity features.
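For the curious, getting dup data via mixed-bg today is just a mkfs
flag (hypothetical device; on spinning rust the shared chunks should
then default to dup as described above):

  mkfs.btrfs --mixed /dev/sdx   # data and metadata share chunks, so the dup profile covers both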
> But now I have 2 questions
> 1) btrfs has two copies of all metadata on even a single drive, correct?
By default, yes, except on SSD, where dup remains an option. But not if
you choose single (the default metadata mode on a single-device SSD) or
raid0 (a multi-device option) instead of dup.
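Either way, checking is easy enough; btrfs filesystem df on the
mountpoint shows which profile the metadata actually got (DUP, RAID1,
single, ...), so there's no need to guess:

  btrfs filesystem df /mnt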
> If so, and I have a -d raid0 -m raid0 filesystem, are both copies of the
> metadata on the same drive or is btrfs smart enough to spread out
> metadata copies so that they're not on the same drive?
If you specify raid0 metadata, there's no second metadata copy, on the
same drive or elsewhere. Further, raid0 mode stripes metadata across all
available devices so it's even more fragmented than single mode,
practically eliminating any chance of recovery in the event of device
failure.
IOW, if you have raid0 metadata and a device fails or even simply does
what would be a relatively minor temporary dropout in other raid cases,
consider the filesystem toast. (If you're extremely lucky and the
dropout was temporary, such that you can recreate the raid0 with the
dropped device, you /may/ be able to save it. And it should drop to read-
only mode as soon as a dropped device is detected to help maximize the
chance of that. But don't count on it! Simply don't use raid0 for
anything you value at all, and you won't have to worry about it.)
> 2) does btrfs lay out files on raid0 so that files aren't striped across
> more than one drive, so that if I lose a drive, I only lose whole files,
> but not little chunks of all my files, making my entire FS toast?
No. That's the distinction between raid0 mode and single mode. Raid0
mode effectively sacrifices everything else for (single thread sequential
access) speed. If a device drops out, consider anything that was raid0
toast.
In theory at least, if the metadata is intact (as it should be with a
single device drop for metadata raid1 mode), a file smaller than a single
raid0 "strip" (the size of a stripe on a single device) may still be
intact as well. As more devices are added to the raid0 stripe, a
single-device drop-out also raises the lucky-case recoverable file size,
up to stripe-size minus strip-size. It likewise improves the odds for
sub-strip-size files, since their chances approximate (N-1)/N, where N
is the number of devices in the stripe and the -1 is the single dropped
device; with four devices, for example, a sub-strip-size file has
roughly a 3/4 chance of landing entirely on the surviving three.
Additionally, it can be noted that if a file is small enough, btrfs may
actually store it in metadata instead of going to the trouble of
allocating a data chunk extent for it, and the sub-block end of a file
may similarly be stored in metadata instead of taking another whole block
of data. (Reiserfs users will be familiar with this as tail-packing.)
Of course if the metadata is dup/raid1/whatever instead of raid0/single,
these small metadata-only-stored files should be recoverable as well.
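If it matters for a particular workload, the threshold for that
inline-in-metadata storage is tunable with the max_inline mount option
(the value here is just an example; the default has varied between
versions):

  mount -o max_inline=2048 /dev/sdx /mnt   # cap inlined file data at 2 KiB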
But those are the lucky cases. As I said above, the general rule is that
anything on raid0 is destroyed if a device drops, so you never NEVER
stick anything on raid0 that you value at all, and then you won't have to
worry about it! =:^)
Meanwhile, from experience I can say that the speed of raid0 isn't
always as good as one might expect, either. It does speed up the
single-thread sequential-access case, but on today's multi-core,
multi-threading, many-tasking systems, single-IO-thread filesystem
access is actually rather rare, and then of course there's random
access as well. As a result, at least for my use-case, which apparently
involves far more independent-task parallel reads than some, I actually
found mdraid raid1, with its N copies and surprisingly good parallel
multi-IO-thread read scheduling, faster than mdraid raid0. Writes still
occurred at normal single-device speed (unlike raid5/6, which penalize
writes), since the bottleneck was the physical spinning rust. (Fast
SSDs obviously change that bottleneck, with the individual bus to the
SSD usually becoming the limit except in the underpowered-CPU case, but
raid1 write speed still remains reasonably close to the slowest
device's write speed in most cases.) Of course btrfs raid1 is currently
limited to two copies and may or may not be scheduled as efficiently as
md/raid1, but that's yet another reason why I really /really/ want
N-way mirroring for btrfs. Two-thread parallel read access certainly
beats single-thread, but from experience I know that, at least for my
use-case on spinning rust, a 3-4-thread parallel-read pattern is common
enough that I'd see the benefit. That said, I'm switching to SSD now,
and the speed there is sufficient that I suspect I'm unlikely to see
much benefit above 3-thread parallel, and might not see much from
3-thread parallel either. But I'd sure like the chance to try it, and
with the data-integrity benefits of 3-way mirroring on btrfs as well,
I'm really eager to see the feature introduced. =:^)
Of course the much safer and more flexible but still speedy compromise
is raid10, which remains the general-case ideal -- with the only caveat
being the relatively high four-device-minimum entry cost. (Tho mdraid10
does have some flexibility there and can do its form of raid10 on fewer
than four devices, at the cost of added conceptual complexity and some
speed.)
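For completeness, the four-device minimum case is simply (hypothetical
devices):

  mkfs.btrfs -m raid10 -d raid10 /dev/sdw /dev/sdx /dev/sdy /dev/sdz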
The bottom line remains, however: don't put anything on raid0 that you
value at all, such that if a device drops out of the raid0, you're
entirely OK considering it toast and simply putting the remaining
devices to other uses instead of even trying to recover. Raid0 is
optimized for one thing only, speed, and that in only one rather narrow
use-case that's increasingly uncommon in the modern age: single-thread
sequential access. The price it pays for that optimization is, IMO,
very rarely worth it, tho if you have that use-case and are prepared to
accept the data-loss risk, it can /indeed/ be worth it. Just be sure
that's really your use-case, preferably by testing a raid0 deployment
in actual use to confirm it's giving you that extra speed, because in
many cases it won't, and then it's simply NOT worth the data-risk cost,
period.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman