From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: List of known BTRFS Raid 5/6 Bugs?
Date: Sat, 8 Sep 2018 08:40:50 +0000 (UTC)
Message-ID: <pan$3058a$74a39424$67b0d289$23de5876@cox.net>
In-Reply-To: 2685207.fCDoYgCM7x@t460-skr
Stefan K posted on Fri, 07 Sep 2018 15:58:36 +0200 as excerpted:
> sorry for disturb this discussion,
>
> are there any plans/dates to fix the raid5/6 issue? Is somebody working
> on this issue? Cause this is for me one of the most important things for
> a fileserver, with a raid1 config I loose to much diskspace.
There's a more technically complete discussion of this in at least two
earlier threads you can find on the list archive, if you're interested,
but here's the basics (well, extended basics...) from a btrfs-using-
sysadmin perspective.
"The raid5/6 issue" can refer to at least three conceptually separate
issues, with different states of solution maturity:
1) Now generally historic bugs in btrfs scrub, etc., that are fixed (thus
the "historic") in current kernels and tools.  Unfortunately these will
still affect users of longer-term stale^H^Hble distros who don't update
from other sources for some time yet: because the raid56 feature wasn't
yet stable at the lock-in time for whatever versions they stabilized on,
they're unlikely to get the fixes, as that counts as new-feature
material.
If you're using a current kernel and tools, however, this issue is
fixed.  You can check the wiki for the specific versions, but with the
4.18 kernel the current latest stable, 4.18 and 4.17 (and the matching
tools versions, since the version numbers are synced) are the two latest
release series, and the two latest release series are what's best
supported and considered "current" on this list.
Also see...
2) General feature maturity: While raid56 mode should be /reasonably/
stable now, it remains one of the newer features and simply hasn't yet
had the testing of time -- the testing more mature features such as
raid1 have now had the benefit of -- that tends to flush out the smaller
and corner-case bugs.
There's nothing to do for this but test, report any bugs you find, and
wait for the maturity that time brings.
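If you do want to help with that testing, a periodic scrub is the
easiest way to exercise the checksum-and-repair paths (the mountpoint
here is just a placeholder, of course):

  $ btrfs scrub start -Bd /mnt   # -B: stay in foreground, -d: per-device stats

...and report anything unexpected to this list, along with your kernel
and btrfs-progs versions.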
Of course this is one of several reasons we so strongly emphasize and
recommend "current" on this list: even for reasonably stable and mature
features such as raid1, btrfs itself remains new enough that latent bugs
are still occasionally found and fixed, and while /some/ of those fixes
get backported to LTS kernels (with even less chance of distros
backporting tools fixes), not all of them do, and even when they do,
current still gets the fixes first.
3) The remaining issue is the infamous parity-raid write-hole that
affects all parity-raid implementations (not just btrfs) unless they take
specific steps to work around the issue.
The first thing to point out here, again, is that it's not btrfs-
specific.  Between that and the fact that it *ONLY* bites parity-raid
that is operating in degraded mode *AND* recovering from an ungraceful
shutdown, it could be argued not to be a btrfs issue at all, but rather
one inherent to parity-raid mode, and an acceptable risk to those
choosing parity-raid, because it's only a factor if an ungraceful
shutdown occurs while operating degraded.
But btrfs' COW nature, along with a couple of technical implementation
factors (the read-modify-write cycle for incomplete stripe widths, and
the way that puts existing metadata at risk when new metadata is
written), does amplify the risk somewhat compared to the same write-hole
issue in other parity-raid implementations that haven't taken write-hole
avoidance countermeasures.
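To make the read-modify-write risk concrete, here's a toy single-stripe
illustration using shell arithmetic.  The values are made up and this is
only a sketch of the principle, not anything btrfs actually does on
disk:

  D1=0x5a; D2=0x3c
  P=$(( D1 ^ D2 ))               # consistent stripe: parity = D1 xor D2
  D1=0x77                        # partial-stripe rewrite hits D1 on disk...
  # ...crash here, before the matching parity update makes it out.
  # If D2's device is then lost, "reconstructing" it from the stale
  # parity gives the wrong answer:
  printf '0x%x\n' $(( P ^ D1 ))  # prints 0x11, not the original 0x3c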
So what can be done right now?
As it happens, there is a mitigation the admin can currently take --
btrfs allows specifying data and metadata profiles separately, so even
where raid1 loses too much space to be used for both, it's possible to
run data as raid5/6 and metadata as raid1.  While btrfs raid1 only
covers loss of a single device, it doesn't have the parity-raid write
hole, because it's not parity-raid.  And for most use-cases at least,
raid1 for metadata with raid5 for data strictly limits both risks at
once: the write hole is confined to data, which in most cases is written
as full stripes and thus not subject to the problem, and the
size-doubling of raid1 is confined to metadata.
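Concretely (the device names and mountpoint below are placeholders),
that's either a pair of mkfs options or a balance-convert on an existing
filesystem:

  # new filesystem:
  $ mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd

  # or convert an existing, mounted filesystem in place:
  $ btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt

  # verify the resulting profiles:
  $ btrfs filesystem usage /mnt

The convert runs on a live filesystem, tho it rewrites everything, so
expect it to take a while on a big one.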
Meanwhile, arguably, for a sysadmin properly following the sysadmin's
first rule of backups -- that the true value of data is defined not by
arbitrary claims, but by the number of backups it is considered worth
the time/trouble/resources to have of that data -- this is a known
parity-raid risk limited specifically to the corner-case of an
ungraceful shutdown *WHILE* already operating degraded, and as such it
can be managed along with all the other known risks to the data: admin
fat-fingering, more devices going out than the array can tolerate,
general bugs affecting the filesystem or other storage-related code,
etc.
IOW, in the context of the admin's first rule of backups, no matter the
issue -- raid56 write hole or whatever other issue of the many possible
-- loss of data can *never* be a particularly big issue, because by
definition, in *all* cases, what was of most value was saved: either the
data, if it was defined as valuable enough to have a backup, or the
time/trouble/resources that would otherwise have gone into making that
backup, if the data wasn't worth it.
(One nice thing about this rule is that it covers loss of any number of
backups along with the working copy just as well as it covers loss of
just the working copy.  No matter the number of backups, the value of
the data either is or isn't worth having one more, just in case.
Similarly, the rule covers the age of the backup and updates nicely as
well, as that's just a subset of the original problem: the deciding
factor becomes the value of the data in the delta between the last
backup and the working copy, and either the risk of losing that delta is
worth updating the backup, or it's not.  Same rule, applied to a data
subset.)
So from an admin's perspective, in practice: while not entirely stable
and mature yet, and carrying the already-degraded-plus-crash corner-case
risk already known to apply to parity-raid unless mitigation steps are
taken, btrfs raid56 mode should now be within the acceptable risk range
-- a range already well covered by the risk mitigation of following an
appropriate backup policy, optionally combined with the partial
write-hole mitigation of doing data as raid5/6 with metadata as raid1.
OK, but what is being done to better mitigate the parity-raid write-hole
problem for the future, and when might we be able to use that mitigation?
There are a number of possible mitigation strategies, and code using one
of them is actually being written right now, tho it'll be (at least) a
few kernel cycles until it's considered complete and stable enough for
mainline, and, as mentioned in #2 above, even after that it'll take some
time to mature to reasonable stability.
The strategy being taken is partial-stripe-write logging.  Full stripe
writes aren't affected by the write hole and (AFAIK) won't be logged,
but partial stripe writes are read-modify-write and thus write-hole
susceptible, and will be logged.  That means small files, modifications
to existing files, the ends of large files, and much of the metadata
will be written twice: first to the log, then to the final location.
In the event of a crash, on reboot and mount, anything in the log can be
replayed, thus preventing the write hole.
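Purely as a sketch of that ordering (the log path and record format here
are invented for illustration, not btrfs' actual on-disk design), and
continuing the toy stripe above:

  LOG=/tmp/stripe.log                               # hypothetical log location
  echo "stripe=0 D1=0x77 P=0x4b" >> "$LOG" && sync  # 1) log the partial write first
  # 2) ...then read-modify-write the real stripe (D1 and its parity)...
  : > "$LOG"                                        # 3) retire the record once 2) is stable
  # A crash anywhere between 1) and 3) gets recovered at mount time by
  # replaying the log, so data and parity can't be left half-updated.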
As for the log, it'll be written using a new 3/4-way-mirroring mode:
basically raid1, but mirrored more than the two ways current btrfs raid1
is limited to (even with more than two devices in the filesystem), thus
handling the loss of multiple devices.  That 3/4-way-mirroring mode is
what's actually being developed ATM, and it will be available for other
uses as well.
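The existing two-copy limit is easy to see on any multi-device raid1
filesystem, by the way (device names and mountpoint are placeholders
again):

  $ mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd
  $ mount /dev/sdb /mnt
  $ btrfs filesystem usage /mnt   # "Data ratio: 2.00" -- two copies,
                                  # not three, despite the three devices

The new mode removes exactly that limitation.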
Actually, that's what I'm personally excited about, as years ago, when I
first looked into btrfs, I was running older devices in mdraid's raid1
mode, which does N-way mirroring.  I liked the btrfs data checksumming
and scrubbing ability, but with the older devices I didn't trust having
just two-way-mirroring and wanted at least 3-way-mirroring, so back then
I skipped btrfs and stayed with mdraid.  Later I upgraded to ssds and
decided btrfs-raid1's two-way-mirroring was sufficient, but when one of
the ssds went bad prematurely and needed replacing, I'd sure have felt a
bit better, before I got the replacement done, if I'd still had good
two-way-mirroring even with the bad device.
So I'm still interested in 3-way-mirroring and would probably use it for
some things now, were it available and "stabilish", and I'm eager to see
that code merged -- not for the parity-raid logging it'll also be used
for, but for the reliability of 3-way-mirroring itself.  Tho I'll
probably wait at least 2-5 kernel cycles after introduction and see how
it stabilizes before actually considering it stable enough to use
myself.  Even tho I do follow the backups policy above, the fact that I
don't yet consider the updated-data delta worth an updated backup
doesn't mean I want to unnecessarily risk having to redo the work since
the last backup, which means choosing the newer 3-way-mirroring over the
more stable and mature existing raid1 2-way-mirroring won't be worth it
to me until the 3-way-mirroring has had at least a /few/ kernel cycles
to stabilize.
And I'd recommend the same caution with the new raid5/6 logging mode
built on top of that multi-way-mirroring, once it's merged as well.
Don't just jump on it immediately after merge unless you're deliberately
doing so to help test for bugs, get them fixed, and get the feature
stabilized as soon as possible.  Wait a few kernel cycles, follow the
list to see how the feature's stability is coming along, and /then/ use
it -- after factoring its still-new, less-mature additional risk into
your backup risk profile, of course.
Time?  I'm not a dev, but following the list and obviously following the
new 3-way-mirroring, I'd say probably not 4.20 (5.0?) for the new
mirroring modes, so 4.21/5.1 is more reasonably likely (if all goes
well; it could be longer), then probably another couple cycles (again if
all goes well) for the parity-raid logging code built on top of the new
mirroring modes, so perhaps a year (~5 kernel cycles) to introduction
for it.  Then wait however many cycles until you think it has
stabilized; call that another year.  So say about 10 kernel cycles, or
two years.  It could be a bit less than that, say 5-7 cycles, if things
go well and you take it before I'd really consider it stable enough to
recommend, but given the historically much-longer-than-predicted
development and stabilization times for raid56 already, it could just as
easily end up double that, 4-5 years out, too.
But raid56 logging mode for write-hole mitigation is indeed actively
being worked on right now. That's what we know at this time.
And even before that, right now, raid56 mode should already be
reasonably usable, especially if you do data raid5/6 and metadata raid1,
as long as your backup policy and practice are equally reasonable.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman