From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: dear developers, can we have notdatacow + checksumming, plz?
Date: Wed, 16 Dec 2015 09:55:08 +0000 (UTC)
Message-ID: <pan$5e79$48bd5e4f$cab49683$cb54fe41@cox.net>
In-Reply-To: 56703928.7070003@gmail.com
Austin S. Hemmelgarn posted on Tue, 15 Dec 2015 11:00:40 -0500 as
excerpted:
> AFAIUI, checksums are stored per-instance for every block. This is
> important in a multi-device filesystem in case you lose a device, so
> that you still have a checksum for the block. There should be no
> difference between extent layout and compression between devices
> however.
I don't believe that's quite correct.
What is correct, to the best of my knowledge, is that checksums are
metadata, and thus have whatever duplication/parity level metadata is
assigned.
For single devices, that is of course dup by default: 2X the metadata and
thus 2X the checksums, whether those checksums cover the single data
(single being effectively the only data choice on a single device, at
least thru 4.3, tho there's a patch adding dup data as an option that I
think should be in 4.4) or the dup metadata itself.
For multiple devices, it's default raid1 metadata, default single data,
so the picture doesn't differ much by default from the single-device
default picture. It's also possible to do single metadata, raidN data,
which really doesn't make sense except for raid0 data, and thus I believe
there's a warning about that sort of layout in newer mkfs.btrfs, or when
lowering the metadata redundancy using balance filters.
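
To make the default layouts above concrete, here is a minimal sketch in
Python. It is a toy model, not btrfs code: the names COPIES_PER_PROFILE,
default_profiles and copies are made up for illustration, and the
defaults are simply the ones described in the text (dup metadata on one
device, raid1 metadata on several, single data in both cases).

```python
# Toy model only: how many copies of a block each profile keeps,
# for the profiles mentioned in the text.
COPIES_PER_PROFILE = {"single": 1, "dup": 2, "raid0": 1, "raid1": 2}


def default_profiles(num_devices):
    """Return the default data/metadata profiles as described in the text."""
    if num_devices < 1:
        raise ValueError("need at least one device")
    metadata = "dup" if num_devices == 1 else "raid1"
    return {"data": "single", "metadata": metadata}


def copies(profiles):
    """Checksums are metadata, so they follow the metadata profile."""
    return {
        "data": COPIES_PER_PROFILE[profiles["data"]],
        "metadata": COPIES_PER_PROFILE[profiles["metadata"]],
        "checksums": COPIES_PER_PROFILE[profiles["metadata"]],
    }


if __name__ == "__main__":
    for n in (1, 2):
        print(n, "device(s):", copies(default_profiles(n)))
    # Single metadata with raid0 data: one copy of everything.
    print(copies({"data": "raid0", "metadata": "single"}))
```

In both default cases the model gives one copy of the data and two
copies of the checksums, which is the point being made: the checksum
count follows the metadata profile, not the data profile.
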
But of course it's possible to do raid1 data and metadata, which would be
two copies of each, regardless of the number of devices (except that it's
2+, of course). But the copies aren't 1:1 assigned. That is, if they're
equal generation, btrfs can read either checksum and apply it to either
data/metadata block. (Of course if they're not equal generation, btrfs
will choose the higher one. That covers a crash in the middle of a write:
either both copies are still at the same generation, because the root
block hadn't been updated to the new one on either device yet, or one is
at a higher/newer generation than the other, because the write had
finished on one device but not the other when the crash hit.)
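
The generation rule can be sketched as follows. This is only an
illustration of the rule as stated above, assuming each copy carries a
generation number; the Copy class and the candidates function are
hypothetical names, not anything from the btrfs sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    device: str       # hypothetical device label
    generation: int   # generation recorded for this copy


def candidates(copies):
    """Return the copies that may be used: those at the highest generation.

    If all copies share a generation, any of them qualifies. If they
    differ, only the newer one(s) survive the filter.
    """
    newest = max(c.generation for c in copies)
    return [c for c in copies if c.generation == newest]


if __name__ == "__main__":
    # Crash before either root block was updated: both still usable.
    print(candidates([Copy("devA", 100), Copy("devB", 100)]))
    # Crash after one device finished writing: only the newer copy is used.
    print(candidates([Copy("devA", 101), Copy("devB", 100)]))
```
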
This is why, if you have a pair of devices in raid1 and you mount one of
them degraded/writable with the other unavailable for some reason, it's
an extremely good idea not to also mount the other one writable and then
try to recombine them. Chances are the generations
wouldn't match and it'd pick the one with the higher generation, but if
they did for some reason match, and both checksums were valid on their
data, but the data differed... either one could be chosen, and a scrub
might choose either one to fix the other, as well, which could in theory
result in a file with intermixed blocks from the two different versions!
Just ensure that if one is mounted writable, it's the only one mounted
writable if there's a chance of recombining, and you'll be fine, as it'll
be the only one with advancing generations. And if by some accident both
are mounted writable separately, the best bet is to wipe one of them and
then add it back as a new device, if you're going to reintroduce it to
the same filesystem.
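
The intermixing risk can be shown with a small simulation. It is a
sketch under stated assumptions, not a model of btrfs internals: both
devices are assumed to have ended up at the same generation, zlib.crc32
stands in for the real checksum, and the choose callback is a
hypothetical stand-in for whatever decides which copy a given block is
read from.

```python
import zlib


def make_copy(blocks):
    """Store each block with a checksum that is valid for that block."""
    return [(block, zlib.crc32(block)) for block in blocks]


def read_file(copy_a, copy_b, choose):
    """Read a file block by block, taking each block from copy 0 or 1.

    Every block is verified against the checksum stored alongside it,
    so a diverged but self-consistent copy passes verification.
    """
    out = []
    for index, (a, b) in enumerate(zip(copy_a, copy_b)):
        data, csum = a if choose(index) == 0 else b
        if zlib.crc32(data) != csum:
            raise ValueError(f"checksum mismatch in block {index}")
        out.append(data)
    return b"".join(out)


if __name__ == "__main__":
    # Two versions of the same file, written while each device was
    # mounted degraded/writable on its own.
    version_a = make_copy([b"A0", b"A1", b"A2", b"A3"])
    version_b = make_copy([b"B0", b"B1", b"B2", b"B3"])
    # Alternating between copies yields a file neither side ever wrote.
    print(read_file(version_a, version_b, lambda i: i % 2))  # b'A0B1A2B3'
```

Every block passes its checksum, yet the result matches neither version,
which is why only one side should ever be allowed to advance.
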
Of course this gets a bit more complicated with 3+ device raid1, since
currently, there are still only two copies of each block and two copies of
the checksum, meaning there's at least one device without a copy of each
block, and if the filesystem is mounted degraded writable repeatedly with
a random device missing...
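
For the three-device case, a short enumeration shows what a missing
device does to the two-copy guarantee. Again this is only a sketch: it
lists every possible pair of devices a block could land on and does not
model how the real allocator picks devices. The function names are
hypothetical.

```python
from itertools import combinations


def placements(devices):
    """All ways to put the two raid1 copies of a block on distinct devices."""
    return list(combinations(devices, 2))


def copies_left(placement, missing):
    """Number of copies of a block still reachable with one device missing."""
    return sum(1 for device in placement if device != missing)


if __name__ == "__main__":
    devices = ["a", "b", "c"]
    for missing in devices:
        for placement in placements(devices):
            print(f"missing {missing}: block on {placement} has "
                  f"{copies_left(placement, missing)} copy/copies left")
```

With three devices each block is absent from exactly one of them, so
whichever device is missing, some blocks are down to a single copy and
any writes made in that state exist on only part of the set.
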
Similarly, the permutations can be calculated for the other raid types,
and for mixed raid types like raid6 data (specified) and raid1 metadata
(unspecified so the default used), but I won't attempt that here.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman