From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Roman Mamedov <rm@romanrm.net>
Cc: Chris Murphy <lists@colorremedies.com>,
Hugo Mills <hugo@carfax.org.uk>,
"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
Austin Hemmelgarn <ahferroin7@gmail.com>
Subject: Re: RAID system with adaption to changed number of disks
Date: Wed, 12 Oct 2016 15:19:16 -0400
Message-ID: <20161012191915.GF26140@hungrycats.org>
In-Reply-To: <20161012173141.GX21290@hungrycats.org>
On Wed, Oct 12, 2016 at 01:31:41PM -0400, Zygo Blaxell wrote:
> On Wed, Oct 12, 2016 at 12:25:51PM +0500, Roman Mamedov wrote:
> > Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > > A btrfs -dsingle -mdup array on a mdadm raid[56] device might have a
> > > snowball's chance in hell of surviving a disk failure on a live array
> > > with only data losses. This would work if mdadm and btrfs successfully
> > > arrange to have each dup copy of metadata updated separately, and one
> > > of the copies survives the raid5 write hole. I've never tested this
> > > configuration, and I'd test the heck out of it before considering
> > > using it.
> >
> > Not sure what you mean here, a non-fatal disk failure (i.e. within being
> > compensated by redundancy) is invisible to the upper layers on mdadm arrays.
> > They do not need to "arrange" anything, on such failure from the point of view
> > of Btrfs nothing whatsoever has happened to the /dev/mdX block device, it's
> > still perfectly and correctly readable and writable.
>
> btrfs hurls a bunch of writes for one metadata copy to mdadm, mdadm
> forwards those writes to the disks. btrfs sends a barrier to mdadm,
> mdadm must properly forward that barrier to all the disks and wait until
> they're all done. Repeat the above for the other metadata copy.

I'm not even sure btrfs does this--I haven't checked precisely what
it does in dup mode. It could send both copies of metadata to the
disks with a single barrier to separate both metadata updates from
the superblock updates. That would be bad in this particular case.
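
To illustrate why the ordering matters, here's a toy model (plain
Python, nothing btrfs- or mdadm-specific; every name is invented) of a
volatile disk cache where a crash persists an arbitrary subset of
unflushed writes, and a metadata copy is "torn" if only some of its
blocks reach media:

```python
# Thought-experiment model of dup metadata updates vs. barrier placement.
# A crash persists an arbitrary subset of cached writes; a copy is torn
# if only some of its blocks made it to media.
from itertools import chain, combinations

def subsets(s):
    # all possible sets of writes a crash might leave durable
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def torn(copy_blocks, persisted):
    hit = sum(1 for b in copy_blocks if b in persisted)
    return 0 < hit < len(copy_blocks)

A = ("A0", "A1")   # blocks of dup metadata copy A
B = ("B0", "B1")   # blocks of dup metadata copy B

# Case 1: barrier between the copies.  If we crash while B is in
# flight, A is already fully durable, so A can never be torn then.
both_torn_with_barrier = any(
    torn(A, ()) and torn(B, set(p))   # A on media, only B's writes cached
    for p in subsets(B)
)

# Case 2: one barrier after both copies.  All four writes sit in the
# cache together, so a crash can tear A and B at the same time.
both_torn_single_barrier = any(
    torn(A, set(p)) and torn(B, set(p))
    for p in subsets(A + B)
)

print(both_torn_with_barrier)    # False: one intact copy always survives
print(both_torn_single_barrier)  # True: e.g. only {A0, B0} persist
```

With the barrier between copies there is always one copy that is either
fully old or fully new; with a single barrier there are crash outcomes
where both copies are torn at once.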
> If that's all implemented correctly in mdadm, all is well; otherwise,
> mdadm and btrfs fail to arrange to have each dup copy of metadata
> updated separately.

To be clearer about the consequences of this:

If both copies of metadata are updated at the same time (because btrfs
and mdadm failed to get the barriers right), it's possible for both
copies of metadata to be in an inconsistent (unreadable) state at the
same time, which is the end of the filesystem.
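
The raid5 write hole I mentioned earlier reduces to a few lines of XOR.
A toy model (one stripe, made-up block values, degraded array with the
disk holding d1 failed):

```python
# Toy XOR-parity model of the raid5 write hole in degraded mode.
# The disk holding d1 has failed; updating d0 requires writing d0 and
# parity, and a crash between the two leaves parity stale.
d0_old, d1 = 0b1010, 0b0110
parity = d0_old ^ d1            # parity of the healthy stripe

d0_new = 0b0011                 # rewrite d0 while degraded
# crash here: d0_new reached the disk, the parity update did not

reconstructed_d1 = d0_new ^ parity   # how a degraded read rebuilds d1
print(reconstructed_d1 == d1)        # False: d1 is silently damaged
```

Note that d1 -- a block the write never touched -- comes back corrupted,
which is why damage isn't confined to the data being written.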

In degraded RAID5/6 mode, every write temporarily corrupts data, so if
there is an interruption (system crash, a disk timeout, etc.) while the
array is degraded, one of the metadata copies will be damaged. The
damage may not be limited to the current commit, so we need the second
copy of the metadata intact to recover from broken changes to the first
copy. Metadata chunks are usually larger than RAID5 stripes, so this
works out for btrfs on mdadm RAID5 (maybe not if two metadata chunks
are adjacent and not stripe-aligned, but that's a rare case, one that
only affects arrays whose disk count is not a power of 2 plus 1 for
RAID5, or a power of 2 plus 2 for RAID6).

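The alignment arithmetic can be checked with back-of-the-envelope
numbers. I'm assuming mdadm's default 512 KiB chunk and 256 MiB btrfs
metadata chunks below; neither figure comes from this thread, so adjust
to taste:

```python
# Which RAID5 disk counts make a 256 MiB metadata chunk a whole number
# of full stripes?  (Assumed sizes: 512 KiB mdadm chunk, 256 MiB btrfs
# metadata chunk -- illustrative defaults, not measured values.)
KiB = 1024
md_chunk = 512 * KiB            # mdadm per-disk chunk size
meta_chunk = 256 * 1024 * KiB   # btrfs metadata chunk size

def stripe_data_width(ndisks, parity_disks=1):
    # usable data per full stripe = (disks minus parity) * chunk
    return (ndisks - parity_disks) * md_chunk

aligned = [n for n in range(3, 11)
           if meta_chunk % stripe_data_width(n) == 0]
print(aligned)   # [3, 5, 9]: exactly the power-of-2-plus-1 disk counts
```

With these sizes, 3-, 5-, and 9-disk arrays divide evenly; everything
else leaves metadata chunk boundaries mid-stripe.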
> The present state of the disks is irrelevant. The array could go
> degraded due to a disk failure at any time, so for practical failure
> analysis purposes, only the behavior in degraded mode is relevant.
>
> >
> > --
> > With respect,
> > Roman
>
>