From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Chris Murphy <lists@colorremedies.com>,
Andrei Borzenkov <arvidjaar@gmail.com>
Cc: Hugo Mills <hugo@carfax.org.uk>,
Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
kreijack@inwind.it, Roman Mamedov <rm@romanrm.net>,
Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Adventures in btrfs raid5 disk recovery
Date: Fri, 24 Jun 2016 14:19:53 -0400
Message-ID: <c2a320a6-261b-723d-ab83-58f883e6315b@gmail.com>
In-Reply-To: <CAJCQCtSskA4PC_a8tgQopHFNO83NQ=Gkx406haB7G0nBi5e=2A@mail.gmail.com>
On 2016-06-24 13:52, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 11:21 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
>> 24.06.2016 20:06, Chris Murphy wrote:
>>> On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
>>>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>>>> (meta)data and RAID56 parity is not data.
>>>>>
>>>>> Checksums are not parity, correct. However, every data block
>>>>> (including, I think, the parity) is checksummed and put into the csum
>>>>> tree. This allows the FS to determine where damage has occurred,
>>>>> rather than simply detecting that it has occurred (which would be the
>>>>> case if the parity doesn't match the data, or if the two copies of a
>>>>> RAID-1 array don't match).
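To illustrate the distinction above, here is a toy Python sketch (not btrfs code) of a three-strip RAID5 stripe with XOR parity and a checksum per data strip. zlib.crc32 stands in for btrfs's default crc32c, and the helper names are hypothetical. The parity check only says the stripe is inconsistent; the per-strip checksums say which strip is damaged.

```python
import zlib
from functools import reduce


def xor_strips(strips):
    """XOR equal-length byte strings column by column (RAID5 parity math)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))


def csum(block):
    # Stand-in checksum: zlib.crc32 is CRC-32 (IEEE), not btrfs's crc32c.
    return zlib.crc32(block)


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_strips(data)
stored_csums = [csum(d) for d in data]

# Silent corruption of one byte in strip 1.
damaged = [data[0], b"BXBB", data[2]]

# Parity only tells us the stripe is inconsistent...
parity_ok = xor_strips(damaged) == parity
# ...while per-strip checksums tell us which strip is wrong.
bad_strips = [i for i, d in enumerate(damaged) if csum(d) != stored_csums[i]]
print(parity_ok, bad_strips)  # False [1]
```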
>>>>>
>>>>
>>>> Yes, that is what I wrote below. But that means that RAID5 with one
>>>> degraded disk won't be able to reconstruct data on this degraded disk
>>>> because reconstructed extent content won't match checksum. Which kinda
>>>> makes RAID5 pointless.
>>>
>>> I don't understand this. Whether the failed disk means a stripe is
>>> missing a data strip or parity strip, if any other strip is damaged of
>>> course the reconstruction isn't going to match checksum. This does not
>>> make raid5 pointless.
>>>
>>
>> Yes, you are right. We have a double failure here. Still, in the current
>> situation we apparently may end up with btrfs reconstructing a missing block
>> using wrong information. As was mentioned elsewhere, btrfs does not
>> verify the checksum of a reconstructed block, meaning data corruption.
>
> Well that'd be bad, but also good in that it would explain a lot of
> problems people have when metadata is also raid5. In this whole thread
> the premise is the metadata is raid1, so the fs doesn't totally
> faceplant; we just get a bunch of weird data corruptions. The metadata
> raid5 cases were sorta "WTF happened?" and not much was really said
> about it other than telling the user to scrape off what they can and
> start over.
>
> Anyway, while not good I still think this is not super problematic to
> at least *do* check EXTENT_CSUM after reconstruction from parity
> rather than assuming that reconstruction happened correctly. The data
> needed to pass/fail the rebuild is already on the disk. It just needs
> to be checked.
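A minimal sketch of the "check our work after reconstruction" idea, using a toy stripe model rather than the real btrfs RAID56 code: rebuild the missing data strip by XORing the surviving strips with parity, then compare the result against the checksum already stored for that block, and fail instead of returning unverified data. zlib.crc32 stands in for crc32c; rebuild_and_verify and ChecksumMismatch are hypothetical names.

```python
import zlib
from functools import reduce


class ChecksumMismatch(Exception):
    """Hypothetical error: the rebuilt block does not match its stored csum."""


def xor_strips(strips):
    """XOR equal-length byte strings column by column (RAID5 parity math)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))


def csum(block):
    # Stand-in checksum: zlib.crc32 is CRC-32 (IEEE), not btrfs's crc32c.
    return zlib.crc32(block)


def rebuild_and_verify(survivors, parity, expected_csum):
    """Rebuild one missing data strip, then check it before returning it."""
    rebuilt = xor_strips(list(survivors) + [parity])
    if csum(rebuilt) != expected_csum:
        # Double failure: a surviving strip (or the parity) was also bad.
        raise ChecksumMismatch("reconstructed block fails checksum")
    return rebuilt


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_strips(data)
stored_csums = [csum(d) for d in data]

# The disk holding strip 2 is gone; the survivors are intact.
print(rebuild_and_verify(data[:2], parity, stored_csums[2]))  # b'CCCC'

# Same loss, but strip 0 was also silently corrupted.
try:
    rebuild_and_verify([b"AXAA", data[1]], parity, stored_csums[2])
except ChecksumMismatch as err:
    print("rebuild refused:", err)
```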
>
> Better would be to get parity csummed and put into the csum tree. But
> I don't know how much that helps. Think about always computing and
> writing csums for parity, which almost never get used, vs. keeping
> things the way they are now and just *checking our work* after
> reconstruction from parity. If there's some obvious major advantage to
> checksumming the parity I'm all ears but I'm not thinking of it at the
> moment.
>
Well, the obvious major advantage to checksumming parity that comes to
mind for me is that it would let us scrub the parity data itself and
verify it. I'd personally much rather know my parity is bad before I need
to use it than after using it to reconstruct data and getting an error
there, and I'd be willing to bet that most seasoned sysadmins working for
companies using big storage arrays feel the same way. I could see it
being practical to have an option to turn this
off for performance reasons or similar, but again, I have a feeling that
most people would rather be able to check if a rebuild will eat data
before trying to rebuild (depending on the situation in such a case, it
will sometimes just make more sense to nuke the array and restore from a
backup instead of spending time waiting for it to rebuild).
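A toy sketch of the parity scrub described above, assuming, as proposed in this thread, that each parity strip gets its own checksum stored alongside the data checksums. scrub_stripe is a hypothetical helper and zlib.crc32 stands in for crc32c. With a parity checksum, rotten parity shows up in a scrub before it is ever used for a rebuild.

```python
import zlib
from functools import reduce


def xor_strips(strips):
    """XOR equal-length byte strings column by column (RAID5 parity math)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))


def csum(block):
    # Stand-in checksum: zlib.crc32 is CRC-32 (IEEE), not btrfs's crc32c.
    return zlib.crc32(block)


def scrub_stripe(data, parity, data_csums, parity_csum):
    """Return indices of bad data strips, plus "P" if the parity strip is bad."""
    bad = [i for i, d in enumerate(data) if csum(d) != data_csums[i]]
    # Hypothetical: parity has its own checksum, so it is verified directly.
    if csum(parity) != parity_csum:
        bad.append("P")
    return bad


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_strips(data)
data_csums = [csum(d) for d in data]
parity_csum = csum(parity)

# Flip every bit of the first parity byte to simulate bit rot.
rotted_parity = bytes([parity[0] ^ 0xFF]) + parity[1:]
print(scrub_stripe(data, parity, data_csums, parity_csum))         # []
print(scrub_stripe(data, rotted_parity, data_csums, parity_csum))  # ['P']
```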