From: Anand Jain <anand.jain@oracle.com>
To: bo.li.liu@oracle.com
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 2/4] Btrfs: fix data corruption in raid6
Date: Thu, 9 Nov 2017 17:29:25 +0800
Message-ID: <ca33de53-1894-09d9-240d-d0f128a5fc83@oracle.com>
In-Reply-To: <20171108195315.GB14443@dhcp-10-11-181-32.int.fusionio.com>
On 11/09/2017 03:53 AM, Liu Bo wrote:
> On Tue, Nov 07, 2017 at 04:32:55PM +0800, Anand Jain wrote:
>>
>>
>> On 11/02/2017 08:54 AM, Liu Bo wrote:
>>> With raid6 profile, btrfs can end up with data corruption by the
>>> following steps.
>>>
>>> Say we have 5 disks set up with the raid6 profile,
>>>
>>> 1) mount this btrfs
>>> 2) one disk gets pulled out
>>> 3) write something to btrfs and sync
>>> 4) another disk gets pulled out
>>> 5) write something to btrfs and sync
>>> 6) umount this btrfs
>>> 7) bring the two disks back
>>> 8) reboot
>>> 9) mount this btrfs
>>>
>>> Chances are that mount will fail (even with -o degraded) because of
>>> failures to read metadata blocks; IOW, our raid6 setup is not able to
>>> protect against two disk failures in some cases.
>>>
>>> So it turns out that there is a bug in raid6's recovery code, that is,
>>>
>>> if we have one of the stripes in the raid6 layout as shown here,
>>>
>>> | D1(*) | D2(*) | D3 | D4 | D5 |
>>> -------------------------------------
>>> | data1 | data2 | P | Q | data0 |
>>
>>
>>> D1 and D2 are the two disks which got pulled out and brought back.
>>> When mount reads data1 and finds that data1 doesn't match its crc,
>>> btrfs goes to recover data1 from the other stripes; what it does is
>>>
>>> 1) read data2, parity P, parity Q and data0
>>>
>>> 2) as we have a valid parity P and two data stripes, it goes to
>>> recover in raid5 style.
>>
>>
>>> (However, since disk D2 was pulled out and data2 on D2 could be stale,
>>
   data2 should end up with a crc error; if not, then raid5 recovery is
   as expected, OR this example is confusing for explaining the context
   of two data stripes missing.
>
> That assumption is not true:
> when doing reconstruction, we
> don't verify the checksum of each data stripe that is read from disk.
Why? What if data0 (which has the In_sync flag) was wrong?
I am wary of the In_sync approach. IMO we are losing the
advantage of the FS being merged with the volume manager, which is to
know exactly which active stripes were lost for quicker reconstruction.
Let's say the RAID6 is 50% full when disk1 is pulled out, disk2 is
pulled out at 75% full, and it reaches 90% full.
So now when all the disks are back, two disks will not have the
In_sync flag? I hope I understood this correctly.
Now:
15% of the data has lost two stripes.
25% of the data has lost one stripe.
50% of the data did not lose any stripe at all.
Now for read and reconstruct:
50% of the data does not need any reconstruction.
25% of the data needs RAID5-style reconstruction, as it has lost
only one stripe.
15% of the data needs RAID6 reconstruction, since it has lost two
stripes.
Does the In_sync design help it work like this?
Now for RAID1, let's say disk1 lost the first 10% of the stripes and
disk2 lost the last 10% of the stripes, and these ranges are mutually
exclusive. That means there is no single stripe which has been lost
completely.
There is no code yet where I can see when In_sync will be reset,
which is fine for this example. Assume no reconstruction is done. But
now, when both disks are present at mount and In_sync is based on
dev_item.generation, only one of them will get In_sync. To me it looks
like 10% of the stripes will then fail to read, even though that is not
actually true in this example.
Thanks, Anand
> Thanks,
>
> -liubo
>>
>> Thanks, Anand
>>
>>
>>> we still get the wrong data1 from this reconstruction.)
>>>
>>> 3) btrfs continues to try to reconstruct data1 from parity Q, data2
>>> and data0, we still get the wrong one for the same reason.
>>>
>>> The fix here is to take advantage of the device flag, i.e. 'In_sync':
>>> all data on a device might be stale if 'In_sync' has not been set.
>>>
>>> With this, we can build the correct data2 from parity P, Q and data1.
>>>
>>> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
>>> ---
>>> fs/btrfs/raid56.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
>>> index 67262f8..3c0ce61 100644
>>> --- a/fs/btrfs/raid56.c
>>> +++ b/fs/btrfs/raid56.c
>>> @@ -1076,7 +1076,8 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
>>>  	disk_start = stripe->physical + (page_index << PAGE_SHIFT);
>>>  
>>>  	/* if the device is missing, just fail this stripe */
>>> -	if (!stripe->dev->bdev)
>>> +	if (!stripe->dev->bdev ||
>>> +	    !test_bit(In_sync, &stripe->dev->flags))
>>>  		return fail_rbio_index(rbio, stripe_nr);
>>>  
>>>  	/* see if we can add this page onto our existing bio */
>>>