From: Anand Jain <anand.jain@oracle.com>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH RFC] btrfs: fix read corrpution from disks of different generation
Date: Wed, 20 Mar 2019 21:54:16 +0800 [thread overview]
Message-ID: <ce5be72e-5fae-932b-4ba7-85e0b3d8b825@oracle.com> (raw)
In-Reply-To: <541e2efd-ae4f-bc06-1d08-46f55208a095@gmx.com>
On 3/20/19 2:27 PM, Qu Wenruo wrote:
>
>
> On 2019/3/20 下午1:47, Anand Jain wrote:
>>
>>
>>>>>> A tree based integrity verification
>>>>>> is important for all data, which is missing.
>>>>>> Fix:
>>>>>> In this RFC patch it proposes to use same disk from with the
>>>>>> metadata
>>>>>> is read to read the data.
>>>>>
>>>>> The obvious problem I found is, the idea only works for RAID1/10.
>>>>>
>>>>> For striped profile it makes no sense, or even have a worse chance to
>>>>> get stale data.
>>>>>
>>>>>
>>>>> To me, the idea of using possible better mirror makes some sense, but
>>>>> very profile limited.
>>>>
>>>> Yep. This problem and fix is only for the mirror based profiles
>>>> such as raid1/raid10.
>>>
>>> Then current implementation lacks such check.
>>>
>>> Further more, data and metadata can lie in different chunks and have
>>> different chunk types.
>>
>> Right. Current tests for this RFC were only for raid1.
>>
>> But the final patch can fix that.
>>
>> In fact current patch works for all the cases except for the case of
>> replace is running and mix of metadata:raid1 and data:raid56
>>
>> We need some cleanups in mirror_num, basically we need to bring it
>> under #define. and handle it accordingly in __btrfs_map_block()
>
>
> Wait for a minute.
>
> There is a hidden pitfall from the very beginning.
>
> Consider such chunk layout:
> Chunk A Type DATA|RAID1
> Stripe 1: Dev 1
> Stripe 2: Dev 2
>
> Chunk B Type METADATA|RAID1
> Stripe 1: Dev 2
> Stripe 2: Dev 1
Right this fix can't solve this case.
So one down. The other choice I had is to
Quarantine whole disk by comparing dev1 and dev2 generation # in the
superblock during mount. or
Quarantine whole chunk(s) by comparing dev1 and dev2 generation # in
the chunk tree during mount. or
Always read eb from both dev1 and dev2, if fails then fail its
corresponding data block as well if any. or
During mount compare dev1 and dev2 generation #, record the range of
generation number missing on the disk. But this need ondisk format
changes. (But nodatacow, notrunc reg data extent write must change gen#). or
Similar to btree eb read pages, nest the extent data read as well.
(But nodatacow, notrunc reg data extent write must change gen#).
Thanks, Anand
> Then when we found stale metadata in chunk B mirror 1, caused by dev 2,
> then your patch consider device 2 stale, and try to use mirror num 2 to
> read from data chunk.
>
> However in data chunk, mirror num 2 means it's still from device 2, and
> we get stale data.
>
> So the eb->mirror_num can still map to bad/stale device, due to the
> flexibility provided by btrfs per-chunk mapping.
>
> Thanks,
> Qu
>
>>
>>>>> Another idea I get inspired from the idea is, make it more generic so
>>>>> that bad/stale device get a lower priority.
>>>>
>>>> When it comes to reading junk data, its not about the priority its
>>>> about the eliminating. When the problem is only few blocks, I am
>>>> against making the whole disk as bad.
>>>>
>>>>> Although it suffers the same problem as I described.
>>>>>
>>>>> To make the point short, the use case looks very limited.
>>>>
>>>> It applies to raid1/raid10 with nodatacow (which implies nodatasum).
>>>> In my understanding that's not rare.
>>>>
>>>> Any comments on the fix offered here?
>>>
>>> The implementation part is, is eb->read_mirror reliable?
>>>
>>> E.g. if the data and the eb are in different chunks, and the stale
>>> happens in the chunk of eb but not in the data chunk?
>>
>>
>> eb and regular data are indeed in different chunks always. But eb
>> can never be stale as there is parent transid which is verified against
>> the read eb. However we do not have such a check for the data (this is
>> the core of the issue here) and so we return the junk data silently.
>>
>> Also any idea why the generation number for the extent data is not
>> incremented [2] when -o nodatacow and notrunc option is used, is it
>> a bug? the dump-tree is taken with the script as below [1]
>> (this corruption is seen with or without generation number is
>> being incremented, but as another way to fix for the corruption we can
>> verify the inode EXTENT_DATA generation from the same disk from which
>> the data is read).
>>
>> [1]
>> umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb && \
>> mount -o notreelog,max_inline=0,nodatasum /dev/sdb /btrfs && \
>> echo 1st write: && \
>> dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
>> conv=fsync,notrunc && sync && \
>> btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
>> echo --- && \
>> btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA" && \
>> echo 2nd write: && \
>> dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
>> conv=fsync,notrunc && sync && \
>> btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
>> echo --- && \
>> btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA"
>>
>>
>> 1st write:
>> item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>> generation 6 transid 6 size 4096 nbytes 4096
>> block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>> sequence 1 flags 0x3(NODATASUM|NODATACOW)
>> atime 1553058460.163985452 (2019-03-20 13:07:40)
>> ctime 1553058460.163985452 (2019-03-20 13:07:40)
>> mtime 1553058460.163985452 (2019-03-20 13:07:40)
>> otime 1553058460.163985452 (2019-03-20 13:07:40)
>> ---
>> item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>> generation 6 type 1 (regular)
>> extent data disk byte 13631488 nr 4096
>> extent data offset 0 nr 4096 ram 4096
>> extent compression 0 (none)
>> 2nd write:
>> item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>> generation 6 transid 7 size 4096 nbytes 4096
>> block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>> sequence 2 flags 0x3(NODATASUM|NODATACOW)
>> atime 1553058460.163985452 (2019-03-20 13:07:40)
>> ctime 1553058460.189985450 (2019-03-20 13:07:40)
>> mtime 1553058460.189985450 (2019-03-20 13:07:40)
>> otime 1553058460.163985452 (2019-03-20 13:07:40)
>> ---
>> item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>> generation 6 type 1 (regular) <----- [2]
>> extent data disk byte 13631488 nr 4096
>> extent data offset 0 nr 4096 ram 4096
>> extent compression 0 (none)
>>
>>
>> Thanks, Anand
>
next prev parent reply other threads:[~2019-03-20 13:54 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-03-19 11:35 [PATCH RFC] btrfs: fix read corrpution from disks of different generation Anand Jain
2019-03-19 11:35 ` [PATCH] fstests: btrfs test read " Anand Jain
2020-04-06 12:00 ` [PATCH v2] " Anand Jain
2019-03-19 12:07 ` [PATCH RFC] btrfs: fix read corrpution " Qu Wenruo
2019-03-19 23:41 ` Anand Jain
2019-03-20 1:02 ` Qu Wenruo
2019-03-20 5:47 ` Anand Jain
2019-03-20 6:19 ` Qu Wenruo
2019-03-20 14:00 ` Anand Jain
2019-03-20 14:40 ` Qu Wenruo
2019-03-20 15:40 ` Zygo Blaxell
2019-03-21 6:37 ` Anand Jain
2019-03-20 6:27 ` Qu Wenruo
2019-03-20 13:54 ` Anand Jain [this message]
2019-03-20 15:46 ` Zygo Blaxell
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ce5be72e-5fae-932b-4ba7-85e0b3d8b825@oracle.com \
--to=anand.jain@oracle.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=quwenruo.btrfs@gmx.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox