From: Anand Jain <anand.jain@oracle.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH RFC] btrfs: fix read corrpution from disks of different generation
Date: Thu, 21 Mar 2019 14:37:58 +0800 [thread overview]
Message-ID: <00492b06-9c55-985a-72fa-bb181c4137ba@oracle.com> (raw)
In-Reply-To: <20190320154048.GD16651@hungrycats.org>
On 3/20/19 11:40 PM, Zygo Blaxell wrote:
> On Wed, Mar 20, 2019 at 10:40:07PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2019/3/20 下午10:00, Anand Jain wrote:
>>>
>>>>> Also any idea why the generation number for the extent data is not
>>>>> incremented [2] when -o nodatacow and notrunc option is used, is it
>>>>> a bug? the dump-tree is taken with the script as below [1]
>>>>> (this corruption is seen with or without generation number is
>>>>> being incremented, but as another way to fix for the corruption we can
>>>>> verify the inode EXTENT_DATA generation from the same disk from which
>>>>> the data is read).
>>>>
>>>> For the generation part, it's the generation when data is written to
>>>> disk.
>>>>
>>>> Truncation/nocow overwrite shouldn't really change the generation of
>>>> existing file extents.
>>>>
>>>> So I'm afraid you can't use that generation to do the check.
>>>
>>> Any idea why it shouldn't change? Albeit there isn't new allocation
>>> due to nodatacow and notrunc overwrite, but sure data is overwritten.
>
> The references to the extent in the subvol trees hold a copy of the
> extent's generation, so if the extent's generation is modified, all the
> references to the extent in all the subvol trees have to be modified too,
> or they fail transid verification later on.
Extents with refs > 1 will cow , so gen # is incremented on the
overwritten extents. But my scenario is when refs = 1. As one of
the idea I am mulling is to track/record missed generation numbers
on the disks and quarantine those reads on those disks accordingly.
> Compared to pure datacow
> (nodatasum, compress=none), this would be the same number of iops, only
> fragmentation can be saved because of block overwrites (and even that
> isn't saved all the time).
>
>>> If that's the case then I would guess there will be bug in send receive
>>> as well.
>
> Send requires a read-only snapshot, > and the snapshot's reference to
> the nodatacow extents automatically turns on datacow for those extents.
> Thus, send behaves correctly because nodatacow is disabled.
>
> The nodatacow flag is advisory. It doesn't prevent btrfs from relocating
> data when needed.
Oh yes. I forgot, there is rdonly snapshot prerequisite for the sends.
Thanks, Anand
>> I'm not sure about the send part.
>>
>> On the other hand, if btrfs is going to update the generation of
>> nodatacow file extent overwrite, it should cause pretty big performance
>> degradation.
>>
>> The idea of nodatacow is to skip all the expensive csum, extent
>> allocation (maybe not that expensive) and the race of subvol tree.
>
> nodatacow also skips RAID data integrity checks. Generally, applications
> and admins should plan for any data put in a nodatacow file to be silently
> corrupted at any time, i.e. the same situation as an ext4-on-mdadm setup.
> This is the price for minimizing the overhead for a write to a nodatacow
> extent.
>
>> If we're going to update file extents for such case, we're re-introduce
>> performance impact to users who don't want that impact at all.
>> I don't believe it's worthy at all.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks, Anand
>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> [1]
>>>>> umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb && \
>>>>> mount -o notreelog,max_inline=0,nodatasum /dev/sdb /btrfs && \
>>>>> echo 1st write: && \
>>>>> dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
>>>>> conv=fsync,notrunc && sync && \
>>>>> btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
>>>>> echo --- && \
>>>>> btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA" && \
>>>>> echo 2nd write: && \
>>>>> dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
>>>>> conv=fsync,notrunc && sync && \
>>>>> btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
>>>>> echo --- && \
>>>>> btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA"
>>>>>
>>>>>
>>>>> 1st write:
>>>>> item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>>>>> generation 6 transid 6 size 4096 nbytes 4096
>>>>> block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>>>>> sequence 1 flags 0x3(NODATASUM|NODATACOW)
>>>>> atime 1553058460.163985452 (2019-03-20 13:07:40)
>>>>> ctime 1553058460.163985452 (2019-03-20 13:07:40)
>>>>> mtime 1553058460.163985452 (2019-03-20 13:07:40)
>>>>> otime 1553058460.163985452 (2019-03-20 13:07:40)
>>>>> ---
>>>>> item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>>>>> generation 6 type 1 (regular)
>>>>> extent data disk byte 13631488 nr 4096
>>>>> extent data offset 0 nr 4096 ram 4096
>>>>> extent compression 0 (none)
>>>>> 2nd write:
>>>>> item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>>>>> generation 6 transid 7 size 4096 nbytes 4096
>>>>> block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>>>>> sequence 2 flags 0x3(NODATASUM|NODATACOW)
>>>>> atime 1553058460.163985452 (2019-03-20 13:07:40)
>>>>> ctime 1553058460.189985450 (2019-03-20 13:07:40)
>>>>> mtime 1553058460.189985450 (2019-03-20 13:07:40)
>>>>> otime 1553058460.163985452 (2019-03-20 13:07:40)
>>>>> ---
>>>>> item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>>>>> generation 6 type 1 (regular) <----- [2]
>>>>> extent data disk byte 13631488 nr 4096
>>>>> extent data offset 0 nr 4096 ram 4096
>>>>> extent compression 0 (none)
>>>>>
>>>>>
>>>>> Thanks, Anand
>>>>
>>
>
>
>
next prev parent reply other threads:[~2019-03-21 6:38 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-03-19 11:35 [PATCH RFC] btrfs: fix read corrpution from disks of different generation Anand Jain
2019-03-19 11:35 ` [PATCH] fstests: btrfs test read " Anand Jain
2020-04-06 12:00 ` [PATCH v2] " Anand Jain
2019-03-19 12:07 ` [PATCH RFC] btrfs: fix read corrpution " Qu Wenruo
2019-03-19 23:41 ` Anand Jain
2019-03-20 1:02 ` Qu Wenruo
2019-03-20 5:47 ` Anand Jain
2019-03-20 6:19 ` Qu Wenruo
2019-03-20 14:00 ` Anand Jain
2019-03-20 14:40 ` Qu Wenruo
2019-03-20 15:40 ` Zygo Blaxell
2019-03-21 6:37 ` Anand Jain [this message]
2019-03-20 6:27 ` Qu Wenruo
2019-03-20 13:54 ` Anand Jain
2019-03-20 15:46 ` Zygo Blaxell
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=00492b06-9c55-985a-72fa-bb181c4137ba@oracle.com \
--to=anand.jain@oracle.com \
--cc=ce3g8jdj@umail.furryterror.org \
--cc=linux-btrfs@vger.kernel.org \
--cc=quwenruo.btrfs@gmx.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox