public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Anand Jain <anand.jain@oracle.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
	Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH RFC] btrfs: fix read corrpution from disks of different generation
Date: Thu, 21 Mar 2019 14:37:58 +0800	[thread overview]
Message-ID: <00492b06-9c55-985a-72fa-bb181c4137ba@oracle.com> (raw)
In-Reply-To: <20190320154048.GD16651@hungrycats.org>



On 3/20/19 11:40 PM, Zygo Blaxell wrote:
> On Wed, Mar 20, 2019 at 10:40:07PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2019/3/20 下午10:00, Anand Jain wrote:
>>>
>>>>>    Also any idea why the generation number for the extent data is not
>>>>>    incremented [2] when -o nodatacow and notrunc option is used, is it
>>>>>    a bug? the dump-tree is taken with the script as below [1]
>>>>>    (this corruption is seen with or without generation number is
>>>>>    being incremented, but as another way to fix for the corruption we can
>>>>>    verify the inode EXTENT_DATA generation from the same disk from which
>>>>>    the data is read).
>>>>
>>>> For the generation part, it's the generation when data is written to
>>>> disk.
>>>>
>>>> Truncation/nocow overwrite shouldn't really change the generation of
>>>> existing file extents.
>>>>
>>>> So I'm afraid you can't use that generation to do the check.
>>>
>>>   Any idea why it shouldn't change? Albeit there isn't new allocation
>>>   due to nodatacow and notrunc overwrite, but sure data is overwritten.
> 
> The references to the extent in the subvol trees hold a copy of the
> extent's generation, so if the extent's generation is modified, all the
> references to the extent in all the subvol trees have to be modified too,
> or they fail transid verification later on.


  Extents with refs > 1 will cow , so gen # is incremented on the
  overwritten extents. But my scenario is when refs = 1. As one of
  the idea I am mulling is to track/record missed generation numbers
  on the disks and quarantine those reads on those disks accordingly.


> Compared to pure datacow
> (nodatasum, compress=none), this would be the same number of iops, only
> fragmentation can be saved because of block overwrites (and even that
> isn't saved all the time).
> 
>>>   If that's the case then I would guess there will be bug in send receive
>>>   as well.
> 
> Send requires a read-only snapshot, > and the snapshot's reference to
> the nodatacow extents automatically turns on datacow for those extents.
> Thus, send behaves correctly because nodatacow is disabled.
> 
> The nodatacow flag is advisory.  It doesn't prevent btrfs from relocating
> data when needed.

  Oh yes. I forgot, there is rdonly snapshot prerequisite for the sends.

Thanks, Anand

>> I'm not sure about the send part.
>>
>> On the other hand, if btrfs is going to update the generation of
>> nodatacow file extent overwrite, it should cause pretty big performance
>> degradation.
>>
>> The idea of nodatacow is to skip all the expensive csum, extent
>> allocation (maybe not that expensive) and the race of subvol tree.
> 
> nodatacow also skips RAID data integrity checks.  Generally, applications
> and admins should plan for any data put in a nodatacow file to be silently
> corrupted at any time, i.e. the same situation as an ext4-on-mdadm setup.
> This is the price for minimizing the overhead for a write to a nodatacow
> extent.
> 
>> If we're going to update file extents for such case, we're re-introduce
>> performance impact to users who don't want that impact at all.
>> I don't believe it's worthy at all.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks, Anand
>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> [1]
>>>>>    umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb && \
>>>>>    mount -o notreelog,max_inline=0,nodatasum /dev/sdb /btrfs && \
>>>>>    echo 1st write: && \
>>>>>    dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
>>>>> conv=fsync,notrunc && sync && \
>>>>>    btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
>>>>>    echo --- && \
>>>>>    btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA" && \
>>>>>    echo 2nd write: && \
>>>>>    dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
>>>>> conv=fsync,notrunc && sync && \
>>>>>    btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
>>>>>    echo --- && \
>>>>>    btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA"
>>>>>
>>>>>
>>>>> 1st write:
>>>>>       item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>>>>>           generation 6 transid 6 size 4096 nbytes 4096
>>>>>           block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>>>>>           sequence 1 flags 0x3(NODATASUM|NODATACOW)
>>>>>           atime 1553058460.163985452 (2019-03-20 13:07:40)
>>>>>           ctime 1553058460.163985452 (2019-03-20 13:07:40)
>>>>>           mtime 1553058460.163985452 (2019-03-20 13:07:40)
>>>>>           otime 1553058460.163985452 (2019-03-20 13:07:40)
>>>>> ---
>>>>>       item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>>>>>           generation 6 type 1 (regular)
>>>>>           extent data disk byte 13631488 nr 4096
>>>>>           extent data offset 0 nr 4096 ram 4096
>>>>>           extent compression 0 (none)
>>>>> 2nd write:
>>>>>       item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>>>>>           generation 6 transid 7 size 4096 nbytes 4096
>>>>>           block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>>>>>           sequence 2 flags 0x3(NODATASUM|NODATACOW)
>>>>>           atime 1553058460.163985452 (2019-03-20 13:07:40)
>>>>>           ctime 1553058460.189985450 (2019-03-20 13:07:40)
>>>>>           mtime 1553058460.189985450 (2019-03-20 13:07:40)
>>>>>           otime 1553058460.163985452 (2019-03-20 13:07:40)
>>>>> ---
>>>>>       item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>>>>>           generation 6 type 1 (regular)   <----- [2]
>>>>>           extent data disk byte 13631488 nr 4096
>>>>>           extent data offset 0 nr 4096 ram 4096
>>>>>           extent compression 0 (none)
>>>>>
>>>>>
>>>>> Thanks, Anand
>>>>
>>
> 
> 
> 

  reply	other threads:[~2019-03-21  6:38 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-19 11:35 [PATCH RFC] btrfs: fix read corrpution from disks of different generation Anand Jain
2019-03-19 11:35 ` [PATCH] fstests: btrfs test read " Anand Jain
2020-04-06 12:00   ` [PATCH v2] " Anand Jain
2019-03-19 12:07 ` [PATCH RFC] btrfs: fix read corrpution " Qu Wenruo
2019-03-19 23:41   ` Anand Jain
2019-03-20  1:02     ` Qu Wenruo
2019-03-20  5:47       ` Anand Jain
2019-03-20  6:19         ` Qu Wenruo
2019-03-20 14:00           ` Anand Jain
2019-03-20 14:40             ` Qu Wenruo
2019-03-20 15:40               ` Zygo Blaxell
2019-03-21  6:37                 ` Anand Jain [this message]
2019-03-20  6:27         ` Qu Wenruo
2019-03-20 13:54           ` Anand Jain
2019-03-20 15:46             ` Zygo Blaxell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=00492b06-9c55-985a-72fa-bb181c4137ba@oracle.com \
    --to=anand.jain@oracle.com \
    --cc=ce3g8jdj@umail.furryterror.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=quwenruo.btrfs@gmx.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox