On Wed, Mar 20, 2019 at 09:54:16PM +0800, Anand Jain wrote:
> On 3/20/19 2:27 PM, Qu Wenruo wrote:
> > On 2019/3/20 下午1:47, Anand Jain wrote:
> > > > > > > A tree based integrity verification is important for all
> > > > > > > data, which is missing.
> > > > > > >
> > > > > > > Fix:
> > > > > > > In this RFC patch it proposes to use the same disk from which
> > > > > > > the metadata is read to read the data.
> > > > > >
> > > > > > The obvious problem I found is, the idea only works for RAID1/10.
> > > > > >
> > > > > > For striped profiles it makes no sense, or even has a worse
> > > > > > chance to get stale data.
> > > > > >
> > > > > > To me, the idea of using a possibly better mirror makes some
> > > > > > sense, but it is very profile limited.
> > > > >
> > > > > Yep. This problem and fix is only for the mirror based profiles
> > > > > such as raid1/raid10.
> > > >
> > > > Then the current implementation lacks such a check.
> > > >
> > > > Furthermore, data and metadata can lie in different chunks and have
> > > > different chunk types.
> > >
> > > Right. Current tests for this RFC were only for raid1.
> > > But the final patch can fix that.
> > >
> > > In fact the current patch works for all the cases except when a
> > > replace is running with a mix of metadata:raid1 and data:raid56.
> > >
> > > We need some cleanups in mirror_num; basically we need to bring it
> > > under a #define and handle it accordingly in __btrfs_map_block().
> >
> > Wait for a minute.
> >
> > There is a hidden pitfall from the very beginning.
> >
> > Consider such a chunk layout:
> > Chunk A Type DATA|RAID1
> >   Stripe 1: Dev 1
> >   Stripe 2: Dev 2
> >
> > Chunk B Type METADATA|RAID1
> >   Stripe 1: Dev 2
> >   Stripe 2: Dev 1
>
> Right, this fix can't solve this case.
> So one down. The other choice I had is to

I think it's more than one down.  I think *only* a handful of scenarios
can work.  What happens if there are 4 disks?
Chunk A Type DATA|RAID1
  Stripe 1: Dev 1
  Stripe 2: Dev 2

Chunk B Type METADATA|RAID1
  Stripe 1: Dev 3
  Stripe 2: Dev 4

How do you know if dev 2 lost some data writes, but came back in time to
update the superblock?  Which disk do you plan to read data from if Dev 3
fails a generation check?

> Quarantine the whole disk by comparing dev1 and dev2 generation # in the
> superblock during mount. or
> Quarantine whole chunk(s) by comparing dev1 and dev2 generation # in the
> chunk tree during mount. or
> Always read the eb from both dev1 and dev2; if that fails, then fail its
> corresponding data block as well, if any. or
> During mount compare dev1 and dev2 generation # and record the range of
> generation numbers missing on the disk. But this needs on-disk format
> changes. (But nodatacow, notrunc reg data extent writes must change
> gen#.) or
> Similar to btree eb read pages, nest the extent data read as well.
> (But nodatacow, notrunc reg data extent writes must change gen#.)
>
> Thanks, Anand
>
> > Then when we find stale metadata in chunk B mirror 1, caused by dev 2,
> > your patch considers device 2 stale, and tries to use mirror num 2 to
> > read from the data chunk.
> >
> > However in the data chunk, mirror num 2 means it's still from device 2,
> > and we get stale data.
> >
> > So the eb->mirror_num can still map to a bad/stale device, due to the
> > flexibility provided by btrfs per-chunk mapping.
> >
> > Thanks,
> > Qu
> >
> > > > > > Another idea I get inspired from the idea is, make it more
> > > > > > generic so that a bad/stale device gets a lower priority.
> > > > >
> > > > > When it comes to reading junk data, it's not about the priority,
> > > > > it's about eliminating it. When the problem is only a few blocks,
> > > > > I am against marking the whole disk as bad.
> > > > >
> > > > > > Although it suffers the same problem as I described.
> > > > > >
> > > > > > To make the point short, the use case looks very limited.
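To make the per-chunk mapping pitfall concrete, here is a toy model. This is plain Python and not the actual btrfs code; the chunk table and device names are invented for illustration. It only demonstrates that a mirror number is meaningful per chunk, not per filesystem, so reusing eb->read_mirror for a data chunk can land on the stale device:

```python
# Toy model of btrfs per-chunk RAID1 stripe mapping (illustrative only).
# mirror_num selects a stripe *within a chunk*; the same mirror_num can
# resolve to different physical devices in different chunks.

# Invented chunk table mirroring Qu's example layout above.
chunks = {
    "DATA":     ["dev1", "dev2"],  # stripe 1 -> dev1, stripe 2 -> dev2
    "METADATA": ["dev2", "dev1"],  # stripe 1 -> dev2, stripe 2 -> dev1
}

def map_block(chunk_type, mirror_num):
    """Resolve a (chunk, mirror) pair to a device, 1-based like btrfs."""
    return chunks[chunk_type][mirror_num - 1]

# Metadata mirror 1 is stale because dev2 lost writes, so the reader
# retries with mirror 2 and succeeds: metadata mirror 2 is dev1.
print(map_block("METADATA", 2))  # dev1

# Reusing that mirror number (2) for the *data* chunk does not follow
# the good device: data mirror 2 is dev2, the stale device.
print(map_block("DATA", 2))      # dev2 -- stale data, the pitfall
```

The same two devices are reached by opposite mirror numbers in the two chunks, which is exactly why a mirror number learned from a metadata read cannot be carried over to a data read.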
> > > > > It applies to raid1/raid10 with nodatacow (which implies
> > > > > nodatasum). In my understanding that's not rare.
> > > > >
> > > > > Any comments on the fix offered here?
> > > >
> > > > The implementation question is: is eb->read_mirror reliable?
> > > >
> > > > E.g. if the data and the eb are in different chunks, and the
> > > > staleness happens in the chunk of the eb but not in the data chunk?
> > >
> > > The eb and regular data are indeed always in different chunks. But
> > > the eb can never be stale, as there is a parent transid which is
> > > verified against the read eb. However, we do not have such a check
> > > for the data (this is the core of the issue here) and so we return
> > > the junk data silently.
> > >
> > > Also, any idea why the generation number for the extent data is not
> > > incremented [2] when the -o nodatacow and notrunc options are used?
> > > Is it a bug? The dump-tree output is taken with the script as below
> > > [1]. (This corruption is seen whether or not the generation number
> > > is incremented, but as another way to fix the corruption we can
> > > verify the inode EXTENT_DATA generation from the same disk from
> > > which the data is read.)
> > > [1]
> > > umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb && \
> > > mount -o notreelog,max_inline=0,nodatasum /dev/sdb /btrfs && \
> > > echo 1st write: && \
> > > dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1 \
> > >    conv=fsync,notrunc && sync && \
> > > btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
> > > echo --- && \
> > > btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA" && \
> > > echo 2nd write: && \
> > > dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1 \
> > >    conv=fsync,notrunc && sync && \
> > > btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
> > > echo --- && \
> > > btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA"
> > >
> > > 1st write:
> > >     item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
> > >         generation 6 transid 6 size 4096 nbytes 4096
> > >         block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
> > >         sequence 1 flags 0x3(NODATASUM|NODATACOW)
> > >         atime 1553058460.163985452 (2019-03-20 13:07:40)
> > >         ctime 1553058460.163985452 (2019-03-20 13:07:40)
> > >         mtime 1553058460.163985452 (2019-03-20 13:07:40)
> > >         otime 1553058460.163985452 (2019-03-20 13:07:40)
> > > ---
> > >     item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
> > >         generation 6 type 1 (regular)
> > >         extent data disk byte 13631488 nr 4096
> > >         extent data offset 0 nr 4096 ram 4096
> > >         extent compression 0 (none)
> > > 2nd write:
> > >     item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
> > >         generation 6 transid 7 size 4096 nbytes 4096
> > >         block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
> > >         sequence 2 flags 0x3(NODATASUM|NODATACOW)
> > >         atime 1553058460.163985452 (2019-03-20 13:07:40)
> > >         ctime 1553058460.189985450 (2019-03-20 13:07:40)
> > >         mtime 1553058460.189985450 (2019-03-20 13:07:40)
> > >         otime 1553058460.163985452 (2019-03-20 13:07:40)
> > > ---
> > >     item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
> > >         generation 6 type 1 (regular)   <----- [2]
> > >         extent data disk byte 13631488 nr 4096
> > >         extent data offset 0 nr 4096 ram 4096
> > >         extent compression 0 (none)
> > >
> > > Thanks, Anand
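For comparison, the first "quarantine whole disk" option discussed above can be pictured with a minimal sketch. This is not the real btrfs on-disk superblock handling; the per-device generation values are invented, and it only illustrates the idea of distrusting any device whose recorded generation lags the newest one at mount time:

```python
# Invented per-device superblock generations; a device that missed
# writes and came back would have recorded an older generation.
superblocks = {"dev1": 107, "dev2": 103}  # dev2 is behind

def stale_devices(sbs):
    """Quarantine every device whose generation lags the newest one."""
    latest = max(sbs.values())
    return sorted(dev for dev, gen in sbs.items() if gen < latest)

print(stale_devices(superblocks))  # ['dev2']
```

This sketch also shows the granularity problem Anand raises: a single lagging generation number condemns the whole device, even when only a few blocks are actually stale, which is what the per-chunk and per-read alternatives in the list above try to avoid.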