Re: [PATCH RFC] btrfs: fix read corrpution from disks of different generation

public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Anand Jain <anand.jain@oracle.com>, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH RFC] btrfs: fix read corrpution from disks of different generation
Date: Tue, 19 Mar 2019 20:07:47 +0800	[thread overview]
Message-ID: <b458f22f-9833-27bd-dd06-ba2862aa59c4@gmx.com> (raw)
In-Reply-To: <1552995330-28927-1-git-send-email-anand.jain@oracle.com>


[-- Attachment #1.1: Type: text/plain, Size: 9590 bytes --]



On 2019/3/19 下午7:35, Anand Jain wrote:
>    In case of file regular extent (non inline), the metadata and data are 
>  read from two different IO operations. When we read the metadata using 
>  the btree each extent block gets verified with the expected  transid as 
>  per its parent. So suppose if any of the block is stale it gets reported 
>  and corrected by reading from the mirror (consider raid1).
>  
>    This also means the data extent is not yet read or verified and this 
>  happens when application tries to read the file.
>  When application tries to read the file btrfs_readpage(), it shall first 
>  read file extent map as provided by the metadata in the above operation 
>  (which is assured to be consistent).
>  And when file data has to be read using the above filemap as it again 
>  has the choice of using either of the mirrors, suppose if the IO is 
>  routed to the disk on which we haven't verified the metadata on it and 
>  if there is write hole on that disk, then we shall be reading the junk 
>  data without being reported anywhere.
> 
>    This is reproducible with the test case [1] in this email thread.
>      [1]
>      fstests: btrfs test read from disks of different generation
> 
>    As we have data csum most of these get caught and reported as csum 
>  error.

So far so good.

>  But csum verification is a point in verification and its not a 
>  tree based transid verification. Which means if there is a stale data 
>  with matching csum we may return a junk data silently.

Then the normal idea is to use stronger but slower csum in the first
place, to avoid the csum match case.

>  This problem is 
>  easily reproducible when csum is disabled but not impossible to achieve 
>  when csum is not disabled as well.

Under this case, it's the user to be blamed for the decision to disable
the csum in the first place.

> A tree based integrity verification 
>  is important for all data, which is missing.
>  
>  Fix:
>    In this RFC patch it proposes to use same disk from with the metadata 
>  is read to read the data.

The obvious problem I found is, the idea only works for RAID1/10.

For striped profile it makes no sense, or even have a worse chance to
get stale data.


To me, the idea of using possible better mirror makes some sense, but
very profile limited.


Another idea I get inspired from the idea is, make it more generic so
that bad/stale device get a lower priority.
Although it suffers the same problem as I described.

To make the point short, the use case looks very limited.

Thanks,
Qu

> Which means the feature such as readmirror is 
>  less effective (which is fine). But if the good data fails (media error) 
>  as usual we shall read the mirrored data without being aware that this 
>  data may be stale. To fix that, we have to invalidate the corresponding 
>  metadata and re-read the metadata from other disk (this part is not 
>  implemented in this patch).
>  
>     The other better solution is - a tree based verification, which means 
>  we have to verify the data-extent's metadata in the extent tree when its 
>  read - but there is a problem we don't update the generation number in 
>  the extent tree for all types of write, for example consider overwrite 
>  with notrunc option on a filesystem mounted with nodatacow option [1], 
>  which I wonder if its a bug or if its like that for a reason? as because 
>  there is no cow?
>  
>  [1] The EXTENT_DATA generation is same before and after dd with notrunc
>  
>    mkfs.btrfs -fq /dev/sdb && mount -o notreelog,nodatacow /dev/sdb /btrfs
>    dd if=/dev/urandom of=/btrfs/tf1 bs=4096 count=1 conv=fsync,notrunc && 
>  sync
>    btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA 0"
>    btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) itemoff"
>    dd if=/dev/urandom of=/btrfs/tf1 bs=4096 count=1 conv=fsync,notrunc && 
>  sync
>    btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA 0"
>    btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) itemoff"
>  
>  ---------
>       item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
>           generation 6 type 1 (regular)
>           extent data disk byte 13631488 nr 4096
>           extent data offset 0 nr 4096 ram 4096
>           extent compression 0 (none)
>       item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160
>           generation 6 transid 6 size 4096 nbytes 4096
>           block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>           sequence 66785 flags 0x3(NODATASUM|NODATACOW)
>           atime 1552993569.276079763 (2019-03-19 19:06:09)
>           ctime 1552993569.276079763 (2019-03-19 19:06:09)
>           mtime 1552993569.276079763 (2019-03-19 19:06:09)
>           otime 1552993569.276079763 (2019-03-19 19:06:09)
>  1+0 records in
>  1+0 records out
>  4096 bytes (4.1 kB) copied, 0.00302717 s, 1.4 MB/s
>       item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
>           generation 6 type 1 (regular)
>           extent data disk byte 13631488 nr 4096
>           extent data offset 0 nr 4096 ram 4096
>           extent compression 0 (none)
>       item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160
>           generation 6 transid 7 size 4096 nbytes 4096
>           block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>           sequence 66786 flags 0x3(NODATASUM|NODATACOW)
>           atime 1552993569.276079763 (2019-03-19 19:06:09)
>           ctime 1552993569.302079763 (2019-03-19 19:06:09)
>           mtime 1552993569.302079763 (2019-03-19 19:06:09)
>           otime 1552993569.276079763 (2019-03-19 19:06:09)
>  -------
>  
>    Comments appreciated.
> 
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
>  fs/btrfs/extent_io.c | 41 ++++++++++++++++++++++++++++++++++-------
>  1 file changed, 34 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 31892052a24f..f1cc21e8dff7 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3079,7 +3079,7 @@ static inline void __do_contiguous_readpages(struct extent_io_tree *tree,
>  					     struct extent_map **em_cached,
>  					     struct bio **bio,
>  					     unsigned long *bio_flags,
> -					     u64 *prev_em_start)
> +					     u64 *prev_em_start, int mirror)
>  {
>  	struct inode *inode;
>  	struct btrfs_ordered_extent *ordered;
> @@ -3099,7 +3099,7 @@ static inline void __do_contiguous_readpages(struct extent_io_tree *tree,
>  
>  	for (index = 0; index < nr_pages; index++) {
>  		__do_readpage(tree, pages[index], btrfs_get_extent, em_cached,
> -				bio, 0, bio_flags, REQ_RAHEAD, prev_em_start);
> +				bio, mirror, bio_flags, REQ_RAHEAD, prev_em_start);
>  		put_page(pages[index]);
>  	}
>  }
> @@ -3109,7 +3109,7 @@ static void __extent_readpages(struct extent_io_tree *tree,
>  			       int nr_pages,
>  			       struct extent_map **em_cached,
>  			       struct bio **bio, unsigned long *bio_flags,
> -			       u64 *prev_em_start)
> +			       u64 *prev_em_start, int mirror)
>  {
>  	u64 start = 0;
>  	u64 end = 0;
> @@ -3130,7 +3130,7 @@ static void __extent_readpages(struct extent_io_tree *tree,
>  						  index - first_index, start,
>  						  end, em_cached,
>  						  bio, bio_flags,
> -						  prev_em_start);
> +						  prev_em_start, mirror);
>  			start = page_start;
>  			end = start + PAGE_SIZE - 1;
>  			first_index = index;
> @@ -3141,7 +3141,7 @@ static void __extent_readpages(struct extent_io_tree *tree,
>  		__do_contiguous_readpages(tree, &pages[first_index],
>  					  index - first_index, start,
>  					  end, em_cached, bio,
> -					  bio_flags, prev_em_start);
> +					  bio_flags, prev_em_start, mirror);
>  }
>  
>  static int __extent_read_full_page(struct extent_io_tree *tree,
> @@ -4093,6 +4093,27 @@ int extent_writepages(struct address_space *mapping,
>  	return ret;
>  }
>  
> +static int btrfs_get_read_mirror(struct inode *inode)
> +{
> +	int ret;
> +	struct btrfs_path *path;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path,
> +				       btrfs_ino(BTRFS_I(inode)), 0, 0);
> +	if (!ret) {
> +		struct extent_buffer *eb = path->nodes[0];
> +
> +		ret = eb->read_mirror;
> +	}
> +
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
>  int extent_readpages(struct address_space *mapping, struct list_head *pages,
>  		     unsigned nr_pages)
>  {
> @@ -4103,6 +4124,11 @@ int extent_readpages(struct address_space *mapping, struct list_head *pages,
>  	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
>  	int nr = 0;
>  	u64 prev_em_start = (u64)-1;
> +	int mirror;
> +
> +	mirror = btrfs_get_read_mirror(mapping->host);
> +	if (mirror < 0)
> +		return mirror;
>  
>  	while (!list_empty(pages)) {
>  		for (nr = 0; nr < ARRAY_SIZE(pagepool) && !list_empty(pages);) {
> @@ -4120,14 +4146,15 @@ int extent_readpages(struct address_space *mapping, struct list_head *pages,
>  		}
>  
>  		__extent_readpages(tree, pagepool, nr, &em_cached, &bio,
> -				   &bio_flags, &prev_em_start);
> +				   &bio_flags, &prev_em_start, mirror);
> +
>  	}
>  
>  	if (em_cached)
>  		free_extent_map(em_cached);
>  
>  	if (bio)
> -		return submit_one_bio(bio, 0, bio_flags);
> +		return submit_one_bio(bio, mirror, bio_flags);
>  	return 0;
>  }
>  
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

next prev parent reply	other threads:[~2019-03-19 12:08 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-19 11:35 [PATCH RFC] btrfs: fix read corrpution from disks of different generation Anand Jain
2019-03-19 11:35 ` [PATCH] fstests: btrfs test read " Anand Jain
2020-04-06 12:00   ` [PATCH v2] " Anand Jain
2019-03-19 12:07 ` Qu Wenruo [this message]
2019-03-19 23:41   ` [PATCH RFC] btrfs: fix read corrpution " Anand Jain
2019-03-20  1:02     ` Qu Wenruo
2019-03-20  5:47       ` Anand Jain
2019-03-20  6:19         ` Qu Wenruo
2019-03-20 14:00           ` Anand Jain
2019-03-20 14:40             ` Qu Wenruo
2019-03-20 15:40               ` Zygo Blaxell
2019-03-21  6:37                 ` Anand Jain
2019-03-20  6:27         ` Qu Wenruo
2019-03-20 13:54           ` Anand Jain
2019-03-20 15:46             ` Zygo Blaxell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b458f22f-9833-27bd-dd06-ba2862aa59c4@gmx.com \
    --to=quwenruo.btrfs@gmx.com \
    --cc=anand.jain@oracle.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox