Linux Btrfs filesystem development
 help / color / mirror / Atom feed
* [PATCH] btrfs: Always try all copies when reading extent buffers
@ 2018-11-06 14:40 Nikolay Borisov
  2018-11-06 14:53 ` Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Nikolay Borisov @ 2018-11-06 14:40 UTC (permalink / raw)
  To: linux-btrfs; +Cc: wqu, josef, Nikolay Borisov

When a metadata read is served the endio routine btree_readpage_end_io_hook
is called which eventually runs the tree-checker. If tree-checker fails
to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
leads to btree_read_extent_buffer_pages wrongly assuming that all
available copies of this extent buffer are wrong and failing prematurely.
Fix this modify btree_read_extent_buffer_pages to read all copies of
the data.

This failure was exhibitted in xfstests btrfs/124 which would
spuriously fail its balance operations. The reason was that when balance
was run following re-introduction of the missing raid1 disk
__btrfs_map_block would map the read request to stripe 0, which
corresponded to devid 2 (the disk which is being removed in the test):

    item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
	io_align 65536 io_width 65536 sector_size 4096
	num_stripes 2 sub_stripes 1
		stripe 0 devid 2 offset 2156920832
		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
		stripe 1 devid 1 offset 3553624064
		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca

This caused read requests for a checksum item that to be routed to the
stale disk which triggered the aforementioned logic involving
EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
the balance operation.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Suggested-by: Qu Wenruo <wqu@suse.com>
Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
---
 fs/btrfs/disk-io.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 00ee5e37e989..279c6dbcc736 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
 	int mirror_num = 0;
 	int failed_mirror = 0;
 
-	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
 	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
 	while (1) {
+		clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
 		ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
 					       mirror_num);
 		if (!ret) {
@@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
 				break;
 		}
 
-		/*
-		 * This buffer's crc is fine, but its contents are corrupted, so
-		 * there is no reason to read the other copies, they won't be
-		 * any less wrong.
-		 */
-		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
-		    ret == -EUCLEAN)
-			break;
-
 		num_copies = btrfs_num_copies(fs_info,
 					      eb->start, eb->len);
 		if (num_copies == 1)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] btrfs: Always try all copies when reading extent buffers
  2018-11-06 14:40 [PATCH] btrfs: Always try all copies when reading extent buffers Nikolay Borisov
@ 2018-11-06 14:53 ` Qu Wenruo
  2018-11-06 15:14   ` Nikolay Borisov
  2018-11-06 16:07 ` Nikolay Borisov
  2018-11-12 21:30 ` David Sterba
  2 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2018-11-06 14:53 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs; +Cc: wqu, josef



On 2018/11/6 下午10:40, Nikolay Borisov wrote:
> When a metadata read is served the endio routine btree_readpage_end_io_hook
> is called which eventually runs the tree-checker. If tree-checker fails
> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
> leads to btree_read_extent_buffer_pages wrongly assuming that all
> available copies of this extent buffer are wrong and failing prematurely.
> Fix this modify btree_read_extent_buffer_pages to read all copies of
> the data.
> 
> This failure was exhibitted in xfstests btrfs/124 which would
> spuriously fail its balance operations. The reason was that when balance
> was run following re-introduction of the missing raid1 disk
> __btrfs_map_block would map the read request to stripe 0, which
> corresponded to devid 2 (the disk which is being removed in the test):
> 
>     item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
> 	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
> 	io_align 65536 io_width 65536 sector_size 4096
> 	num_stripes 2 sub_stripes 1
> 		stripe 0 devid 2 offset 2156920832
> 		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
> 		stripe 1 devid 1 offset 3553624064
> 		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
> 
> This caused read requests for a checksum item that to be routed to the
> stale disk which triggered the aforementioned logic involving
> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
> the balance operation.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Suggested-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

However there is still a tiny piece missing.

Tree checker is done after some basic checks, including:
1) bytenr
2) level
3) fsid
4) csum

1~2) can be easily skipped just by pure luck.

But 3) and especially 4) are not that easy to hit.
Not to mention meeting both 3) and 4), since csum range covers fsid.

So I must say you're a really super lucky guy!

Thanks,
Qu


> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
> ---
>  fs/btrfs/disk-io.c | 11 +----------
>  1 file changed, 1 insertion(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 00ee5e37e989..279c6dbcc736 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>  	int mirror_num = 0;
>  	int failed_mirror = 0;
>  
> -	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>  	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
>  	while (1) {
> +		clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>  		ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
>  					       mirror_num);
>  		if (!ret) {
> @@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>  				break;
>  		}
>  
> -		/*
> -		 * This buffer's crc is fine, but its contents are corrupted, so
> -		 * there is no reason to read the other copies, they won't be
> -		 * any less wrong.
> -		 */
> -		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
> -		    ret == -EUCLEAN)
> -			break;
> -
>  		num_copies = btrfs_num_copies(fs_info,
>  					      eb->start, eb->len);
>  		if (num_copies == 1)
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] btrfs: Always try all copies when reading extent buffers
  2018-11-06 14:53 ` Qu Wenruo
@ 2018-11-06 15:14   ` Nikolay Borisov
  2018-11-07  0:18     ` Qu Wenruo
  0 siblings, 1 reply; 7+ messages in thread
From: Nikolay Borisov @ 2018-11-06 15:14 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: wqu, josef



On 6.11.18 г. 16:53 ч., Qu Wenruo wrote:
> 
> 
> On 2018/11/6 下午10:40, Nikolay Borisov wrote:
>> When a metadata read is served the endio routine btree_readpage_end_io_hook
>> is called which eventually runs the tree-checker. If tree-checker fails
>> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
>> leads to btree_read_extent_buffer_pages wrongly assuming that all
>> available copies of this extent buffer are wrong and failing prematurely.
>> Fix this modify btree_read_extent_buffer_pages to read all copies of
>> the data.
>>
>> This failure was exhibitted in xfstests btrfs/124 which would
>> spuriously fail its balance operations. The reason was that when balance
>> was run following re-introduction of the missing raid1 disk
>> __btrfs_map_block would map the read request to stripe 0, which
>> corresponded to devid 2 (the disk which is being removed in the test):
>>
>>     item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
>> 	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
>> 	io_align 65536 io_width 65536 sector_size 4096
>> 	num_stripes 2 sub_stripes 1
>> 		stripe 0 devid 2 offset 2156920832
>> 		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
>> 		stripe 1 devid 1 offset 3553624064
>> 		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
>>
>> This caused read requests for a checksum item that to be routed to the
>> stale disk which triggered the aforementioned logic involving
>> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
>> the balance operation.
>>
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> Suggested-by: Qu Wenruo <wqu@suse.com>
> 
> Reviewed-by: Qu Wenruo <wqu@suse.com>
> 
> However there is still a tiny piece missing.
> 
> Tree checker is done after some basic checks, including:
> 1) bytenr
> 2) level
> 3) fsid
> 4) csum
> 
> 1~2) can be easily skipped just by pure luck.
> 
> But 3) and especially 4) are not that easy to hit.
> Not to mention meeting both 3) and 4), since csum range covers fsid.
> 
> So I must say you're a really super lucky guy!

s/lucky/unlucky/ :)

I wonder if reads just land in some random memory on the stale disk
hence it's really a matter of what's "there" so that reads fail with
random reasons?

> 
> Thanks,
> Qu
> 
> 
>> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
>> ---
>>  fs/btrfs/disk-io.c | 11 +----------
>>  1 file changed, 1 insertion(+), 10 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 00ee5e37e989..279c6dbcc736 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>>  	int mirror_num = 0;
>>  	int failed_mirror = 0;
>>  
>> -	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>>  	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
>>  	while (1) {
>> +		clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>>  		ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
>>  					       mirror_num);
>>  		if (!ret) {
>> @@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>>  				break;
>>  		}
>>  
>> -		/*
>> -		 * This buffer's crc is fine, but its contents are corrupted, so
>> -		 * there is no reason to read the other copies, they won't be
>> -		 * any less wrong.
>> -		 */
>> -		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
>> -		    ret == -EUCLEAN)
>> -			break;
>> -
>>  		num_copies = btrfs_num_copies(fs_info,
>>  					      eb->start, eb->len);
>>  		if (num_copies == 1)
>>
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] btrfs: Always try all copies when reading extent buffers
  2018-11-06 14:40 [PATCH] btrfs: Always try all copies when reading extent buffers Nikolay Borisov
  2018-11-06 14:53 ` Qu Wenruo
@ 2018-11-06 16:07 ` Nikolay Borisov
  2018-11-07  0:23   ` Qu Wenruo
  2018-11-12 21:30 ` David Sterba
  2 siblings, 1 reply; 7+ messages in thread
From: Nikolay Borisov @ 2018-11-06 16:07 UTC (permalink / raw)
  To: linux-btrfs; +Cc: wqu, josef



On 6.11.18 г. 16:40 ч., Nikolay Borisov wrote:
> When a metadata read is served the endio routine btree_readpage_end_io_hook
> is called which eventually runs the tree-checker. If tree-checker fails
> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
> leads to btree_read_extent_buffer_pages wrongly assuming that all
> available copies of this extent buffer are wrong and failing prematurely.
> Fix this modify btree_read_extent_buffer_pages to read all copies of
> the data.
> 
> This failure was exhibitted in xfstests btrfs/124 which would
> spuriously fail its balance operations. The reason was that when balance
> was run following re-introduction of the missing raid1 disk
> __btrfs_map_block would map the read request to stripe 0, which
> corresponded to devid 2 (the disk which is being removed in the test):
> 
>     item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
> 	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
> 	io_align 65536 io_width 65536 sector_size 4096
> 	num_stripes 2 sub_stripes 1
> 		stripe 0 devid 2 offset 2156920832
> 		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
> 		stripe 1 devid 1 offset 3553624064
> 		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
> 
> This caused read requests for a checksum item that to be routed to the
> stale disk which triggered the aforementioned logic involving
> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
> the balance operation.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Suggested-by: Qu Wenruo <wqu@suse.com>
> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
> ---
>  fs/btrfs/disk-io.c | 11 +----------
>  1 file changed, 1 insertion(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 00ee5e37e989..279c6dbcc736 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>  	int mirror_num = 0;
>  	int failed_mirror = 0;
>  
> -	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>  	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
>  	while (1) {
> +		clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>  		ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
>  					       mirror_num);
>  		if (!ret) {

Qu, 

Do you think it makes sense to do refactoring like below in
 a follow up patch: 

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 279c6dbcc736..9891e13a2b6f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -482,16 +482,11 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
                clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
                ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
                                               mirror_num);
-               if (!ret) {
-                       if (verify_parent_transid(io_tree, eb,
-                                                  parent_transid, 0))
-                               ret = -EIO;
-                       else if (verify_level_key(fs_info, eb, level,
-                                                 first_key, parent_transid))
-                               ret = -EUCLEAN;
-                       else
+               if (!ret &&
+                   !verify_parent_transid(io_tree, eb, parent_transid, 0) &&
+                   !verify_level_key(fs_info, eb, level, first_key,
+                                    parent_transid))
                                break;
-               }


since the ret value doesn't really have any meaning or perhaps the 
verify_level_key and ret = -EUCLEAN could be reteinaed as well as the 
if (ret == EUCLEAN) break logic ? 

> @@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>  				break;
>  		}
>  
> -		/*
> -		 * This buffer's crc is fine, but its contents are corrupted, so
> -		 * there is no reason to read the other copies, they won't be
> -		 * any less wrong.
> -		 */
> -		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
> -		    ret == -EUCLEAN)
> -			break;
> -
>  		num_copies = btrfs_num_copies(fs_info,
>  					      eb->start, eb->len);
>  		if (num_copies == 1)
> 

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] btrfs: Always try all copies when reading extent buffers
  2018-11-06 15:14   ` Nikolay Borisov
@ 2018-11-07  0:18     ` Qu Wenruo
  0 siblings, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2018-11-07  0:18 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs; +Cc: wqu, josef



On 2018/11/6 下午11:14, Nikolay Borisov wrote:
> 
> 
> On 6.11.18 г. 16:53 ч., Qu Wenruo wrote:
>>
>>
>> On 2018/11/6 下午10:40, Nikolay Borisov wrote:
>>> When a metadata read is served the endio routine btree_readpage_end_io_hook
>>> is called which eventually runs the tree-checker. If tree-checker fails
>>> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
>>> leads to btree_read_extent_buffer_pages wrongly assuming that all
>>> available copies of this extent buffer are wrong and failing prematurely.
>>> Fix this modify btree_read_extent_buffer_pages to read all copies of
>>> the data.
>>>
>>> This failure was exhibitted in xfstests btrfs/124 which would
>>> spuriously fail its balance operations. The reason was that when balance
>>> was run following re-introduction of the missing raid1 disk
>>> __btrfs_map_block would map the read request to stripe 0, which
>>> corresponded to devid 2 (the disk which is being removed in the test):
>>>
>>>     item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
>>> 	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
>>> 	io_align 65536 io_width 65536 sector_size 4096
>>> 	num_stripes 2 sub_stripes 1
>>> 		stripe 0 devid 2 offset 2156920832
>>> 		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
>>> 		stripe 1 devid 1 offset 3553624064
>>> 		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
>>>
>>> This caused read requests for a checksum item that to be routed to the
>>> stale disk which triggered the aforementioned logic involving
>>> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
>>> the balance operation.
>>>
>>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>>> Suggested-by: Qu Wenruo <wqu@suse.com>
>>
>> Reviewed-by: Qu Wenruo <wqu@suse.com>
>>
>> However there is still a tiny piece missing.
>>
>> Tree checker is done after some basic checks, including:
>> 1) bytenr
>> 2) level
>> 3) fsid
>> 4) csum
>>
>> 1~2) can be easily skipped just by pure luck.
>>
>> But 3) and especially 4) are not that easy to hit.
>> Not to mention meeting both 3) and 4), since csum range covers fsid.
>>
>> So I must say you're a really super lucky guy!
> 
> s/lucky/unlucky/ :)
> 
> I wonder if reads just land in some random memory on the stale disk
> hence it's really a matter of what's "there" so that reads fail with
> random reasons?

Even for random memory, you're still too lucky to hit it.

Thanks,
Qu

> 
>>
>> Thanks,
>> Qu
>>
>>
>>> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
>>> ---
>>>  fs/btrfs/disk-io.c | 11 +----------
>>>  1 file changed, 1 insertion(+), 10 deletions(-)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 00ee5e37e989..279c6dbcc736 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>>>  	int mirror_num = 0;
>>>  	int failed_mirror = 0;
>>>  
>>> -	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>>>  	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
>>>  	while (1) {
>>> +		clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>>>  		ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
>>>  					       mirror_num);
>>>  		if (!ret) {
>>> @@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>>>  				break;
>>>  		}
>>>  
>>> -		/*
>>> -		 * This buffer's crc is fine, but its contents are corrupted, so
>>> -		 * there is no reason to read the other copies, they won't be
>>> -		 * any less wrong.
>>> -		 */
>>> -		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
>>> -		    ret == -EUCLEAN)
>>> -			break;
>>> -
>>>  		num_copies = btrfs_num_copies(fs_info,
>>>  					      eb->start, eb->len);
>>>  		if (num_copies == 1)
>>>
>>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] btrfs: Always try all copies when reading extent buffers
  2018-11-06 16:07 ` Nikolay Borisov
@ 2018-11-07  0:23   ` Qu Wenruo
  0 siblings, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2018-11-07  0:23 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs; +Cc: josef



On 2018/11/7 上午12:07, Nikolay Borisov wrote:
> 
> 
> On 6.11.18 г. 16:40 ч., Nikolay Borisov wrote:
>> When a metadata read is served the endio routine btree_readpage_end_io_hook
>> is called which eventually runs the tree-checker. If tree-checker fails
>> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
>> leads to btree_read_extent_buffer_pages wrongly assuming that all
>> available copies of this extent buffer are wrong and failing prematurely.
>> Fix this modify btree_read_extent_buffer_pages to read all copies of
>> the data.
>>
>> This failure was exhibitted in xfstests btrfs/124 which would
>> spuriously fail its balance operations. The reason was that when balance
>> was run following re-introduction of the missing raid1 disk
>> __btrfs_map_block would map the read request to stripe 0, which
>> corresponded to devid 2 (the disk which is being removed in the test):
>>
>>     item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
>> 	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
>> 	io_align 65536 io_width 65536 sector_size 4096
>> 	num_stripes 2 sub_stripes 1
>> 		stripe 0 devid 2 offset 2156920832
>> 		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
>> 		stripe 1 devid 1 offset 3553624064
>> 		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
>>
>> This caused read requests for a checksum item that to be routed to the
>> stale disk which triggered the aforementioned logic involving
>> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
>> the balance operation.
>>
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> Suggested-by: Qu Wenruo <wqu@suse.com>
>> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
>> ---
>>  fs/btrfs/disk-io.c | 11 +----------
>>  1 file changed, 1 insertion(+), 10 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 00ee5e37e989..279c6dbcc736 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>>  	int mirror_num = 0;
>>  	int failed_mirror = 0;
>>  
>> -	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>>  	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
>>  	while (1) {
>> +		clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>>  		ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
>>  					       mirror_num);
>>  		if (!ret) {
> 
> Qu, 
> 
> Do you think it makes sense to do refactoring like below in
>  a follow up patch: 
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 279c6dbcc736..9891e13a2b6f 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -482,16 +482,11 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>                 clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>                 ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
>                                                mirror_num);
> -               if (!ret) {
> -                       if (verify_parent_transid(io_tree, eb,
> -                                                  parent_transid, 0))
> -                               ret = -EIO;
> -                       else if (verify_level_key(fs_info, eb, level,
> -                                                 first_key, parent_transid))
> -                               ret = -EUCLEAN;
> -                       else
> +               if (!ret &&
> +                   !verify_parent_transid(io_tree, eb, parent_transid, 0) &&
> +                   !verify_level_key(fs_info, eb, level, first_key,
> +                                    parent_transid))
>                                 break;
> -               }
> 
> 
> since the ret value doesn't really have any meaning or perhaps the 
> verify_level_key and ret = -EUCLEAN could be reteinaed as well as the 
> if (ret == EUCLEAN) break logic ? 

Yes, that's a valid cleanup.

Thanks,
Qu

> 
>> @@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>>  				break;
>>  		}
>>  
>> -		/*
>> -		 * This buffer's crc is fine, but its contents are corrupted, so
>> -		 * there is no reason to read the other copies, they won't be
>> -		 * any less wrong.
>> -		 */
>> -		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
>> -		    ret == -EUCLEAN)
>> -			break;
>> -
>>  		num_copies = btrfs_num_copies(fs_info,
>>  					      eb->start, eb->len);
>>  		if (num_copies == 1)
>>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] btrfs: Always try all copies when reading extent buffers
  2018-11-06 14:40 [PATCH] btrfs: Always try all copies when reading extent buffers Nikolay Borisov
  2018-11-06 14:53 ` Qu Wenruo
  2018-11-06 16:07 ` Nikolay Borisov
@ 2018-11-12 21:30 ` David Sterba
  2 siblings, 0 replies; 7+ messages in thread
From: David Sterba @ 2018-11-12 21:30 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs, wqu, josef

On Tue, Nov 06, 2018 at 04:40:20PM +0200, Nikolay Borisov wrote:
> When a metadata read is served the endio routine btree_readpage_end_io_hook
> is called which eventually runs the tree-checker. If tree-checker fails
> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
> leads to btree_read_extent_buffer_pages wrongly assuming that all
> available copies of this extent buffer are wrong and failing prematurely.
> Fix this modify btree_read_extent_buffer_pages to read all copies of
> the data.
> 
> This failure was exhibitted in xfstests btrfs/124 which would
> spuriously fail its balance operations. The reason was that when balance
> was run following re-introduction of the missing raid1 disk
> __btrfs_map_block would map the read request to stripe 0, which
> corresponded to devid 2 (the disk which is being removed in the test):
> 
>     item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
> 	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
> 	io_align 65536 io_width 65536 sector_size 4096
> 	num_stripes 2 sub_stripes 1
> 		stripe 0 devid 2 offset 2156920832
> 		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
> 		stripe 1 devid 1 offset 3553624064
> 		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
> 
> This caused read requests for a checksum item that to be routed to the
> stale disk which triggered the aforementioned logic involving
> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
> the balance operation.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Suggested-by: Qu Wenruo <wqu@suse.com>
> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")

Added to misc-next, thanks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-11-12 21:31 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-11-06 14:40 [PATCH] btrfs: Always try all copies when reading extent buffers Nikolay Borisov
2018-11-06 14:53 ` Qu Wenruo
2018-11-06 15:14   ` Nikolay Borisov
2018-11-07  0:18     ` Qu Wenruo
2018-11-06 16:07 ` Nikolay Borisov
2018-11-07  0:23   ` Qu Wenruo
2018-11-12 21:30 ` David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox