FAILED: patch "[PATCH] btrfs: Always try all copies when reading extent buffers" failed to apply to 4.14-stable tree

All of lore.kernel.org
 help / color / mirror / Atom feed

* FAILED: patch "[PATCH] btrfs: Always try all copies when reading extent buffers" failed to apply to 4.14-stable tree
@ 2018-12-03 10:02 gregkh
  2018-12-04 12:47 ` [PATCH] btrfs: Always try all copies when reading extent buffers Nikolay Borisov
  0 siblings, 1 reply; 15+ messages in thread
From: gregkh @ 2018-12-03 10:02 UTC (permalink / raw)
  To: nborisov, dsterba, wqu; +Cc: stable

The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

thanks,

greg k-h

------------------ original commit in Linus's tree ------------------

>From f8397d69daef06d358430d3054662fb597e37c00 Mon Sep 17 00:00:00 2001
From: Nikolay Borisov <nborisov@suse.com>
Date: Tue, 6 Nov 2018 16:40:20 +0200
Subject: [PATCH] btrfs: Always try all copies when reading extent buffers

When a metadata read is served the endio routine btree_readpage_end_io_hook
is called which eventually runs the tree-checker. If tree-checker fails
to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
leads to btree_read_extent_buffer_pages wrongly assuming that all
available copies of this extent buffer are wrong and failing prematurely.
Fix this modify btree_read_extent_buffer_pages to read all copies of
the data.

This failure was exhibitted in xfstests btrfs/124 which would
spuriously fail its balance operations. The reason was that when balance
was run following re-introduction of the missing raid1 disk
__btrfs_map_block would map the read request to stripe 0, which
corresponded to devid 2 (the disk which is being removed in the test):

    item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
	io_align 65536 io_width 65536 sector_size 4096
	num_stripes 2 sub_stripes 1
		stripe 0 devid 2 offset 2156920832
		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
		stripe 1 devid 1 offset 3553624064
		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca

This caused read requests for a checksum item that to be routed to the
stale disk which triggered the aforementioned logic involving
EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
the balance operation.

Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
CC: stable@vger.kernel.org # 4.4+
Suggested-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3f0b6d1936e8..6d776717d8b3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
 	int mirror_num = 0;
 	int failed_mirror = 0;

-	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
 	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
 	while (1) {
+		clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
 		ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
 					       mirror_num);
 		if (!ret) {
@@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
 				break;
 		}

-		/*
-		 * This buffer's crc is fine, but its contents are corrupted, so
-		 * there is no reason to read the other copies, they won't be
-		 * any less wrong.
-		 */
-		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
-		    ret == -EUCLEAN)
-			break;
-
 		num_copies = btrfs_num_copies(fs_info,
 					      eb->start, eb->len);
 		if (num_copies == 1)

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH] btrfs: Always try all copies when reading extent buffers
  2018-12-03 10:02 FAILED: patch "[PATCH] btrfs: Always try all copies when reading extent buffers" failed to apply to 4.14-stable tree gregkh
@ 2018-12-04 12:47 ` Nikolay Borisov
  2018-12-06 10:47   ` Greg KH
  0 siblings, 1 reply; 15+ messages in thread
From: Nikolay Borisov @ 2018-12-04 12:47 UTC (permalink / raw)
  To: gregkh; +Cc: stable, dsterba, wqu, Nikolay Borisov

When a metadata read is served the endio routine btree_readpage_end_io_hook
is called which eventually runs the tree-checker. If tree-checker fails
to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
leads to btree_read_extent_buffer_pages wrongly assuming that all
available copies of this extent buffer are wrong and failing prematurely.
Fix this modify btree_read_extent_buffer_pages to read all copies of
the data.

This failure was exhibitted in xfstests btrfs/124 which would
spuriously fail its balance operations. The reason was that when balance
was run following re-introduction of the missing raid1 disk
__btrfs_map_block would map the read request to stripe 0, which
corresponded to devid 2 (the disk which is being removed in the test):

    item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
	io_align 65536 io_width 65536 sector_size 4096
	num_stripes 2 sub_stripes 1
		stripe 0 devid 2 offset 2156920832
		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
		stripe 1 devid 1 offset 3553624064
		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca

This caused read requests for a checksum item that to be routed to the
stale disk which triggered the aforementioned logic involving
EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
the balance operation.

Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
CC: stable@vger.kernel.org # 4.4+
Suggested-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---

Hi Greg, 

Please apply this backport of upstream commit f8397d69daef06d358430d3054662fb597e37c00
to 4.14.y

 fs/btrfs/disk-io.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0e67cee73c53..9364cc3b50d4 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -450,9 +450,9 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
 	int mirror_num = 0;
 	int failed_mirror = 0;
 
-	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
 	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
 	while (1) {
+		clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
 		ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
 					       btree_get_extent, mirror_num);
 		if (!ret) {
@@ -463,14 +463,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
 				ret = -EIO;
 		}
 
-		/*
-		 * This buffer's crc is fine, but its contents are corrupted, so
-		 * there is no reason to read the other copies, they won't be
-		 * any less wrong.
-		 */
-		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags))
-			break;
-
 		num_copies = btrfs_num_copies(fs_info,
 					      eb->start, eb->len);
 		if (num_copies == 1)
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] btrfs: Always try all copies when reading extent buffers
  2018-12-04 12:47 ` [PATCH] btrfs: Always try all copies when reading extent buffers Nikolay Borisov
@ 2018-12-06 10:47   ` Greg KH
  0 siblings, 0 replies; 15+ messages in thread
From: Greg KH @ 2018-12-06 10:47 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: stable, dsterba, wqu

On Tue, Dec 04, 2018 at 02:47:49PM +0200, Nikolay Borisov wrote:
> When a metadata read is served the endio routine btree_readpage_end_io_hook
> is called which eventually runs the tree-checker. If tree-checker fails
> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
> leads to btree_read_extent_buffer_pages wrongly assuming that all
> available copies of this extent buffer are wrong and failing prematurely.
> Fix this modify btree_read_extent_buffer_pages to read all copies of
> the data.
> 
> This failure was exhibitted in xfstests btrfs/124 which would
> spuriously fail its balance operations. The reason was that when balance
> was run following re-introduction of the missing raid1 disk
> __btrfs_map_block would map the read request to stripe 0, which
> corresponded to devid 2 (the disk which is being removed in the test):
> 
>     item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
> 	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
> 	io_align 65536 io_width 65536 sector_size 4096
> 	num_stripes 2 sub_stripes 1
> 		stripe 0 devid 2 offset 2156920832
> 		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
> 		stripe 1 devid 1 offset 3553624064
> 		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
> 
> This caused read requests for a checksum item that to be routed to the
> stale disk which triggered the aforementioned logic involving
> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
> the balance operation.
> 
> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
> CC: stable@vger.kernel.org # 4.4+
> Suggested-by: Qu Wenruo <wqu@suse.com>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
> 
> Hi Greg, 
> 
> Please apply this backport of upstream commit f8397d69daef06d358430d3054662fb597e37c00
> to 4.14.y

Now applied, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 15+ messages in thread

* FAILED: patch "[PATCH] btrfs: Always try all copies when reading extent buffers" failed to apply to 4.4-stable tree
@ 2018-12-03 10:02 gregkh
  2018-12-04 12:49 ` [PATCH] btrfs: Always try all copies when reading extent buffers Nikolay Borisov
  0 siblings, 1 reply; 15+ messages in thread
From: gregkh @ 2018-12-03 10:02 UTC (permalink / raw)
  To: nborisov, dsterba, wqu; +Cc: stable

The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

thanks,

greg k-h

------------------ original commit in Linus's tree ------------------

>From f8397d69daef06d358430d3054662fb597e37c00 Mon Sep 17 00:00:00 2001
From: Nikolay Borisov <nborisov@suse.com>
Date: Tue, 6 Nov 2018 16:40:20 +0200
Subject: [PATCH] btrfs: Always try all copies when reading extent buffers

When a metadata read is served the endio routine btree_readpage_end_io_hook
is called which eventually runs the tree-checker. If tree-checker fails
to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
leads to btree_read_extent_buffer_pages wrongly assuming that all
available copies of this extent buffer are wrong and failing prematurely.
Fix this modify btree_read_extent_buffer_pages to read all copies of
the data.

This failure was exhibitted in xfstests btrfs/124 which would
spuriously fail its balance operations. The reason was that when balance
was run following re-introduction of the missing raid1 disk
__btrfs_map_block would map the read request to stripe 0, which
corresponded to devid 2 (the disk which is being removed in the test):

    item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
	io_align 65536 io_width 65536 sector_size 4096
	num_stripes 2 sub_stripes 1
		stripe 0 devid 2 offset 2156920832
		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
		stripe 1 devid 1 offset 3553624064
		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca

This caused read requests for a checksum item that to be routed to the
stale disk which triggered the aforementioned logic involving
EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
the balance operation.

Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
CC: stable@vger.kernel.org # 4.4+
Suggested-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3f0b6d1936e8..6d776717d8b3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
 	int mirror_num = 0;
 	int failed_mirror = 0;

-	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
 	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
 	while (1) {
+		clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
 		ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
 					       mirror_num);
 		if (!ret) {
@@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
 				break;
 		}

-		/*
-		 * This buffer's crc is fine, but its contents are corrupted, so
-		 * there is no reason to read the other copies, they won't be
-		 * any less wrong.
-		 */
-		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
-		    ret == -EUCLEAN)
-			break;
-
 		num_copies = btrfs_num_copies(fs_info,
 					      eb->start, eb->len);
 		if (num_copies == 1)

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH] btrfs: Always try all copies when reading extent buffers
  2018-12-03 10:02 FAILED: patch "[PATCH] btrfs: Always try all copies when reading extent buffers" failed to apply to 4.4-stable tree gregkh
@ 2018-12-04 12:49 ` Nikolay Borisov
  2018-12-06 10:47   ` Greg KH
  0 siblings, 1 reply; 15+ messages in thread
From: Nikolay Borisov @ 2018-12-04 12:49 UTC (permalink / raw)
  To: gregkh; +Cc: stable, dsterba, wqu, Nikolay Borisov

When a metadata read is served the endio routine btree_readpage_end_io_hook
is called which eventually runs the tree-checker. If tree-checker fails
to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
leads to btree_read_extent_buffer_pages wrongly assuming that all
available copies of this extent buffer are wrong and failing prematurely.
Fix this modify btree_read_extent_buffer_pages to read all copies of
the data.

This failure was exhibitted in xfstests btrfs/124 which would
spuriously fail its balance operations. The reason was that when balance
was run following re-introduction of the missing raid1 disk
__btrfs_map_block would map the read request to stripe 0, which
corresponded to devid 2 (the disk which is being removed in the test):

    item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
	io_align 65536 io_width 65536 sector_size 4096
	num_stripes 2 sub_stripes 1
		stripe 0 devid 2 offset 2156920832
		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
		stripe 1 devid 1 offset 3553624064
		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca

This caused read requests for a checksum item that to be routed to the
stale disk which triggered the aforementioned logic involving
EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
the balance operation.

Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
CC: stable@vger.kernel.org # 4.4+
Suggested-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
Hi Greg, 

Please apply this backport of upstream commit f8397d69daef06d358430d3054662fb597e37c00
to 4.4.y

 fs/btrfs/disk-io.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b0875ef48522..1f21c6c33228 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -445,9 +445,9 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 	int mirror_num = 0;
 	int failed_mirror = 0;
 
-	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
 	io_tree = &BTRFS_I(root->fs_info->btree_inode)->io_tree;
 	while (1) {
+		clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
 		ret = read_extent_buffer_pages(io_tree, eb, start,
 					       WAIT_COMPLETE,
 					       btree_get_extent, mirror_num);
@@ -459,14 +459,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 				ret = -EIO;
 		}
 
-		/*
-		 * This buffer's crc is fine, but its contents are corrupted, so
-		 * there is no reason to read the other copies, they won't be
-		 * any less wrong.
-		 */
-		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags))
-			break;
-
 		num_copies = btrfs_num_copies(root->fs_info,
 					      eb->start, eb->len);
 		if (num_copies == 1)
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] btrfs: Always try all copies when reading extent buffers
  2018-12-04 12:49 ` [PATCH] btrfs: Always try all copies when reading extent buffers Nikolay Borisov
@ 2018-12-06 10:47   ` Greg KH
  0 siblings, 0 replies; 15+ messages in thread
From: Greg KH @ 2018-12-06 10:47 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: stable, dsterba, wqu

On Tue, Dec 04, 2018 at 02:49:03PM +0200, Nikolay Borisov wrote:
> When a metadata read is served the endio routine btree_readpage_end_io_hook
> is called which eventually runs the tree-checker. If tree-checker fails
> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
> leads to btree_read_extent_buffer_pages wrongly assuming that all
> available copies of this extent buffer are wrong and failing prematurely.
> Fix this modify btree_read_extent_buffer_pages to read all copies of
> the data.
> 
> This failure was exhibitted in xfstests btrfs/124 which would
> spuriously fail its balance operations. The reason was that when balance
> was run following re-introduction of the missing raid1 disk
> __btrfs_map_block would map the read request to stripe 0, which
> corresponded to devid 2 (the disk which is being removed in the test):
> 
>     item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
> 	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
> 	io_align 65536 io_width 65536 sector_size 4096
> 	num_stripes 2 sub_stripes 1
> 		stripe 0 devid 2 offset 2156920832
> 		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
> 		stripe 1 devid 1 offset 3553624064
> 		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
> 
> This caused read requests for a checksum item that to be routed to the
> stale disk which triggered the aforementioned logic involving
> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
> the balance operation.
> 
> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
> CC: stable@vger.kernel.org # 4.4+
> Suggested-by: Qu Wenruo <wqu@suse.com>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
> Hi Greg, 
> 
> Please apply this backport of upstream commit f8397d69daef06d358430d3054662fb597e37c00
> to 4.4.y

Now applied, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 15+ messages in thread

* FAILED: patch "[PATCH] btrfs: Always try all copies when reading extent buffers" failed to apply to 4.9-stable tree
@ 2018-12-03 10:02 gregkh
  2018-12-04 12:48 ` [PATCH] btrfs: Always try all copies when reading extent buffers Nikolay Borisov
  0 siblings, 1 reply; 15+ messages in thread
From: gregkh @ 2018-12-03 10:02 UTC (permalink / raw)
  To: nborisov, dsterba, wqu; +Cc: stable

The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

thanks,

greg k-h

------------------ original commit in Linus's tree ------------------

>From f8397d69daef06d358430d3054662fb597e37c00 Mon Sep 17 00:00:00 2001
From: Nikolay Borisov <nborisov@suse.com>
Date: Tue, 6 Nov 2018 16:40:20 +0200
Subject: [PATCH] btrfs: Always try all copies when reading extent buffers

When a metadata read is served the endio routine btree_readpage_end_io_hook
is called which eventually runs the tree-checker. If tree-checker fails
to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
leads to btree_read_extent_buffer_pages wrongly assuming that all
available copies of this extent buffer are wrong and failing prematurely.
Fix this modify btree_read_extent_buffer_pages to read all copies of
the data.

This failure was exhibitted in xfstests btrfs/124 which would
spuriously fail its balance operations. The reason was that when balance
was run following re-introduction of the missing raid1 disk
__btrfs_map_block would map the read request to stripe 0, which
corresponded to devid 2 (the disk which is being removed in the test):

    item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
	io_align 65536 io_width 65536 sector_size 4096
	num_stripes 2 sub_stripes 1
		stripe 0 devid 2 offset 2156920832
		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
		stripe 1 devid 1 offset 3553624064
		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca

This caused read requests for a checksum item that to be routed to the
stale disk which triggered the aforementioned logic involving
EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
the balance operation.

Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
CC: stable@vger.kernel.org # 4.4+
Suggested-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3f0b6d1936e8..6d776717d8b3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
 	int mirror_num = 0;
 	int failed_mirror = 0;

-	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
 	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
 	while (1) {
+		clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
 		ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
 					       mirror_num);
 		if (!ret) {
@@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
 				break;
 		}

-		/*
-		 * This buffer's crc is fine, but its contents are corrupted, so
-		 * there is no reason to read the other copies, they won't be
-		 * any less wrong.
-		 */
-		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
-		    ret == -EUCLEAN)
-			break;
-
 		num_copies = btrfs_num_copies(fs_info,
 					      eb->start, eb->len);
 		if (num_copies == 1)

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH] btrfs: Always try all copies when reading extent buffers
  2018-12-03 10:02 FAILED: patch "[PATCH] btrfs: Always try all copies when reading extent buffers" failed to apply to 4.9-stable tree gregkh
@ 2018-12-04 12:48 ` Nikolay Borisov
  2018-12-06 10:45   ` Greg KH
  2018-12-06 10:48   ` Greg KH
  0 siblings, 2 replies; 15+ messages in thread
From: Nikolay Borisov @ 2018-12-04 12:48 UTC (permalink / raw)
  To: gregkh; +Cc: stable, dsterba, wqu, Nikolay Borisov

When a metadata read is served the endio routine btree_readpage_end_io_hook
is called which eventually runs the tree-checker. If tree-checker fails
to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
leads to btree_read_extent_buffer_pages wrongly assuming that all
available copies of this extent buffer are wrong and failing prematurely.
Fix this modify btree_read_extent_buffer_pages to read all copies of
the data.

This failure was exhibitted in xfstests btrfs/124 which would
spuriously fail its balance operations. The reason was that when balance
was run following re-introduction of the missing raid1 disk
__btrfs_map_block would map the read request to stripe 0, which
corresponded to devid 2 (the disk which is being removed in the test):

    item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
	io_align 65536 io_width 65536 sector_size 4096
	num_stripes 2 sub_stripes 1
		stripe 0 devid 2 offset 2156920832
		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
		stripe 1 devid 1 offset 3553624064
		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca

This caused read requests for a checksum item that to be routed to the
stale disk which triggered the aforementioned logic involving
EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
the balance operation.

Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
CC: stable@vger.kernel.org # 4.4+
Suggested-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
Hi Gregk, 

Please apply this backport of upstream commit f8397d69daef06d358430d3054662fb597e37c00
to 4.9.y

 fs/btrfs/disk-io.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 57d375c68e46..875715769a29 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -452,9 +452,9 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 	int mirror_num = 0;
 	int failed_mirror = 0;
 
-	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
 	io_tree = &BTRFS_I(root->fs_info->btree_inode)->io_tree;
 	while (1) {
+		clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
 		ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
 					       btree_get_extent, mirror_num);
 		if (!ret) {
@@ -465,14 +465,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 				ret = -EIO;
 		}
 
-		/*
-		 * This buffer's crc is fine, but its contents are corrupted, so
-		 * there is no reason to read the other copies, they won't be
-		 * any less wrong.
-		 */
-		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags))
-			break;
-
 		num_copies = btrfs_num_copies(root->fs_info,
 					      eb->start, eb->len);
 		if (num_copies == 1)
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] btrfs: Always try all copies when reading extent buffers
  2018-12-04 12:48 ` [PATCH] btrfs: Always try all copies when reading extent buffers Nikolay Borisov
@ 2018-12-06 10:45   ` Greg KH
  2018-12-06 10:48   ` Greg KH
  1 sibling, 0 replies; 15+ messages in thread
From: Greg KH @ 2018-12-06 10:45 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: stable, dsterba, wqu

On Tue, Dec 04, 2018 at 02:48:18PM +0200, Nikolay Borisov wrote:
> When a metadata read is served the endio routine btree_readpage_end_io_hook
> is called which eventually runs the tree-checker. If tree-checker fails
> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
> leads to btree_read_extent_buffer_pages wrongly assuming that all
> available copies of this extent buffer are wrong and failing prematurely.
> Fix this modify btree_read_extent_buffer_pages to read all copies of
> the data.
> 
> This failure was exhibitted in xfstests btrfs/124 which would
> spuriously fail its balance operations. The reason was that when balance
> was run following re-introduction of the missing raid1 disk
> __btrfs_map_block would map the read request to stripe 0, which
> corresponded to devid 2 (the disk which is being removed in the test):
> 
>     item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
> 	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
> 	io_align 65536 io_width 65536 sector_size 4096
> 	num_stripes 2 sub_stripes 1
> 		stripe 0 devid 2 offset 2156920832
> 		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
> 		stripe 1 devid 1 offset 3553624064
> 		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
> 
> This caused read requests for a checksum item that to be routed to the
> stale disk which triggered the aforementioned logic involving
> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
> the balance operation.
> 
> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
> CC: stable@vger.kernel.org # 4.4+
> Suggested-by: Qu Wenruo <wqu@suse.com>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
> Hi Gregk, 
> 
> Please apply this backport of upstream commit f8397d69daef06d358430d3054662fb597e37c00
> to 4.9.y

This didn't apply at all to 4.9.y, did something go wrong with it?

greg k-h

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] btrfs: Always try all copies when reading extent buffers
  2018-12-04 12:48 ` [PATCH] btrfs: Always try all copies when reading extent buffers Nikolay Borisov
  2018-12-06 10:45   ` Greg KH
@ 2018-12-06 10:48   ` Greg KH
  1 sibling, 0 replies; 15+ messages in thread
From: Greg KH @ 2018-12-06 10:48 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: stable, dsterba, wqu

On Tue, Dec 04, 2018 at 02:48:18PM +0200, Nikolay Borisov wrote:
> When a metadata read is served the endio routine btree_readpage_end_io_hook
> is called which eventually runs the tree-checker. If tree-checker fails
> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
> leads to btree_read_extent_buffer_pages wrongly assuming that all
> available copies of this extent buffer are wrong and failing prematurely.
> Fix this modify btree_read_extent_buffer_pages to read all copies of
> the data.
> 
> This failure was exhibitted in xfstests btrfs/124 which would
> spuriously fail its balance operations. The reason was that when balance
> was run following re-introduction of the missing raid1 disk
> __btrfs_map_block would map the read request to stripe 0, which
> corresponded to devid 2 (the disk which is being removed in the test):
> 
>     item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
> 	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
> 	io_align 65536 io_width 65536 sector_size 4096
> 	num_stripes 2 sub_stripes 1
> 		stripe 0 devid 2 offset 2156920832
> 		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
> 		stripe 1 devid 1 offset 3553624064
> 		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
> 
> This caused read requests for a checksum item that to be routed to the
> stale disk which triggered the aforementioned logic involving
> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
> the balance operation.
> 
> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
> CC: stable@vger.kernel.org # 4.4+
> Suggested-by: Qu Wenruo <wqu@suse.com>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
> Hi Gregk, 
> 
> Please apply this backport of upstream commit f8397d69daef06d358430d3054662fb597e37c00
> to 4.9.y

I think this was the 4.14.y version of the patch :(

I took the 4.4.y version and fixed it up by hand...

greg k-h

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] btrfs: Always try all copies when reading extent buffers
@ 2018-11-06 14:40 Nikolay Borisov
  2018-11-06 14:53 ` Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Nikolay Borisov @ 2018-11-06 14:40 UTC (permalink / raw)
  To: linux-btrfs; +Cc: wqu, josef, Nikolay Borisov

When a metadata read is served the endio routine btree_readpage_end_io_hook
is called which eventually runs the tree-checker. If tree-checker fails
to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
leads to btree_read_extent_buffer_pages wrongly assuming that all
available copies of this extent buffer are wrong and failing prematurely.
Fix this modify btree_read_extent_buffer_pages to read all copies of
the data.

This failure was exhibitted in xfstests btrfs/124 which would
spuriously fail its balance operations. The reason was that when balance
was run following re-introduction of the missing raid1 disk
__btrfs_map_block would map the read request to stripe 0, which
corresponded to devid 2 (the disk which is being removed in the test):

    item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
	io_align 65536 io_width 65536 sector_size 4096
	num_stripes 2 sub_stripes 1
		stripe 0 devid 2 offset 2156920832
		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
		stripe 1 devid 1 offset 3553624064
		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca

This caused read requests for a checksum item that to be routed to the
stale disk which triggered the aforementioned logic involving
EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
the balance operation.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Suggested-by: Qu Wenruo <wqu@suse.com>
Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
---
 fs/btrfs/disk-io.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 00ee5e37e989..279c6dbcc736 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
 	int mirror_num = 0;
 	int failed_mirror = 0;

-	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
 	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
 	while (1) {
+		clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
 		ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
 					       mirror_num);
 		if (!ret) {
@@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
 				break;
 		}

-		/*
-		 * This buffer's crc is fine, but its contents are corrupted, so
-		 * there is no reason to read the other copies, they won't be
-		 * any less wrong.
-		 */
-		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
-		    ret == -EUCLEAN)
-			break;
-
 		num_copies = btrfs_num_copies(fs_info,
 					      eb->start, eb->len);
 		if (num_copies == 1)
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] btrfs: Always try all copies when reading extent buffers
  2018-11-06 14:40 Nikolay Borisov
@ 2018-11-06 14:53 ` Qu Wenruo
  2018-11-06 15:14   ` Nikolay Borisov
  2018-11-06 16:07 ` Nikolay Borisov
  2018-11-12 21:30 ` David Sterba
  2 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2018-11-06 14:53 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs; +Cc: wqu, josef



On 2018/11/6 下午10:40, Nikolay Borisov wrote:
> When a metadata read is served the endio routine btree_readpage_end_io_hook
> is called which eventually runs the tree-checker. If tree-checker fails
> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
> leads to btree_read_extent_buffer_pages wrongly assuming that all
> available copies of this extent buffer are wrong and failing prematurely.
> Fix this modify btree_read_extent_buffer_pages to read all copies of
> the data.
> 
> This failure was exhibitted in xfstests btrfs/124 which would
> spuriously fail its balance operations. The reason was that when balance
> was run following re-introduction of the missing raid1 disk
> __btrfs_map_block would map the read request to stripe 0, which
> corresponded to devid 2 (the disk which is being removed in the test):
> 
>     item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
> 	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
> 	io_align 65536 io_width 65536 sector_size 4096
> 	num_stripes 2 sub_stripes 1
> 		stripe 0 devid 2 offset 2156920832
> 		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
> 		stripe 1 devid 1 offset 3553624064
> 		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
> 
> This caused read requests for a checksum item that to be routed to the
> stale disk which triggered the aforementioned logic involving
> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
> the balance operation.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Suggested-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

However there is still a tiny piece missing.

Tree checker is done after some basic checks, including:
1) bytenr
2) level
3) fsid
4) csum

1~2) can be easily skipped just by pure luck.

But 3) and especially 4) are not that easy to hit.
Not to mention meeting both 3) and 4), since csum range covers fsid.

So I must say you're a really super lucky guy!

Thanks,
Qu


> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
> ---
>  fs/btrfs/disk-io.c | 11 +----------
>  1 file changed, 1 insertion(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 00ee5e37e989..279c6dbcc736 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>  	int mirror_num = 0;
>  	int failed_mirror = 0;
>  
> -	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>  	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
>  	while (1) {
> +		clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>  		ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
>  					       mirror_num);
>  		if (!ret) {
> @@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>  				break;
>  		}
>  
> -		/*
> -		 * This buffer's crc is fine, but its contents are corrupted, so
> -		 * there is no reason to read the other copies, they won't be
> -		 * any less wrong.
> -		 */
> -		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
> -		    ret == -EUCLEAN)
> -			break;
> -
>  		num_copies = btrfs_num_copies(fs_info,
>  					      eb->start, eb->len);
>  		if (num_copies == 1)
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] btrfs: Always try all copies when reading extent buffers
  2018-11-06 14:53 ` Qu Wenruo
@ 2018-11-06 15:14   ` Nikolay Borisov
  2018-11-07  0:18     ` Qu Wenruo
  0 siblings, 1 reply; 15+ messages in thread
From: Nikolay Borisov @ 2018-11-06 15:14 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: wqu, josef



On 6.11.18 г. 16:53 ч., Qu Wenruo wrote:
> 
> 
> On 2018/11/6 下午10:40, Nikolay Borisov wrote:
>> When a metadata read is served the endio routine btree_readpage_end_io_hook
>> is called which eventually runs the tree-checker. If tree-checker fails
>> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
>> leads to btree_read_extent_buffer_pages wrongly assuming that all
>> available copies of this extent buffer are wrong and failing prematurely.
>> Fix this modify btree_read_extent_buffer_pages to read all copies of
>> the data.
>>
>> This failure was exhibitted in xfstests btrfs/124 which would
>> spuriously fail its balance operations. The reason was that when balance
>> was run following re-introduction of the missing raid1 disk
>> __btrfs_map_block would map the read request to stripe 0, which
>> corresponded to devid 2 (the disk which is being removed in the test):
>>
>>     item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
>> 	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
>> 	io_align 65536 io_width 65536 sector_size 4096
>> 	num_stripes 2 sub_stripes 1
>> 		stripe 0 devid 2 offset 2156920832
>> 		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
>> 		stripe 1 devid 1 offset 3553624064
>> 		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
>>
>> This caused read requests for a checksum item that to be routed to the
>> stale disk which triggered the aforementioned logic involving
>> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
>> the balance operation.
>>
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> Suggested-by: Qu Wenruo <wqu@suse.com>
> 
> Reviewed-by: Qu Wenruo <wqu@suse.com>
> 
> However there is still a tiny piece missing.
> 
> Tree checker is done after some basic checks, including:
> 1) bytenr
> 2) level
> 3) fsid
> 4) csum
> 
> 1~2) can be easily skipped just by pure luck.
> 
> But 3) and especially 4) are not that easy to hit.
> Not to mention meeting both 3) and 4), since csum range covers fsid.
> 
> So I must say you're a really super lucky guy!

s/lucky/unlucky/ :)

I wonder if reads just land in some random memory on the stale disk
hence it's really a matter of what's "there" so that reads fail with
random reasons?

> 
> Thanks,
> Qu
> 
> 
>> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
>> ---
>>  fs/btrfs/disk-io.c | 11 +----------
>>  1 file changed, 1 insertion(+), 10 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 00ee5e37e989..279c6dbcc736 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>>  	int mirror_num = 0;
>>  	int failed_mirror = 0;
>>  
>> -	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>>  	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
>>  	while (1) {
>> +		clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>>  		ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
>>  					       mirror_num);
>>  		if (!ret) {
>> @@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>>  				break;
>>  		}
>>  
>> -		/*
>> -		 * This buffer's crc is fine, but its contents are corrupted, so
>> -		 * there is no reason to read the other copies, they won't be
>> -		 * any less wrong.
>> -		 */
>> -		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
>> -		    ret == -EUCLEAN)
>> -			break;
>> -
>>  		num_copies = btrfs_num_copies(fs_info,
>>  					      eb->start, eb->len);
>>  		if (num_copies == 1)
>>
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] btrfs: Always try all copies when reading extent buffers
  2018-11-06 15:14   ` Nikolay Borisov
@ 2018-11-07  0:18     ` Qu Wenruo
  0 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2018-11-07  0:18 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs; +Cc: wqu, josef



On 2018/11/6 下午11:14, Nikolay Borisov wrote:
> 
> 
> On 6.11.18 г. 16:53 ч., Qu Wenruo wrote:
>>
>>
>> On 2018/11/6 下午10:40, Nikolay Borisov wrote:
>>> When a metadata read is served the endio routine btree_readpage_end_io_hook
>>> is called which eventually runs the tree-checker. If tree-checker fails
>>> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
>>> leads to btree_read_extent_buffer_pages wrongly assuming that all
>>> available copies of this extent buffer are wrong and failing prematurely.
>>> Fix this modify btree_read_extent_buffer_pages to read all copies of
>>> the data.
>>>
>>> This failure was exhibitted in xfstests btrfs/124 which would
>>> spuriously fail its balance operations. The reason was that when balance
>>> was run following re-introduction of the missing raid1 disk
>>> __btrfs_map_block would map the read request to stripe 0, which
>>> corresponded to devid 2 (the disk which is being removed in the test):
>>>
>>>     item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
>>> 	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
>>> 	io_align 65536 io_width 65536 sector_size 4096
>>> 	num_stripes 2 sub_stripes 1
>>> 		stripe 0 devid 2 offset 2156920832
>>> 		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
>>> 		stripe 1 devid 1 offset 3553624064
>>> 		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
>>>
>>> This caused read requests for a checksum item that to be routed to the
>>> stale disk which triggered the aforementioned logic involving
>>> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
>>> the balance operation.
>>>
>>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>>> Suggested-by: Qu Wenruo <wqu@suse.com>
>>
>> Reviewed-by: Qu Wenruo <wqu@suse.com>
>>
>> However there is still a tiny piece missing.
>>
>> Tree checker is done after some basic checks, including:
>> 1) bytenr
>> 2) level
>> 3) fsid
>> 4) csum
>>
>> 1~2) can be easily skipped just by pure luck.
>>
>> But 3) and especially 4) are not that easy to hit.
>> Not to mention meeting both 3) and 4), since csum range covers fsid.
>>
>> So I must say you're a really super lucky guy!
> 
> s/lucky/unlucky/ :)
> 
> I wonder if reads just land in some random memory on the stale disk
> hence it's really a matter of what's "there" so that reads fail with
> random reasons?

Even for random memory, you're still too lucky to hit it.

Thanks,
Qu

> 
>>
>> Thanks,
>> Qu
>>
>>
>>> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
>>> ---
>>>  fs/btrfs/disk-io.c | 11 +----------
>>>  1 file changed, 1 insertion(+), 10 deletions(-)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 00ee5e37e989..279c6dbcc736 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>>>  	int mirror_num = 0;
>>>  	int failed_mirror = 0;
>>>  
>>> -	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>>>  	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
>>>  	while (1) {
>>> +		clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>>>  		ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
>>>  					       mirror_num);
>>>  		if (!ret) {
>>> @@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>>>  				break;
>>>  		}
>>>  
>>> -		/*
>>> -		 * This buffer's crc is fine, but its contents are corrupted, so
>>> -		 * there is no reason to read the other copies, they won't be
>>> -		 * any less wrong.
>>> -		 */
>>> -		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
>>> -		    ret == -EUCLEAN)
>>> -			break;
>>> -
>>>  		num_copies = btrfs_num_copies(fs_info,
>>>  					      eb->start, eb->len);
>>>  		if (num_copies == 1)
>>>
>>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] btrfs: Always try all copies when reading extent buffers
  2018-11-06 14:40 Nikolay Borisov
  2018-11-06 14:53 ` Qu Wenruo
@ 2018-11-06 16:07 ` Nikolay Borisov
  2018-11-07  0:23   ` Qu Wenruo
  2018-11-12 21:30 ` David Sterba
  2 siblings, 1 reply; 15+ messages in thread
From: Nikolay Borisov @ 2018-11-06 16:07 UTC (permalink / raw)
  To: linux-btrfs; +Cc: wqu, josef



On 6.11.18 г. 16:40 ч., Nikolay Borisov wrote:
> When a metadata read is served the endio routine btree_readpage_end_io_hook
> is called which eventually runs the tree-checker. If tree-checker fails
> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
> leads to btree_read_extent_buffer_pages wrongly assuming that all
> available copies of this extent buffer are wrong and failing prematurely.
> Fix this modify btree_read_extent_buffer_pages to read all copies of
> the data.
> 
> This failure was exhibitted in xfstests btrfs/124 which would
> spuriously fail its balance operations. The reason was that when balance
> was run following re-introduction of the missing raid1 disk
> __btrfs_map_block would map the read request to stripe 0, which
> corresponded to devid 2 (the disk which is being removed in the test):
> 
>     item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
> 	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
> 	io_align 65536 io_width 65536 sector_size 4096
> 	num_stripes 2 sub_stripes 1
> 		stripe 0 devid 2 offset 2156920832
> 		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
> 		stripe 1 devid 1 offset 3553624064
> 		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
> 
> This caused read requests for a checksum item that to be routed to the
> stale disk which triggered the aforementioned logic involving
> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
> the balance operation.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Suggested-by: Qu Wenruo <wqu@suse.com>
> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
> ---
>  fs/btrfs/disk-io.c | 11 +----------
>  1 file changed, 1 insertion(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 00ee5e37e989..279c6dbcc736 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>  	int mirror_num = 0;
>  	int failed_mirror = 0;
>  
> -	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>  	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
>  	while (1) {
> +		clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>  		ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
>  					       mirror_num);
>  		if (!ret) {

Qu, 

Do you think it makes sense to do refactoring like below in
 a follow up patch: 

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 279c6dbcc736..9891e13a2b6f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -482,16 +482,11 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
                clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
                ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
                                               mirror_num);
-               if (!ret) {
-                       if (verify_parent_transid(io_tree, eb,
-                                                  parent_transid, 0))
-                               ret = -EIO;
-                       else if (verify_level_key(fs_info, eb, level,
-                                                 first_key, parent_transid))
-                               ret = -EUCLEAN;
-                       else
+               if (!ret &&
+                   !verify_parent_transid(io_tree, eb, parent_transid, 0) &&
+                   !verify_level_key(fs_info, eb, level, first_key,
+                                    parent_transid))
                                break;
-               }


since the ret value doesn't really have any meaning or perhaps the 
verify_level_key and ret = -EUCLEAN could be reteinaed as well as the 
if (ret == EUCLEAN) break logic ? 

> @@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>  				break;
>  		}
>  
> -		/*
> -		 * This buffer's crc is fine, but its contents are corrupted, so
> -		 * there is no reason to read the other copies, they won't be
> -		 * any less wrong.
> -		 */
> -		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
> -		    ret == -EUCLEAN)
> -			break;
> -
>  		num_copies = btrfs_num_copies(fs_info,
>  					      eb->start, eb->len);
>  		if (num_copies == 1)
> 

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] btrfs: Always try all copies when reading extent buffers
  2018-11-06 16:07 ` Nikolay Borisov
@ 2018-11-07  0:23   ` Qu Wenruo
  0 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2018-11-07  0:23 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs; +Cc: josef



On 2018/11/7 上午12:07, Nikolay Borisov wrote:
> 
> 
> On 6.11.18 г. 16:40 ч., Nikolay Borisov wrote:
>> When a metadata read is served the endio routine btree_readpage_end_io_hook
>> is called which eventually runs the tree-checker. If tree-checker fails
>> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
>> leads to btree_read_extent_buffer_pages wrongly assuming that all
>> available copies of this extent buffer are wrong and failing prematurely.
>> Fix this modify btree_read_extent_buffer_pages to read all copies of
>> the data.
>>
>> This failure was exhibitted in xfstests btrfs/124 which would
>> spuriously fail its balance operations. The reason was that when balance
>> was run following re-introduction of the missing raid1 disk
>> __btrfs_map_block would map the read request to stripe 0, which
>> corresponded to devid 2 (the disk which is being removed in the test):
>>
>>     item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
>> 	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
>> 	io_align 65536 io_width 65536 sector_size 4096
>> 	num_stripes 2 sub_stripes 1
>> 		stripe 0 devid 2 offset 2156920832
>> 		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
>> 		stripe 1 devid 1 offset 3553624064
>> 		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
>>
>> This caused read requests for a checksum item that to be routed to the
>> stale disk which triggered the aforementioned logic involving
>> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
>> the balance operation.
>>
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> Suggested-by: Qu Wenruo <wqu@suse.com>
>> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")
>> ---
>>  fs/btrfs/disk-io.c | 11 +----------
>>  1 file changed, 1 insertion(+), 10 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 00ee5e37e989..279c6dbcc736 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -477,9 +477,9 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>>  	int mirror_num = 0;
>>  	int failed_mirror = 0;
>>  
>> -	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>>  	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
>>  	while (1) {
>> +		clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>>  		ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
>>  					       mirror_num);
>>  		if (!ret) {
> 
> Qu, 
> 
> Do you think it makes sense to do refactoring like below in
>  a follow up patch: 
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 279c6dbcc736..9891e13a2b6f 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -482,16 +482,11 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>                 clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
>                 ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
>                                                mirror_num);
> -               if (!ret) {
> -                       if (verify_parent_transid(io_tree, eb,
> -                                                  parent_transid, 0))
> -                               ret = -EIO;
> -                       else if (verify_level_key(fs_info, eb, level,
> -                                                 first_key, parent_transid))
> -                               ret = -EUCLEAN;
> -                       else
> +               if (!ret &&
> +                   !verify_parent_transid(io_tree, eb, parent_transid, 0) &&
> +                   !verify_level_key(fs_info, eb, level, first_key,
> +                                    parent_transid))
>                                 break;
> -               }
> 
> 
> since the ret value doesn't really have any meaning or perhaps the 
> verify_level_key and ret = -EUCLEAN could be reteinaed as well as the 
> if (ret == EUCLEAN) break logic ? 

Yes, that's a valid cleanup.

Thanks,
Qu

> 
>> @@ -493,15 +493,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
>>  				break;
>>  		}
>>  
>> -		/*
>> -		 * This buffer's crc is fine, but its contents are corrupted, so
>> -		 * there is no reason to read the other copies, they won't be
>> -		 * any less wrong.
>> -		 */
>> -		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
>> -		    ret == -EUCLEAN)
>> -			break;
>> -
>>  		num_copies = btrfs_num_copies(fs_info,
>>  					      eb->start, eb->len);
>>  		if (num_copies == 1)
>>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] btrfs: Always try all copies when reading extent buffers
  2018-11-06 14:40 Nikolay Borisov
  2018-11-06 14:53 ` Qu Wenruo
  2018-11-06 16:07 ` Nikolay Borisov
@ 2018-11-12 21:30 ` David Sterba
  2 siblings, 0 replies; 15+ messages in thread
From: David Sterba @ 2018-11-12 21:30 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs, wqu, josef

On Tue, Nov 06, 2018 at 04:40:20PM +0200, Nikolay Borisov wrote:
> When a metadata read is served the endio routine btree_readpage_end_io_hook
> is called which eventually runs the tree-checker. If tree-checker fails
> to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
> leads to btree_read_extent_buffer_pages wrongly assuming that all
> available copies of this extent buffer are wrong and failing prematurely.
> Fix this modify btree_read_extent_buffer_pages to read all copies of
> the data.
> 
> This failure was exhibitted in xfstests btrfs/124 which would
> spuriously fail its balance operations. The reason was that when balance
> was run following re-introduction of the missing raid1 disk
> __btrfs_map_block would map the read request to stripe 0, which
> corresponded to devid 2 (the disk which is being removed in the test):
> 
>     item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
> 	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
> 	io_align 65536 io_width 65536 sector_size 4096
> 	num_stripes 2 sub_stripes 1
> 		stripe 0 devid 2 offset 2156920832
> 		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
> 		stripe 1 devid 1 offset 3553624064
> 		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
> 
> This caused read requests for a checksum item that to be routed to the
> stale disk which triggered the aforementioned logic involving
> EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
> the balance operation.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Suggested-by: Qu Wenruo <wqu@suse.com>
> Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search")

Added to misc-next, thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-12-06 10:48 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-12-03 10:02 FAILED: patch "[PATCH] btrfs: Always try all copies when reading extent buffers" failed to apply to 4.14-stable tree gregkh
2018-12-04 12:47 ` [PATCH] btrfs: Always try all copies when reading extent buffers Nikolay Borisov
2018-12-06 10:47   ` Greg KH
  -- strict thread matches above, loose matches on Subject: below --
2018-12-03 10:02 FAILED: patch "[PATCH] btrfs: Always try all copies when reading extent buffers" failed to apply to 4.4-stable tree gregkh
2018-12-04 12:49 ` [PATCH] btrfs: Always try all copies when reading extent buffers Nikolay Borisov
2018-12-06 10:47   ` Greg KH
2018-12-03 10:02 FAILED: patch "[PATCH] btrfs: Always try all copies when reading extent buffers" failed to apply to 4.9-stable tree gregkh
2018-12-04 12:48 ` [PATCH] btrfs: Always try all copies when reading extent buffers Nikolay Borisov
2018-12-06 10:45   ` Greg KH
2018-12-06 10:48   ` Greg KH
2018-11-06 14:40 Nikolay Borisov
2018-11-06 14:53 ` Qu Wenruo
2018-11-06 15:14   ` Nikolay Borisov
2018-11-07  0:18     ` Qu Wenruo
2018-11-06 16:07 ` Nikolay Borisov
2018-11-07  0:23   ` Qu Wenruo
2018-11-12 21:30 ` David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.