linux-btrfs.vger.kernel.org archive mirror
* [PATCH RFC 1/2] btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block()
@ 2018-07-10  5:57 Qu Wenruo
  2018-07-10  5:57 ` [PATCH RFC 2/2] btrfs: scrub: Cleanup the nodatasum fixup code Qu Wenruo
  2018-07-10  7:00 ` [PATCH RFC 1/2] btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block() Nikolay Borisov
  0 siblings, 2 replies; 7+ messages in thread
From: Qu Wenruo @ 2018-07-10  5:57 UTC (permalink / raw)
  To: linux-btrfs

When we need to fix up error blocks during scrub/dev-replace for
nodatasum extents, we still go through the inode page cache and write
the pages back onto disk.

It has already been proven that writing the inode page cache back onto
disk like this can lead to serious data corruption for compressed
extents.
So we also need to avoid that case here, by avoiding any call to
scrub_fixup_nodatasum().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 572306036477..328232fa5646 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1151,11 +1151,6 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		return ret;
 	}
 
-	if (sctx->is_dev_replace && !is_metadata && !have_csum) {
-		sblocks_for_recheck = NULL;
-		goto nodatasum_case;
-	}
-
 	/*
 	 * read all mirrors one after the other. This includes to
 	 * re-read the extent or metadata block that failed (that was
@@ -1268,13 +1263,20 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		goto out;
 	}
 
-	if (!is_metadata && !have_csum) {
+	/*
+	 * NOTE: Even for nodatasum data case, it's still possible that it's
+	 * compressed data extent, thus scrub_fixup_nodatasum(), which
+	 * write inode page cache onto disk, could cause serious data
+	 * corruption.
+	 *
+	 * So here we could only read from disk, and hopes our recovery
+	 * could reach disk before newer write.
+	 */
+	if (0 && !is_metadata && !have_csum) {
 		struct scrub_fixup_nodatasum *fixup_nodatasum;
 
 		WARN_ON(sctx->is_dev_replace);
 
-nodatasum_case:
-
 		/*
 		 * !is_metadata and !have_csum, this means that the data
 		 * might not be COWed, that it might be modified
-- 
2.18.0



* [PATCH RFC 2/2] btrfs: scrub: Cleanup the nodatasum fixup code
  2018-07-10  5:57 [PATCH RFC 1/2] btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block() Qu Wenruo
@ 2018-07-10  5:57 ` Qu Wenruo
  2018-07-10  7:00 ` [PATCH RFC 1/2] btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block() Nikolay Borisov
  1 sibling, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2018-07-10  5:57 UTC (permalink / raw)
  To: linux-btrfs

Since we no longer use the old method of fixing up nodatasum errors
during scrub/replace, which used the inode page cache and could cause
serious data corruption for compressed nodatasum extents, remove all
the related code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 234 -----------------------------------------------
 1 file changed, 234 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 328232fa5646..8580b53fcf32 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -188,15 +188,6 @@ struct scrub_ctx {
 	refcount_t              refs;
 };
 
-struct scrub_fixup_nodatasum {
-	struct scrub_ctx	*sctx;
-	struct btrfs_device	*dev;
-	u64			logical;
-	struct btrfs_root	*root;
-	struct btrfs_work	work;
-	int			mirror_num;
-};
-
 struct scrub_nocow_inode {
 	u64			inum;
 	u64			offset;
@@ -882,194 +873,6 @@ static void scrub_print_warning(const char *errstr, struct scrub_block *sblock)
 	btrfs_free_path(path);
 }
 
-static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx)
-{
-	struct page *page = NULL;
-	unsigned long index;
-	struct scrub_fixup_nodatasum *fixup = fixup_ctx;
-	int ret;
-	int corrected = 0;
-	struct btrfs_key key;
-	struct inode *inode = NULL;
-	struct btrfs_fs_info *fs_info;
-	u64 end = offset + PAGE_SIZE - 1;
-	struct btrfs_root *local_root;
-	int srcu_index;
-
-	key.objectid = root;
-	key.type = BTRFS_ROOT_ITEM_KEY;
-	key.offset = (u64)-1;
-
-	fs_info = fixup->root->fs_info;
-	srcu_index = srcu_read_lock(&fs_info->subvol_srcu);
-
-	local_root = btrfs_read_fs_root_no_name(fs_info, &key);
-	if (IS_ERR(local_root)) {
-		srcu_read_unlock(&fs_info->subvol_srcu, srcu_index);
-		return PTR_ERR(local_root);
-	}
-
-	key.type = BTRFS_INODE_ITEM_KEY;
-	key.objectid = inum;
-	key.offset = 0;
-	inode = btrfs_iget(fs_info->sb, &key, local_root, NULL);
-	srcu_read_unlock(&fs_info->subvol_srcu, srcu_index);
-	if (IS_ERR(inode))
-		return PTR_ERR(inode);
-
-	index = offset >> PAGE_SHIFT;
-
-	page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
-	if (!page) {
-		ret = -ENOMEM;
-		goto out;
-	}
-
-	if (PageUptodate(page)) {
-		if (PageDirty(page)) {
-			/*
-			 * we need to write the data to the defect sector. the
-			 * data that was in that sector is not in memory,
-			 * because the page was modified. we must not write the
-			 * modified page to that sector.
-			 *
-			 * TODO: what could be done here: wait for the delalloc
-			 *       runner to write out that page (might involve
-			 *       COW) and see whether the sector is still
-			 *       referenced afterwards.
-			 *
-			 * For the meantime, we'll treat this error
-			 * incorrectable, although there is a chance that a
-			 * later scrub will find the bad sector again and that
-			 * there's no dirty page in memory, then.
-			 */
-			ret = -EIO;
-			goto out;
-		}
-		ret = repair_io_failure(fs_info, inum, offset, PAGE_SIZE,
-					fixup->logical, page,
-					offset - page_offset(page),
-					fixup->mirror_num);
-		unlock_page(page);
-		corrected = !ret;
-	} else {
-		/*
-		 * we need to get good data first. the general readpage path
-		 * will call repair_io_failure for us, we just have to make
-		 * sure we read the bad mirror.
-		 */
-		ret = set_extent_bits(&BTRFS_I(inode)->io_tree, offset, end,
-					EXTENT_DAMAGED);
-		if (ret) {
-			/* set_extent_bits should give proper error */
-			WARN_ON(ret > 0);
-			if (ret > 0)
-				ret = -EFAULT;
-			goto out;
-		}
-
-		ret = extent_read_full_page(&BTRFS_I(inode)->io_tree, page,
-						btrfs_get_extent,
-						fixup->mirror_num);
-		wait_on_page_locked(page);
-
-		corrected = !test_range_bit(&BTRFS_I(inode)->io_tree, offset,
-						end, EXTENT_DAMAGED, 0, NULL);
-		if (!corrected)
-			clear_extent_bits(&BTRFS_I(inode)->io_tree, offset, end,
-						EXTENT_DAMAGED);
-	}
-
-out:
-	if (page)
-		put_page(page);
-
-	iput(inode);
-
-	if (ret < 0)
-		return ret;
-
-	if (ret == 0 && corrected) {
-		/*
-		 * we only need to call readpage for one of the inodes belonging
-		 * to this extent. so make iterate_extent_inodes stop
-		 */
-		return 1;
-	}
-
-	return -EIO;
-}
-
-static void scrub_fixup_nodatasum(struct btrfs_work *work)
-{
-	struct btrfs_fs_info *fs_info;
-	int ret;
-	struct scrub_fixup_nodatasum *fixup;
-	struct scrub_ctx *sctx;
-	struct btrfs_trans_handle *trans = NULL;
-	struct btrfs_path *path;
-	int uncorrectable = 0;
-
-	fixup = container_of(work, struct scrub_fixup_nodatasum, work);
-	sctx = fixup->sctx;
-	fs_info = fixup->root->fs_info;
-
-	path = btrfs_alloc_path();
-	if (!path) {
-		spin_lock(&sctx->stat_lock);
-		++sctx->stat.malloc_errors;
-		spin_unlock(&sctx->stat_lock);
-		uncorrectable = 1;
-		goto out;
-	}
-
-	trans = btrfs_join_transaction(fixup->root);
-	if (IS_ERR(trans)) {
-		uncorrectable = 1;
-		goto out;
-	}
-
-	/*
-	 * the idea is to trigger a regular read through the standard path. we
-	 * read a page from the (failed) logical address by specifying the
-	 * corresponding copynum of the failed sector. thus, that readpage is
-	 * expected to fail.
-	 * that is the point where on-the-fly error correction will kick in
-	 * (once it's finished) and rewrite the failed sector if a good copy
-	 * can be found.
-	 */
-	ret = iterate_inodes_from_logical(fixup->logical, fs_info, path,
-					  scrub_fixup_readpage, fixup, false);
-	if (ret < 0) {
-		uncorrectable = 1;
-		goto out;
-	}
-	WARN_ON(ret != 1);
-
-	spin_lock(&sctx->stat_lock);
-	++sctx->stat.corrected_errors;
-	spin_unlock(&sctx->stat_lock);
-
-out:
-	if (trans && !IS_ERR(trans))
-		btrfs_end_transaction(trans);
-	if (uncorrectable) {
-		spin_lock(&sctx->stat_lock);
-		++sctx->stat.uncorrectable_errors;
-		spin_unlock(&sctx->stat_lock);
-		btrfs_dev_replace_stats_inc(
-			&fs_info->dev_replace.num_uncorrectable_read_errors);
-		btrfs_err_rl_in_rcu(fs_info,
-		    "unable to fixup (nodatasum) error at logical %llu on dev %s",
-			fixup->logical, rcu_str_deref(fixup->dev->name));
-	}
-
-	btrfs_free_path(path);
-	kfree(fixup);
-
-	scrub_pending_trans_workers_dec(sctx);
-}
-
 static inline void scrub_get_recover(struct scrub_recover *recover)
 {
 	refcount_inc(&recover->refs);
@@ -1263,43 +1066,6 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		goto out;
 	}
 
-	/*
-	 * NOTE: Even for nodatasum data case, it's still possible that it's
-	 * compressed data extent, thus scrub_fixup_nodatasum(), which
-	 * write inode page cache onto disk, could cause serious data
-	 * corruption.
-	 *
-	 * So here we could only read from disk, and hopes our recovery
-	 * could reach disk before newer write.
-	 */
-	if (0 && !is_metadata && !have_csum) {
-		struct scrub_fixup_nodatasum *fixup_nodatasum;
-
-		WARN_ON(sctx->is_dev_replace);
-
-		/*
-		 * !is_metadata and !have_csum, this means that the data
-		 * might not be COWed, that it might be modified
-		 * concurrently. The general strategy to work on the
-		 * commit root does not help in the case when COW is not
-		 * used.
-		 */
-		fixup_nodatasum = kzalloc(sizeof(*fixup_nodatasum), GFP_NOFS);
-		if (!fixup_nodatasum)
-			goto did_not_correct_error;
-		fixup_nodatasum->sctx = sctx;
-		fixup_nodatasum->dev = dev;
-		fixup_nodatasum->logical = logical;
-		fixup_nodatasum->root = fs_info->extent_root;
-		fixup_nodatasum->mirror_num = failed_mirror_index + 1;
-		scrub_pending_trans_workers_inc(sctx);
-		btrfs_init_work(&fixup_nodatasum->work, btrfs_scrub_helper,
-				scrub_fixup_nodatasum, NULL, NULL);
-		btrfs_queue_work(fs_info->scrub_workers,
-				 &fixup_nodatasum->work);
-		goto out;
-	}
-
 	/*
 	 * now build and submit the bios for the other mirrors, check
 	 * checksums.
-- 
2.18.0



* Re: [PATCH RFC 1/2] btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block()
  2018-07-10  5:57 [PATCH RFC 1/2] btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block() Qu Wenruo
  2018-07-10  5:57 ` [PATCH RFC 2/2] btrfs: scrub: Cleanup the nodatasum fixup code Qu Wenruo
@ 2018-07-10  7:00 ` Nikolay Borisov
  2018-07-10  7:14   ` Qu Wenruo
  1 sibling, 1 reply; 7+ messages in thread
From: Nikolay Borisov @ 2018-07-10  7:00 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 10.07.2018 08:57, Qu Wenruo wrote:
> When we need to fix up error blocks during scrub/dev-replace for
> nodatasum extents, we still go through the inode page cache and write
> the pages back onto disk.
> 
> It has already been proven that writing the inode page cache back onto
> disk like this can lead to serious data corruption for compressed
> extents.
> So we also need to avoid that case here, by avoiding any call to
> scrub_fixup_nodatasum().
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Is this patch supposed to replace the 'if (0 ...' one that is in 4.18?
If not, then the small fix should be reverted, since we are introducing
a regression in 4.18, and your full fix should go into 4.19.

I have kind of lost track of the whole nodatasum mess, and your
references are really vague. If you have described somewhere (e.g. in a
mailing list post or in another, merged patch) how the corruption
happens when we use the page cache, then put a reference in the
changelog - either a commit id or a Link: tag for the email thread.
> ---
>  fs/btrfs/scrub.c | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 572306036477..328232fa5646 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -1151,11 +1151,6 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>  		return ret;
>  	}
>  
> -	if (sctx->is_dev_replace && !is_metadata && !have_csum) {
> -		sblocks_for_recheck = NULL;
> -		goto nodatasum_case;
> -	}
> -
>  	/*
>  	 * read all mirrors one after the other. This includes to
>  	 * re-read the extent or metadata block that failed (that was
> @@ -1268,13 +1263,20 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>  		goto out;
>  	}
>  
> -	if (!is_metadata && !have_csum) {
> +	/*
> +	 * NOTE: Even for nodatasum data case, it's still possible that it's
> +	 * compressed data extent, thus scrub_fixup_nodatasum(), which
> +	 * write inode page cache onto disk, could cause serious data
> +	 * corruption.
> +	 *
> +	 * So here we could only read from disk, and hopes our recovery
> +	 * could reach disk before newer write.

If this code relies on 'hope' then I NACK it, since hope cannot be
quantified, and hence the effect of this patch cannot be quantified. It
either fixes the problem or it doesn't. Sorry, but given this, coupled
with the vague changelog, I'd rather not have it committed.

> +	 */
> +	if (0 && !is_metadata && !have_csum) {
>  		struct scrub_fixup_nodatasum *fixup_nodatasum;
>  
>  		WARN_ON(sctx->is_dev_replace);
>  
> -nodatasum_case:
> -
>  		/*
>  		 * !is_metadata and !have_csum, this means that the data
>  		 * might not be COWed, that it might be modified
> 


* Re: [PATCH RFC 1/2] btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block()
  2018-07-10  7:00 ` [PATCH RFC 1/2] btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block() Nikolay Borisov
@ 2018-07-10  7:14   ` Qu Wenruo
  2018-07-10  7:31     ` Nikolay Borisov
  0 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2018-07-10  7:14 UTC (permalink / raw)
  To: Nikolay Borisov, Qu Wenruo, linux-btrfs



On 2018-07-10 15:00, Nikolay Borisov wrote:
> 
> 
> On 10.07.2018 08:57, Qu Wenruo wrote:
>> When we need to fix up error blocks during scrub/dev-replace for
>> nodatasum extents, we still go through the inode page cache and write
>> the pages back onto disk.
>>
>> It has already been proven that writing the inode page cache back onto
>> disk like this can lead to serious data corruption for compressed
>> extents.
>> So we also need to avoid that case here, by avoiding any call to
>> scrub_fixup_nodatasum().
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> 
> Is this patch supposed to replace the 'if (0 ...' one that is in 4.18 -
> if not, then the small fix should be reverted as we are introducing a
> regression in 4.18 and then your full fix should go into 4.19.

It's an enhancement to the original 4.18 fix.
The fix in 4.18 is causing a regression because I missed that there is
another routine which does similar (but cruder) work, at a different
time.

> 
> I kind of lost the end of the whole nodatasum mess but your references
> are really vague, if you have described somewhere (i.e in a mailing list
> post or another, merged patch) how the corruption happens if we use the
> page cache then put a reference in the changelog - either with a commit
> id or use the Link: tag for the email thread.

For the whole mess, it's complex due to the complexity of scrub/replace.

The simple workflow is:
scrub main work (prepares the needed info for the corresponding workqueue)
|- copy_nocow_pages() if it's nodatasum and we are doing replace
|- scrub_pages() all other cases go here

The v4.18 fix solves the problem by removing copy_nocow_pages().

However, after the scrub main work, we have a fixup worker to handle
read/csum failures:
scrub_handle_errored_block()
|- scrub_fixup_nodatasum() if it's nodatasum
|- Or build bios itself

scrub_fixup_nodatasum() iterates over each inode referring to the
extent and tries to use the inode page cache.
The problem is that scrub_fixup_nodatasum() does not do this correctly;
its inode/extent locking is not even as good as copy_nocow_pages() and
can cause a lot of lockdep warnings in tests like btrfs/100.

I should have detected the problem and deleted this code in the first
fix, but since it's only the fixup worker I thought the priority was
not that high, until the v4.18 fix turned out to cause a regression.


As for the description, it's a little hard to explain (well, I just
explained it above).
I'll try to make it more detailed in the commit message.
However, since this happens at a different time, I can't just refer to
an existing test case/patch to explain it.

Thanks,
Qu

>> ---
>>  fs/btrfs/scrub.c | 18 ++++++++++--------
>>  1 file changed, 10 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> index 572306036477..328232fa5646 100644
>> --- a/fs/btrfs/scrub.c
>> +++ b/fs/btrfs/scrub.c
>> @@ -1151,11 +1151,6 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>>  		return ret;
>>  	}
>>  
>> -	if (sctx->is_dev_replace && !is_metadata && !have_csum) {
>> -		sblocks_for_recheck = NULL;
>> -		goto nodatasum_case;
>> -	}
>> -
>>  	/*
>>  	 * read all mirrors one after the other. This includes to
>>  	 * re-read the extent or metadata block that failed (that was
>> @@ -1268,13 +1263,20 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>>  		goto out;
>>  	}
>>  
>> -	if (!is_metadata && !have_csum) {
>> +	/*
>> +	 * NOTE: Even for nodatasum data case, it's still possible that it's
>> +	 * compressed data extent, thus scrub_fixup_nodatasum(), which
>> +	 * write inode page cache onto disk, could cause serious data
>> +	 * corruption.
>> +	 *
>> +	 * So here we could only read from disk, and hopes our recovery
>> +	 * could reach disk before newer write.
> 
> If this code relies on 'hope' then I NACK it since hope cannot be
> quantified hence the effect of this patch cannot be quantified. It
> either fixes the problem or doesn't fix it. Sorry but this, coupled with
> the vague changelog I'd rather not have it committed.
> 
>> +	 */
>> +	if (0 && !is_metadata && !have_csum) {
>>  		struct scrub_fixup_nodatasum *fixup_nodatasum;
>>  
>>  		WARN_ON(sctx->is_dev_replace);
>>  
>> -nodatasum_case:
>> -
>>  		/*
>>  		 * !is_metadata and !have_csum, this means that the data
>>  		 * might not be COWed, that it might be modified
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


* Re: [PATCH RFC 1/2] btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block()
  2018-07-10  7:14   ` Qu Wenruo
@ 2018-07-10  7:31     ` Nikolay Borisov
  0 siblings, 0 replies; 7+ messages in thread
From: Nikolay Borisov @ 2018-07-10  7:31 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs



On 10.07.2018 10:14, Qu Wenruo wrote:
> 
> 
> On 2018-07-10 15:00, Nikolay Borisov wrote:
>>
>>
>> On 10.07.2018 08:57, Qu Wenruo wrote:
>>> When we need to fix up error blocks during scrub/dev-replace for
>>> nodatasum extents, we still go through the inode page cache and write
>>> the pages back onto disk.
>>>
>>> It has already been proven that writing the inode page cache back onto
>>> disk like this can lead to serious data corruption for compressed
>>> extents.
>>> So we also need to avoid that case here, by avoiding any call to
>>> scrub_fixup_nodatasum().
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>
>> Is this patch supposed to replace the 'if (0 ...' one that is in 4.18 -
>> if not, then the small fix should be reverted as we are introducing a
>> regression in 4.18 and then your full fix should go into 4.19.
> 
> It's an enhancement to the original 4.18 fix.
> The fix in 4.18 is causing regression because I missed there is another
> routine which could do the similar work (but more lame) but at different
> timing.
> 
>>
>> I kind of lost the end of the whole nodatasum mess but your references
>> are really vague, if you have described somewhere (i.e in a mailing list
>> post or another, merged patch) how the corruption happens if we use the
>> page cache then put a reference in the changelog - either with a commit
>> id or use the Link: tag for the email thread.
> 
> For the whole mess, it's complex due to the complexity of scrub/replace.
> 
> The simple workflow is:
> scrub main work (prepare needed info for corresponding workqueue)
> |- copy_nocow_pages() if it's nodatasum and is doing replace
> |- scrub_pages() all other cases goes here
> 
> The v4.18 fixes the problem by removing copy_nocow_pages().
> 
> However after scrub main work, we have fixup worker to handle read/csum
> failure:
> scrub_handle_errored_block()
> |- scrub_fixup_nodatasum() if it's nodatasum
> |- Or build bios itself
> 
> In scrub_fixup_nodatasum() it iterates each inode referring the extent
> and tries to use inode page cache.
> The problem is, scrub_fixup_nodatasum() is not doing it correctly, inode
> extent lock is not even as good as copy_nocow_pages() and could cause a
> lot of lockdep warning in tests like btrfs/100.
> 
> I should detect the problem and delete it in the first fix, but since
> it's fixup worker I though the priority is not that high, until the
> v4.18 fix is causing regression.
> 
> 
> And for the description, it's a little hard to explain (well, just
> explained above).
> I'll try to make it more detailed in commit message.

Please do, because if it's hard even for you, who understand the code,
imagine someone else who is going to work on it 6 months from now.
That's why we need detailed and correct changelogs.

> However since it's happening in a different timing, I can't just refer
> an existing test case/patch to explain this.
> 
> Thanks,
> Qu
> 
>>> ---
>>>  fs/btrfs/scrub.c | 18 ++++++++++--------
>>>  1 file changed, 10 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>>> index 572306036477..328232fa5646 100644
>>> --- a/fs/btrfs/scrub.c
>>> +++ b/fs/btrfs/scrub.c
>>> @@ -1151,11 +1151,6 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>>>  		return ret;
>>>  	}
>>>  
>>> -	if (sctx->is_dev_replace && !is_metadata && !have_csum) {
>>> -		sblocks_for_recheck = NULL;
>>> -		goto nodatasum_case;
>>> -	}
>>> -
>>>  	/*
>>>  	 * read all mirrors one after the other. This includes to
>>>  	 * re-read the extent or metadata block that failed (that was
>>> @@ -1268,13 +1263,20 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>>>  		goto out;
>>>  	}
>>>  
>>> -	if (!is_metadata && !have_csum) {
>>> +	/*
>>> +	 * NOTE: Even for nodatasum data case, it's still possible that it's
>>> +	 * compressed data extent, thus scrub_fixup_nodatasum(), which
>>> +	 * write inode page cache onto disk, could cause serious data
>>> +	 * corruption.
>>> +	 *
>>> +	 * So here we could only read from disk, and hopes our recovery
>>> +	 * could reach disk before newer write.
>>
>> If this code relies on 'hope' then I NACK it since hope cannot be
>> quantified hence the effect of this patch cannot be quantified. It
>> either fixes the problem or doesn't fix it. Sorry but this, coupled with
>> the vague changelog I'd rather not have it committed.
>>
>>> +	 */
>>> +	if (0 && !is_metadata && !have_csum) {
>>>  		struct scrub_fixup_nodatasum *fixup_nodatasum;
>>>  
>>>  		WARN_ON(sctx->is_dev_replace);
>>>  
>>> -nodatasum_case:
>>> -
>>>  		/*
>>>  		 * !is_metadata and !have_csum, this means that the data
>>>  		 * might not be COWed, that it might be modified
>>>


* [PATCH RFC 2/2] btrfs: scrub: Cleanup the nodatasum fixup code
  2018-07-11  5:41 [PATCH RFC v2 " Qu Wenruo
@ 2018-07-11  5:41 ` Qu Wenruo
  2018-07-19 14:01   ` David Sterba
  0 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2018-07-11  5:41 UTC (permalink / raw)
  To: linux-btrfs

Since we no longer use the old method of fixing up nodatasum errors
during scrub/replace, which used the inode page cache and could cause
serious data corruption for compressed nodatasum extents, remove all
the related code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 234 -----------------------------------------------
 1 file changed, 234 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 328232fa5646..8580b53fcf32 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -188,15 +188,6 @@ struct scrub_ctx {
 	refcount_t              refs;
 };
 
-struct scrub_fixup_nodatasum {
-	struct scrub_ctx	*sctx;
-	struct btrfs_device	*dev;
-	u64			logical;
-	struct btrfs_root	*root;
-	struct btrfs_work	work;
-	int			mirror_num;
-};
-
 struct scrub_nocow_inode {
 	u64			inum;
 	u64			offset;
@@ -882,194 +873,6 @@ static void scrub_print_warning(const char *errstr, struct scrub_block *sblock)
 	btrfs_free_path(path);
 }
 
-static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx)
-{
-	struct page *page = NULL;
-	unsigned long index;
-	struct scrub_fixup_nodatasum *fixup = fixup_ctx;
-	int ret;
-	int corrected = 0;
-	struct btrfs_key key;
-	struct inode *inode = NULL;
-	struct btrfs_fs_info *fs_info;
-	u64 end = offset + PAGE_SIZE - 1;
-	struct btrfs_root *local_root;
-	int srcu_index;
-
-	key.objectid = root;
-	key.type = BTRFS_ROOT_ITEM_KEY;
-	key.offset = (u64)-1;
-
-	fs_info = fixup->root->fs_info;
-	srcu_index = srcu_read_lock(&fs_info->subvol_srcu);
-
-	local_root = btrfs_read_fs_root_no_name(fs_info, &key);
-	if (IS_ERR(local_root)) {
-		srcu_read_unlock(&fs_info->subvol_srcu, srcu_index);
-		return PTR_ERR(local_root);
-	}
-
-	key.type = BTRFS_INODE_ITEM_KEY;
-	key.objectid = inum;
-	key.offset = 0;
-	inode = btrfs_iget(fs_info->sb, &key, local_root, NULL);
-	srcu_read_unlock(&fs_info->subvol_srcu, srcu_index);
-	if (IS_ERR(inode))
-		return PTR_ERR(inode);
-
-	index = offset >> PAGE_SHIFT;
-
-	page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
-	if (!page) {
-		ret = -ENOMEM;
-		goto out;
-	}
-
-	if (PageUptodate(page)) {
-		if (PageDirty(page)) {
-			/*
-			 * we need to write the data to the defect sector. the
-			 * data that was in that sector is not in memory,
-			 * because the page was modified. we must not write the
-			 * modified page to that sector.
-			 *
-			 * TODO: what could be done here: wait for the delalloc
-			 *       runner to write out that page (might involve
-			 *       COW) and see whether the sector is still
-			 *       referenced afterwards.
-			 *
-			 * For the meantime, we'll treat this error
-			 * incorrectable, although there is a chance that a
-			 * later scrub will find the bad sector again and that
-			 * there's no dirty page in memory, then.
-			 */
-			ret = -EIO;
-			goto out;
-		}
-		ret = repair_io_failure(fs_info, inum, offset, PAGE_SIZE,
-					fixup->logical, page,
-					offset - page_offset(page),
-					fixup->mirror_num);
-		unlock_page(page);
-		corrected = !ret;
-	} else {
-		/*
-		 * we need to get good data first. the general readpage path
-		 * will call repair_io_failure for us, we just have to make
-		 * sure we read the bad mirror.
-		 */
-		ret = set_extent_bits(&BTRFS_I(inode)->io_tree, offset, end,
-					EXTENT_DAMAGED);
-		if (ret) {
-			/* set_extent_bits should give proper error */
-			WARN_ON(ret > 0);
-			if (ret > 0)
-				ret = -EFAULT;
-			goto out;
-		}
-
-		ret = extent_read_full_page(&BTRFS_I(inode)->io_tree, page,
-						btrfs_get_extent,
-						fixup->mirror_num);
-		wait_on_page_locked(page);
-
-		corrected = !test_range_bit(&BTRFS_I(inode)->io_tree, offset,
-						end, EXTENT_DAMAGED, 0, NULL);
-		if (!corrected)
-			clear_extent_bits(&BTRFS_I(inode)->io_tree, offset, end,
-						EXTENT_DAMAGED);
-	}
-
-out:
-	if (page)
-		put_page(page);
-
-	iput(inode);
-
-	if (ret < 0)
-		return ret;
-
-	if (ret == 0 && corrected) {
-		/*
-		 * we only need to call readpage for one of the inodes belonging
-		 * to this extent. so make iterate_extent_inodes stop
-		 */
-		return 1;
-	}
-
-	return -EIO;
-}
-
-static void scrub_fixup_nodatasum(struct btrfs_work *work)
-{
-	struct btrfs_fs_info *fs_info;
-	int ret;
-	struct scrub_fixup_nodatasum *fixup;
-	struct scrub_ctx *sctx;
-	struct btrfs_trans_handle *trans = NULL;
-	struct btrfs_path *path;
-	int uncorrectable = 0;
-
-	fixup = container_of(work, struct scrub_fixup_nodatasum, work);
-	sctx = fixup->sctx;
-	fs_info = fixup->root->fs_info;
-
-	path = btrfs_alloc_path();
-	if (!path) {
-		spin_lock(&sctx->stat_lock);
-		++sctx->stat.malloc_errors;
-		spin_unlock(&sctx->stat_lock);
-		uncorrectable = 1;
-		goto out;
-	}
-
-	trans = btrfs_join_transaction(fixup->root);
-	if (IS_ERR(trans)) {
-		uncorrectable = 1;
-		goto out;
-	}
-
-	/*
-	 * the idea is to trigger a regular read through the standard path. we
-	 * read a page from the (failed) logical address by specifying the
-	 * corresponding copynum of the failed sector. thus, that readpage is
-	 * expected to fail.
-	 * that is the point where on-the-fly error correction will kick in
-	 * (once it's finished) and rewrite the failed sector if a good copy
-	 * can be found.
-	 */
-	ret = iterate_inodes_from_logical(fixup->logical, fs_info, path,
-					  scrub_fixup_readpage, fixup, false);
-	if (ret < 0) {
-		uncorrectable = 1;
-		goto out;
-	}
-	WARN_ON(ret != 1);
-
-	spin_lock(&sctx->stat_lock);
-	++sctx->stat.corrected_errors;
-	spin_unlock(&sctx->stat_lock);
-
-out:
-	if (trans && !IS_ERR(trans))
-		btrfs_end_transaction(trans);
-	if (uncorrectable) {
-		spin_lock(&sctx->stat_lock);
-		++sctx->stat.uncorrectable_errors;
-		spin_unlock(&sctx->stat_lock);
-		btrfs_dev_replace_stats_inc(
-			&fs_info->dev_replace.num_uncorrectable_read_errors);
-		btrfs_err_rl_in_rcu(fs_info,
-		    "unable to fixup (nodatasum) error at logical %llu on dev %s",
-			fixup->logical, rcu_str_deref(fixup->dev->name));
-	}
-
-	btrfs_free_path(path);
-	kfree(fixup);
-
-	scrub_pending_trans_workers_dec(sctx);
-}
-
 static inline void scrub_get_recover(struct scrub_recover *recover)
 {
 	refcount_inc(&recover->refs);
@@ -1263,43 +1066,6 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		goto out;
 	}
 
-	/*
-	 * NOTE: Even for nodatasum data case, it's still possible that it's
-	 * compressed data extent, thus scrub_fixup_nodatasum(), which
-	 * write inode page cache onto disk, could cause serious data
-	 * corruption.
-	 *
-	 * So here we could only read from disk, and hopes our recovery
-	 * could reach disk before newer write.
-	 */
-	if (0 && !is_metadata && !have_csum) {
-		struct scrub_fixup_nodatasum *fixup_nodatasum;
-
-		WARN_ON(sctx->is_dev_replace);
-
-		/*
-		 * !is_metadata and !have_csum, this means that the data
-		 * might not be COWed, that it might be modified
-		 * concurrently. The general strategy to work on the
-		 * commit root does not help in the case when COW is not
-		 * used.
-		 */
-		fixup_nodatasum = kzalloc(sizeof(*fixup_nodatasum), GFP_NOFS);
-		if (!fixup_nodatasum)
-			goto did_not_correct_error;
-		fixup_nodatasum->sctx = sctx;
-		fixup_nodatasum->dev = dev;
-		fixup_nodatasum->logical = logical;
-		fixup_nodatasum->root = fs_info->extent_root;
-		fixup_nodatasum->mirror_num = failed_mirror_index + 1;
-		scrub_pending_trans_workers_inc(sctx);
-		btrfs_init_work(&fixup_nodatasum->work, btrfs_scrub_helper,
-				scrub_fixup_nodatasum, NULL, NULL);
-		btrfs_queue_work(fs_info->scrub_workers,
-				 &fixup_nodatasum->work);
-		goto out;
-	}
-
 	/*
 	 * now build and submit the bios for the other mirrors, check
 	 * checksums.
-- 
2.18.0



* Re: [PATCH RFC 2/2] btrfs: scrub: Cleanup the nodatasum fixup code
  2018-07-11  5:41 ` [PATCH RFC 2/2] btrfs: scrub: Cleanup the nodatasum fixup code Qu Wenruo
@ 2018-07-19 14:01   ` David Sterba
  0 siblings, 0 replies; 7+ messages in thread
From: David Sterba @ 2018-07-19 14:01 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Jul 11, 2018 at 01:41:22PM +0800, Qu Wenruo wrote:
> Since we no longer use the old method of fixing up nodatasum errors
> during scrub/replace, which used the inode page cache and could cause
> serious data corruption for compressed nodatasum extents, remove all
> the related code.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

The compiler warns that scrub_pending_trans_workers_inc and
scrub_pending_trans_workers_dec are now unused, and it is right: the
last callers got removed by this patch. I'll delete that code as well.

