Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Goffredo Baroncelli <kreijack@inwind.it>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Cc: linux-btrfs@vger.kernel.org, ce3g8jdj@umail.furryterror.org
Subject: Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q
Date: Mon, 21 Nov 2016 19:48:51 +0100	[thread overview]
Message-ID: <94606bda-dab0-e7c9-7fc6-1af9069b64fc@inwind.it> (raw)
In-Reply-To: <20161121085016.7148-1-quwenruo@cn.fujitsu.com>

[-- Attachment #1: Type: text/plain, Size: 6741 bytes --]

Hi Qu,

I tested this succefully for RAID5 when doing a scrub (i.e.: I mount a corrupted disks, then I ran "btrfs scrub start ...", then I check the disks).

However if I do a "cat mnt/out.txt" (out.txt is the corrupted file):
1) the system detect that the file is corrupted   (good :) )
2) the system return the correct file content     (good :) )
3) the data on the platter are still wrong        (no good :( )


Enclosed the script which reproduces the problem. Note that:
If I corrupt the data, in the dmesg two time appears a line which says:

[ 3963.763384] BTRFS warning (device loop2): csum failed ino 257 off 0 csum 2280586218 expected csum 3192393815
[ 3963.766927] BTRFS warning (device loop2): csum failed ino 257 off 0 csum 2280586218 expected csum 3192393815

If I corrupt the parity, of course the system doesn't detect the corruption nor try to correct it. But this is the expected behavior.
 
BR
G.Baroncelli



On 2016-11-21 09:50, Qu Wenruo wrote:
> In the following situation, scrub will calculate wrong parity to
> overwrite correct one:
> 
> RAID5 full stripe:
> 
> Before
> |     Dev 1      |     Dev  2     |     Dev 3     |
> | Data stripe 1  | Data stripe 2  | Parity Stripe |
> --------------------------------------------------- 0
> | 0x0000 (Bad)   |     0xcdcd     |     0x0000    |
> --------------------------------------------------- 4K
> |     0xcdcd     |     0xcdcd     |     0x0000    |
> ...
> |     0xcdcd     |     0xcdcd     |     0x0000    |
> --------------------------------------------------- 64K
> 
> After scrubbing dev3 only:
> 
> |     Dev 1      |     Dev  2     |     Dev 3     |
> | Data stripe 1  | Data stripe 2  | Parity Stripe |
> --------------------------------------------------- 0
> | 0xcdcd (Good)  |     0xcdcd     | 0xcdcd (Bad)  |
> --------------------------------------------------- 4K
> |     0xcdcd     |     0xcdcd     |     0x0000    |
> ...
> |     0xcdcd     |     0xcdcd     |     0x0000    |
> --------------------------------------------------- 64K
> 
> The calltrace of such corruption is as following:
> 
> scrub_bio_end_io_worker() get called for each extent read out
> |- scriub_block_complete()
>    |- Data extent csum mismatch
>    |- scrub_handle_errored_block
>       |- scrub_recheck_block()
>          |- scrub_submit_raid56_bio_wait()
>             |- raid56_parity_recover()
> 
> Now we have a rbio with correct data stripe 1 recovered.
> Let's call it "good_rbio".
> 
> scrub_parity_check_and_repair()
> |- raid56_parity_submit_scrub_rbio()
>    |- lock_stripe_add()
>    |  |- steal_rbio()
>    |     |- Recovered data are steal from "good_rbio", stored into
>    |        rbio->stripe_pages[]
>    |        Now rbio->bio_pages[] are bad data read from disk.
>    |- async_scrub_parity()
>       |- scrub_parity_work() (delayed_call to scrub_parity_work)
> 
> scrub_parity_work()
> |- raid56_parity_scrub_stripe()
>    |- validate_rbio_for_parity_scrub()
>       |- finish_parity_scrub()
>          |- Recalculate parity using *BAD* pages in rbio->bio_pages[]
>             So good parity is overwritten with *BAD* one
> 
> The fix is to introduce 2 new members, bad_ondisk_a/b, to struct
> btrfs_raid_bio, to info scrub code to use correct data pages to
> re-calculate parity.
> 
> Reported-by: Goffredo Baroncelli <kreijack@inwind.it>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
> Thanks to the above hell of delayed function all and damn stupid code
> logical, such bug is quite hard to trace.
> 
> The damn kernel scrub is already multi-thread, why do such meaningless
> delayed function call again and again?
> 
> What's wrong with single thread scrub?
> We can do thing like in each stripe for raid56 which is easy and
> straightforward, only delayed thing is to wake up waiter:
> 
> 	lock_full_stripe()
> 	if (!is_parity_stripe()) {
> 		prepare_data_stripe_bios()
> 		submit_and_wait_bios()
> 		if (check_csum() == 0)
> 			goto out;
> 	}
> 	prepare_full_stripe_bios()
> 	submit_and_wait_bios()
> 
> 	recover_raid56_stipres();
> 	prepare_full_stripe_write_bios()
> 	submit_and_wait_bios()
> 
> out:
>   	unlock_full_stripe()
> 
> We really need to re-work the whole damn scrub code.
> 
> Also, we need to enhance btrfs-progs to detect scrub problem(my
> submitted offline scrub is good enough for such usage), and tools to
> corrupt extents reliably to put it into xfstests test cases.
> 
> RAID56 scrub code is neither tested nor well-designed.
> ---
>  fs/btrfs/raid56.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index d016d4a..87e3565 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -133,6 +133,16 @@ struct btrfs_raid_bio {
>  	/* second bad stripe (for raid6 use) */
>  	int failb;
>  
> +	/*
> +	 * For steal_rbio, we can steal recovered correct page,
> +	 * but in finish_parity_scrub(), we still use bad on-disk
> +	 * page to calculate parity.
> +	 * Use these members to info finish_parity_scrub() to use
> +	 * correct pages
> +	 */
> +	int bad_ondisk_a;
> +	int bad_ondisk_b;
> +
>  	int scrubp;
>  	/*
>  	 * number of pages needed to represent the full
> @@ -310,6 +320,12 @@ static void steal_rbio(struct btrfs_raid_bio *src, struct btrfs_raid_bio *dest)
>  	if (!test_bit(RBIO_CACHE_READY_BIT, &src->flags))
>  		return;
>  
> +	/* Record recovered stripe number */
> +	if (src->faila != -1)
> +		dest->bad_ondisk_a = src->faila;
> +	if (src->failb != -1)
> +		dest->bad_ondisk_b = src->failb;
> +
>  	for (i = 0; i < dest->nr_pages; i++) {
>  		s = src->stripe_pages[i];
>  		if (!s || !PageUptodate(s)) {
> @@ -998,6 +1014,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
>  	rbio->stripe_npages = stripe_npages;
>  	rbio->faila = -1;
>  	rbio->failb = -1;
> +	rbio->bad_ondisk_a = -1;
> +	rbio->bad_ondisk_b = -1;
>  	atomic_set(&rbio->refs, 1);
>  	atomic_set(&rbio->error, 0);
>  	atomic_set(&rbio->stripes_pending, 0);
> @@ -2352,7 +2370,16 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
>  		void *parity;
>  		/* first collect one page from each data stripe */
>  		for (stripe = 0; stripe < nr_data; stripe++) {
> -			p = page_in_rbio(rbio, stripe, pagenr, 0);
> +
> +			/*
> +			 * Use stolen recovered page other than bad
> +			 * on disk pages
> +			 */
> +			if (stripe == rbio->bad_ondisk_a ||
> +			    stripe == rbio->bad_ondisk_b)
> +				p = rbio_stripe_page(rbio, stripe, pagenr);
> +			else
> +				p = page_in_rbio(rbio, stripe, pagenr, 0);
>  			pointers[stripe] = kmap(p);
>  		}
>  
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

[-- Attachment #2: do.sh --]
[-- Type: application/x-sh, Size: 3182 bytes --]

next prev parent reply	other threads:[~2016-11-21 18:48 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-11-21  8:50 [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q Qu Wenruo
2016-11-21 18:48 ` Goffredo Baroncelli [this message]
2016-11-22  0:28   ` Qu Wenruo
2016-11-22 18:02     ` Goffredo Baroncelli
2016-11-25  4:31       ` Zygo Blaxell
2016-11-25  4:40         ` Gareth Pye
2016-11-25  5:07           ` Zygo Blaxell
2016-11-26 13:12         ` Goffredo Baroncelli
2016-11-26 18:54           ` Zygo Blaxell
2016-11-26 23:16             ` Goffredo Baroncelli
2016-11-27 16:53               ` Zygo Blaxell
2016-11-28  0:40               ` Qu Wenruo
2016-11-28 18:45                 ` Goffredo Baroncelli
2016-11-28 19:01                   ` Christoph Anton Mitterer
2016-11-28 19:39                     ` Austin S. Hemmelgarn
2016-11-28  3:37           ` Christoph Anton Mitterer
2016-11-28  3:53             ` Andrei Borzenkov
2016-11-28  4:01               ` Christoph Anton Mitterer
2016-11-28 18:32             ` Goffredo Baroncelli
2016-11-28 19:00               ` Christoph Anton Mitterer
2016-11-28 21:48               ` Zygo Blaxell
2016-11-29  1:52                 ` Christoph Anton Mitterer
2016-11-29  3:19                   ` Zygo Blaxell
2016-11-29  7:35                   ` Adam Borowski
2016-11-29 14:24                     ` Christoph Anton Mitterer
2016-11-22 18:58 ` Chris Mason
2016-11-23  0:26   ` Qu Wenruo
2016-11-26 17:18     ` Chris Mason

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=94606bda-dab0-e7c9-7fc6-1af9069b64fc@inwind.it \
    --to=kreijack@inwind.it \
    --cc=ce3g8jdj@umail.furryterror.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=quwenruo@cn.fujitsu.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).