From: Goffredo Baroncelli <kreijack@inwind.it>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Cc: linux-btrfs@vger.kernel.org, ce3g8jdj@umail.furryterror.org
Subject: Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q
Date: Tue, 22 Nov 2016 19:02:13 +0100 [thread overview]
Message-ID: <a75c9a72-148c-9fbd-dfb8-7cde58bee9c9@inwind.it> (raw)
In-Reply-To: <f814eb1b-844b-2ace-c948-3be20da2fd29@cn.fujitsu.com>
On 2016-11-22 01:28, Qu Wenruo wrote:
>
>
> At 11/22/2016 02:48 AM, Goffredo Baroncelli wrote:
>> Hi Qu,
>>
>> I tested this succefully for RAID5 when doing a scrub (i.e.: I mount a corrupted disks, then I ran "btrfs scrub start ...", then I check the disks).
>>
>> However if I do a "cat mnt/out.txt" (out.txt is the corrupted file):
>> 1) the system detect that the file is corrupted (good :) )
>> 2) the system return the correct file content (good :) )
>> 3) the data on the platter are still wrong (no good :( )
>
> Do you mean, read the corrupted data won't repair it?
>
> IIRC that's the designed behavior.
:O
You are right... I was unaware of that....
So you can add a "tested-by: Goffredo Baroncelli <kreijack@inwind.it>"
BR
G.Baroncelli
>
> For RAID5/6 read, there are several different mode, like READ_REBUILD or SCRUB_PARITY.
>
> I'm not sure for write, but for read it won't write correct data.
>
> So it's a designed behavior if I don't miss something.
>
> Thanks,
> Qu
>
>>
>>
>> Enclosed the script which reproduces the problem. Note that:
>> If I corrupt the data, in the dmesg two time appears a line which says:
>>
>> [ 3963.763384] BTRFS warning (device loop2): csum failed ino 257 off 0 csum 2280586218 expected csum 3192393815
>> [ 3963.766927] BTRFS warning (device loop2): csum failed ino 257 off 0 csum 2280586218 expected csum 3192393815
>>
>> If I corrupt the parity, of course the system doesn't detect the corruption nor try to correct it. But this is the expected behavior.
>>
>> BR
>> G.Baroncelli
>>
>>
>>
>> On 2016-11-21 09:50, Qu Wenruo wrote:
>>> In the following situation, scrub will calculate wrong parity to
>>> overwrite correct one:
>>>
>>> RAID5 full stripe:
>>>
>>> Before
>>> | Dev 1 | Dev 2 | Dev 3 |
>>> | Data stripe 1 | Data stripe 2 | Parity Stripe |
>>> --------------------------------------------------- 0
>>> | 0x0000 (Bad) | 0xcdcd | 0x0000 |
>>> --------------------------------------------------- 4K
>>> | 0xcdcd | 0xcdcd | 0x0000 |
>>> ...
>>> | 0xcdcd | 0xcdcd | 0x0000 |
>>> --------------------------------------------------- 64K
>>>
>>> After scrubbing dev3 only:
>>>
>>> | Dev 1 | Dev 2 | Dev 3 |
>>> | Data stripe 1 | Data stripe 2 | Parity Stripe |
>>> --------------------------------------------------- 0
>>> | 0xcdcd (Good) | 0xcdcd | 0xcdcd (Bad) |
>>> --------------------------------------------------- 4K
>>> | 0xcdcd | 0xcdcd | 0x0000 |
>>> ...
>>> | 0xcdcd | 0xcdcd | 0x0000 |
>>> --------------------------------------------------- 64K
>>>
>>> The calltrace of such corruption is as following:
>>>
>>> scrub_bio_end_io_worker() get called for each extent read out
>>> |- scriub_block_complete()
>>> |- Data extent csum mismatch
>>> |- scrub_handle_errored_block
>>> |- scrub_recheck_block()
>>> |- scrub_submit_raid56_bio_wait()
>>> |- raid56_parity_recover()
>>>
>>> Now we have a rbio with correct data stripe 1 recovered.
>>> Let's call it "good_rbio".
>>>
>>> scrub_parity_check_and_repair()
>>> |- raid56_parity_submit_scrub_rbio()
>>> |- lock_stripe_add()
>>> | |- steal_rbio()
>>> | |- Recovered data are steal from "good_rbio", stored into
>>> | rbio->stripe_pages[]
>>> | Now rbio->bio_pages[] are bad data read from disk.
>>> |- async_scrub_parity()
>>> |- scrub_parity_work() (delayed_call to scrub_parity_work)
>>>
>>> scrub_parity_work()
>>> |- raid56_parity_scrub_stripe()
>>> |- validate_rbio_for_parity_scrub()
>>> |- finish_parity_scrub()
>>> |- Recalculate parity using *BAD* pages in rbio->bio_pages[]
>>> So good parity is overwritten with *BAD* one
>>>
>>> The fix is to introduce 2 new members, bad_ondisk_a/b, to struct
>>> btrfs_raid_bio, to info scrub code to use correct data pages to
>>> re-calculate parity.
>>>
>>> Reported-by: Goffredo Baroncelli <kreijack@inwind.it>
>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>> ---
>>> Thanks to the above hell of delayed function all and damn stupid code
>>> logical, such bug is quite hard to trace.
>>>
>>> The damn kernel scrub is already multi-thread, why do such meaningless
>>> delayed function call again and again?
>>>
>>> What's wrong with single thread scrub?
>>> We can do thing like in each stripe for raid56 which is easy and
>>> straightforward, only delayed thing is to wake up waiter:
>>>
>>> lock_full_stripe()
>>> if (!is_parity_stripe()) {
>>> prepare_data_stripe_bios()
>>> submit_and_wait_bios()
>>> if (check_csum() == 0)
>>> goto out;
>>> }
>>> prepare_full_stripe_bios()
>>> submit_and_wait_bios()
>>>
>>> recover_raid56_stipres();
>>> prepare_full_stripe_write_bios()
>>> submit_and_wait_bios()
>>>
>>> out:
>>> unlock_full_stripe()
>>>
>>> We really need to re-work the whole damn scrub code.
>>>
>>> Also, we need to enhance btrfs-progs to detect scrub problem(my
>>> submitted offline scrub is good enough for such usage), and tools to
>>> corrupt extents reliably to put it into xfstests test cases.
>>>
>>> RAID56 scrub code is neither tested nor well-designed.
>>> ---
>>> fs/btrfs/raid56.c | 29 ++++++++++++++++++++++++++++-
>>> 1 file changed, 28 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
>>> index d016d4a..87e3565 100644
>>> --- a/fs/btrfs/raid56.c
>>> +++ b/fs/btrfs/raid56.c
>>> @@ -133,6 +133,16 @@ struct btrfs_raid_bio {
>>> /* second bad stripe (for raid6 use) */
>>> int failb;
>>>
>>> + /*
>>> + * For steal_rbio, we can steal recovered correct page,
>>> + * but in finish_parity_scrub(), we still use bad on-disk
>>> + * page to calculate parity.
>>> + * Use these members to info finish_parity_scrub() to use
>>> + * correct pages
>>> + */
>>> + int bad_ondisk_a;
>>> + int bad_ondisk_b;
>>> +
>>> int scrubp;
>>> /*
>>> * number of pages needed to represent the full
>>> @@ -310,6 +320,12 @@ static void steal_rbio(struct btrfs_raid_bio *src, struct btrfs_raid_bio *dest)
>>> if (!test_bit(RBIO_CACHE_READY_BIT, &src->flags))
>>> return;
>>>
>>> + /* Record recovered stripe number */
>>> + if (src->faila != -1)
>>> + dest->bad_ondisk_a = src->faila;
>>> + if (src->failb != -1)
>>> + dest->bad_ondisk_b = src->failb;
>>> +
>>> for (i = 0; i < dest->nr_pages; i++) {
>>> s = src->stripe_pages[i];
>>> if (!s || !PageUptodate(s)) {
>>> @@ -998,6 +1014,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
>>> rbio->stripe_npages = stripe_npages;
>>> rbio->faila = -1;
>>> rbio->failb = -1;
>>> + rbio->bad_ondisk_a = -1;
>>> + rbio->bad_ondisk_b = -1;
>>> atomic_set(&rbio->refs, 1);
>>> atomic_set(&rbio->error, 0);
>>> atomic_set(&rbio->stripes_pending, 0);
>>> @@ -2352,7 +2370,16 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
>>> void *parity;
>>> /* first collect one page from each data stripe */
>>> for (stripe = 0; stripe < nr_data; stripe++) {
>>> - p = page_in_rbio(rbio, stripe, pagenr, 0);
>>> +
>>> + /*
>>> + * Use stolen recovered page other than bad
>>> + * on disk pages
>>> + */
>>> + if (stripe == rbio->bad_ondisk_a ||
>>> + stripe == rbio->bad_ondisk_b)
>>> + p = rbio_stripe_page(rbio, stripe, pagenr);
>>> + else
>>> + p = page_in_rbio(rbio, stripe, pagenr, 0);
>>> pointers[stripe] = kmap(p);
>>> }
>>>
>>>
>>
>>
>
>
>
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
next prev parent reply other threads:[~2016-11-22 18:02 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-11-21 8:50 [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q Qu Wenruo
2016-11-21 18:48 ` Goffredo Baroncelli
2016-11-22 0:28 ` Qu Wenruo
2016-11-22 18:02 ` Goffredo Baroncelli [this message]
2016-11-25 4:31 ` Zygo Blaxell
2016-11-25 4:40 ` Gareth Pye
2016-11-25 5:07 ` Zygo Blaxell
2016-11-26 13:12 ` Goffredo Baroncelli
2016-11-26 18:54 ` Zygo Blaxell
2016-11-26 23:16 ` Goffredo Baroncelli
2016-11-27 16:53 ` Zygo Blaxell
2016-11-28 0:40 ` Qu Wenruo
2016-11-28 18:45 ` Goffredo Baroncelli
2016-11-28 19:01 ` Christoph Anton Mitterer
2016-11-28 19:39 ` Austin S. Hemmelgarn
2016-11-28 3:37 ` Christoph Anton Mitterer
2016-11-28 3:53 ` Andrei Borzenkov
2016-11-28 4:01 ` Christoph Anton Mitterer
2016-11-28 18:32 ` Goffredo Baroncelli
2016-11-28 19:00 ` Christoph Anton Mitterer
2016-11-28 21:48 ` Zygo Blaxell
2016-11-29 1:52 ` Christoph Anton Mitterer
2016-11-29 3:19 ` Zygo Blaxell
2016-11-29 7:35 ` Adam Borowski
2016-11-29 14:24 ` Christoph Anton Mitterer
2016-11-22 18:58 ` Chris Mason
2016-11-23 0:26 ` Qu Wenruo
2016-11-26 17:18 ` Chris Mason
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=a75c9a72-148c-9fbd-dfb8-7cde58bee9c9@inwind.it \
--to=kreijack@inwind.it \
--cc=ce3g8jdj@umail.furryterror.org \
--cc=linux-btrfs@vger.kernel.org \
--cc=quwenruo@cn.fujitsu.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).