Re: some help for improvement in btrfs

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: hmsjwzb <hmsjwzb@zoho.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: some help for improvement in btrfs
Date: Tue, 6 Sep 2022 16:37:51 +0800
Message-ID: <1b00d889-bf4b-a092-38c0-fb6c6aa09fdf@gmx.com>
In-Reply-To: <9c295a8c-5167-116e-4fae-d548f1deb3b2@zoho.com>



On 2022/9/6 16:02, hmsjwzb wrote:
> Hi Qu,
>
> Thank you for providing this interesting case. I used qemu and gdb to debug it, and my summary follows.
>
> [test case]
>
> mkfs.btrfs -f -m raid5 -d raid5 -b 1G /dev/vda /dev/vdb /dev/vdc
>
> mount /dev/vda /mnt
>
> xfs_io -f -c "pwrite -S 0xee 0 64k" /mnt/file1
>
> sync
>
> 	After the above commands, the devices look like this.
>             	119865344
> 	vda |     |     stripe 0:Data for file1            |      |
>             	98893824
> 	vdb |     |     stripe 1:redundant Data for parity |      |

Note that this is not redundant data but unused space; it can contain garbage.

But still, we need this unused space to calculate the parity anyway.
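
To make that concrete, here is a minimal Python sketch (not btrfs code; `xor_parity` and the variable names are made up for illustration, and the garbage is just random bytes). It shows that P for this full stripe is the byte-wise XOR of both data stripes, garbage included, so rebuilding stripe 0 depends on reading the unused stripe back unchanged:

```python
import os

STRIPE_LEN = 64 * 1024  # per-device stripe length used in the test case


def xor_parity(*stripes):
    """Byte-wise XOR of equally sized stripes (the RAID5 P stripe)."""
    out = bytearray(STRIPE_LEN)
    for stripe in stripes:
        for i, byte in enumerate(stripe):
            out[i] ^= byte
    return bytes(out)


stripe0 = b"\xee" * STRIPE_LEN    # file1 data on vda
stripe1 = os.urandom(STRIPE_LEN)  # unused space on vdb, modelled as garbage
parity = xor_parity(stripe0, stripe1)

# Stripe 0 can be rebuilt from the unused stripe plus P, but only because
# the garbage is still exactly what it was when P was computed.
rebuilt = xor_parity(stripe1, parity)
```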

>             	98893824
> 	vdc |     |     stripe 2:parity                    |      |
>
> 	We can see that stripe 1 is not used by the btrfs filesystem.
>
> umount /dev/vda
>
> xfs_io -f -c "pwrite -S 0xff 119865344 64k" /dev/vda
>
>             	119865344
> 	vda |     |     stripe 0:Data for file1(Corrupted) |      |
>             	98893824
> 	vdb |     |     stripe 1:redundant Data for parity |      |
>             	98893824
> 	vdc |     |     stripe 2:parity                    |      |
>
> 	This command overwrites the data for file1 in stripe 0. I think it simulates
> 	data loss in hardware.

Yep, that's correct.

>
> mount /dev/vda /mnt
>
> 	If we issue a read request for file1 now, the data for file1 can still be
> 	recovered by the raid5 mechanism.
>
> xfs_io -f -c "pwrite -S 0xee 0 64k" -c sync /mnt/file2
>
>             	119865344
> 	vda |     |     stripe 0:Data for file1(Corrupted)               |      |
>             	98893824
> 	vdb |     |     stripe 1:Data for file2                          |      |
>             	98893824
> 	vdc |     |     stripe 2:parity(Recomputed with Corrupted data)  |      |
>
> 	After the above command, stripe 1 on vdb is used for file2. The parity is
> 	recomputed from the corrupted data on vda and the data of file2, so the data
> 	for file1 is lost forever.

Exactly, that's the destructive RMW idea.
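
A rough way to see it outside the kernel is the following Python sketch. It is only a model of the sequence above, not how raid56 code is written; the `xor` helper and all variable names are hypothetical, and the unused stripe is assumed to be zeroes for simplicity:

```python
STRIPE_LEN = 64 * 1024


def xor(a, b):
    """Byte-wise XOR of two equally sized stripes."""
    return bytes(x ^ y for x, y in zip(a, b))


file1 = b"\xee" * STRIPE_LEN
unused = bytes(STRIPE_LEN)     # stripe 1 before file2 is written
parity = xor(file1, unused)    # P as written by the first sync

# Corrupt stripe 0 on disk (the 0xff pwrite to the raw device).
disk_stripe0 = b"\xff" * STRIPE_LEN

# Before the RMW, stripe 0 is still recoverable from stripe 1 and P.
recovered_before = xor(unused, parity)

# RMW for file2: P is regenerated from whatever is on disk in stripe 0.
file2 = b"\xee" * STRIPE_LEN
parity = xor(disk_stripe0, file2)

# After the RMW, "recovery" only reproduces the corrupted bytes.
recovered_after = xor(file2, parity)
```

Here `recovered_before` equals the original file1 contents, while `recovered_after` equals the corrupted on-disk stripe: the new P now matches the bad data.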

>
> cat /mnt/file1 > /dev/null
>
> 	This command reads the corrupted data in stripe 0. The btrfs csum verification
> 	detects the mismatch and prints warnings.
>
> umount /mnt
>
> [some fix proposal]
>
> 	1. Can we do a parity check before every write operation? If the parity check
> 	   fails, we recover the data first and then do the write operation. We can do
> 	   this check before raid56_rmw_stripe.

That's the idea for fixing the destructive RMW: when doing a sub-stripe
write, we need to:

1. Collect the data needed to do the recovery
    For data, that should be all csums inside the full stripe.
    For metadata, although the csum is inlined, we still need to find out
    which ranges hold metadata, and this can be a little tricky.

2. Read out all data and parity stripes, including the range we're writing into
    Currently we skip the range we're going to write, but since we may
    need to do a recovery, we need the full stripe anyway.

3. Do full stripe verification before doing RMW.
    That's the core recovery; thankfully we should already have very similar
    code.
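
The three steps could be modelled roughly like this in Python. This is only a sketch under loud assumptions: one csum per whole data stripe (btrfs csums are per sector), plain CRC32 from zlib standing in for the real csum, data only (no metadata lookup), and RAID5 only. `safe_rmw`, `xor_all` and `csum` are hypothetical names, not kernel functions:

```python
import zlib

STRIPE_LEN = 64 * 1024


def xor_all(stripes):
    """Byte-wise XOR of a list of equally sized stripes."""
    out = bytearray(STRIPE_LEN)
    for stripe in stripes:
        for i, byte in enumerate(stripe):
            out[i] ^= byte
    return bytes(out)


def csum(data):
    # Stand-in checksum; an assumption for illustration only.
    return zlib.crc32(data)


def safe_rmw(data_stripes, parity, csums, idx, new_data):
    """Verify the full stripe, repair one bad data stripe, then do the RMW.

    data_stripes: on-disk data stripes, read in full (step 2)
    csums: {stripe index: expected csum}, only for stripes in use (step 1)
    """
    stripes = list(data_stripes)
    # Step 3: full stripe verification against the collected csums.
    bad = [i for i in csums if csum(stripes[i]) != csums[i]]
    if len(bad) > 1:
        raise IOError("more corrupted stripes than RAID5 can repair")
    if bad:
        i = bad[0]
        others = [s for j, s in enumerate(stripes) if j != i]
        rebuilt = xor_all(others + [parity])
        if csum(rebuilt) != csums[i]:
            raise IOError("rebuild failed, parity is not trustworthy")
        stripes[i] = rebuilt
    # Only now modify the target stripe and regenerate P.
    stripes[idx] = new_data
    return stripes, xor_all(stripes)


file1 = b"\xee" * STRIPE_LEN
file2 = b"\xee" * STRIPE_LEN
garbage = b"\x5a" * STRIPE_LEN            # unused stripe 1
old_parity = xor_all([file1, garbage])
on_disk = [b"\xff" * STRIPE_LEN, garbage]  # stripe 0 corrupted on disk

fixed, new_parity = safe_rmw(on_disk, old_parity, {0: csum(file1)}, 1, file2)
```

With the verification in front, `fixed[0]` holds the repaired file1 data and the new P is generated from good data, so file1 stays recoverable after the file2 write.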

>
> [question]
> 	I have noticed this patch.
>
> 		[PATCH PoC 0/9] btrfs: scrub: introduce a new family of ioctl, scrub_fs
> 		Hi Qu,
> 			Does any part of this patch aim to solve this problem?

Nope, that's just to improve scrub for RAID56, nothing related to this
destructive RMW thing at all.

Thanks,
Qu
>
> Thanks,
> Flint
>
>
> On 8/16/22 01:38, Qu Wenruo wrote:
>>
>>
>> On 2022/8/16 10:47, hmsjwzb wrote:
>>> Hi Qu,
>>>
>>> Sorry for interrupting you so many times.
>>>
>>> As for
>>>      scrub level checks at RAID56 substripe write time.
>>>
>>> Is this feature available in the latest linux-next branch?
>>
>> Nope, no one is working on that, thus no patches at all.
>>
>>> Or do I need to get patches from the mailing list?
>>> What is the core function of this feature ?
>>
>> The following small script would explain it pretty well:
>>
>>    mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
>>    mount $dev1 $mnt
>>
>>    xfs_io -f -c "pwrite -S 0xee 0 64K" $mnt/file1
>>    sync
>>    umount $mnt
>>
>>    # Corrupt data stripe 1 of the full stripe of the above 64K write
>>    xfs_io -f -c "pwrite -S 0xff 119865344 64K" $dev1
>>
>>    mount $dev1 $mnt
>>
>>    # Do a new write into data stripe 2,
>>    # We will trigger a RMW, which will use on-disk (corrupted) data to
>>    # generate new P/Q.
>>    xfs_io -f -c "pwrite -S 0xee 0 64K" -c sync $mnt/file2
>>
>>    # Now we can no longer read file1, as its data is corrupted, and
>>    # the above write generated new P/Q using the corrupted data stripe 1,
>>    # preventing us from recovering data stripe 1.
>>    cat $mnt/file1 > /dev/null
>>    umount $mnt
>>
>> The above script is the best way to demonstrate the "destructive RMW".
>> Although this is not btrfs specific (other RAID56 implementations are
>> also affected), it's definitely a real problem.
>>
>> There are several different directions to solve it:
>>
>> - A way to add CSUM for P/Q stripes
>>    In theory this should be the easiest way, implementation-wise.
>>    We can easily know if a P/Q stripe is correct, then before doing
>>    RMW, we verify the result of P/Q.
>>    If the result doesn't match, we know some data stripe(s) are
>>    corrupted, then rebuild the data first before the write.
>>
>>    Unfortunately, this needs an on-disk format change.
>>
>> - Full stripe verification before writes
>>    This means, before we submit sub-stripe writes, we use some scrub-like
>>    method to verify all data stripes first.
>>    Then we can do recovery if needed, then do the writes.
>>
>>    Unfortunately, scrub-like checks have quite a few limitations.
>>    Regular scrub only works on RO block groups, thus the extent tree and
>>    csum tree are consistent.
>>    But for RAID56 writes, we have no such luxury; I'm not 100% sure if
>>    this can even pass stress tests.
>>
>> Thanks,
>> Qu
>>
>>>
>>> I think I may use qemu and gdb to get a basic understanding of this feature.
>>>
>>> Thanks,
>>> Flint
>>>
>>> On 8/15/22 04:54, Qu Wenruo wrote:
>>>> scrub level checks at RAID56 substripe write time.
