* Re: some help for improvement in btrfs
[not found] ` <4691b710-3d71-bd26-d00a-66cc398f57c5@zoho.com>
@ 2022-08-16 5:38 ` Qu Wenruo
2022-09-06 8:02 ` hmsjwzb
0 siblings, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2022-08-16 5:38 UTC (permalink / raw)
To: hmsjwzb, linux-btrfs@vger.kernel.org
On 2022/8/16 10:47, hmsjwzb wrote:
> Hi Qu,
>
> Sorry for interrupting you so many times.
>
> As for
> scrub level checks at RAID56 substripe write time.
>
> Is this feature available in latest linux-next branch?
Nope, no one is working on that, thus no patches at all.
> Or do I need to get the patches from the mailing list?
> What is the core function of this feature?
The following small script would explain it pretty well:
mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
mount $dev1 $mnt
xfs_io -f -c "pwrite -S 0xee 0 64K" $mnt/file1
sync
umount $mnt
# Corrupt data stripe 1 of the full stripe of the above 64K write
xfs_io -f -c "pwrite -S 0xff 119865344 64K" $dev1
mount $dev1 $mnt
# Do a new write into data stripe 2,
# We will trigger a RMW, which will use on-disk (corrupted) data to
# generate new P/Q.
xfs_io -f -c "pwrite -S 0xee 0 64K" -c sync $mnt/file2
# Now we can no longer read file1, as its data is corrupted, and
# the above write generated new P/Q using the corrupted data stripe 1,
# preventing us from recovering data stripe 1.
cat $mnt/file1 > /dev/null
umount $mnt
The above script is the best way to demonstrate the "destructive RMW".
Although this is not btrfs specific (other RAID56 implementations are also affected),
it's definitely a real problem.
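The failure can also be reproduced in a few lines of plain Python, assuming simple XOR parity as in RAID5 (stripe sizes and variable names here are illustrative, not btrfs internals):

```python
# Toy model of a 3-device RAID5 full stripe: two data stripes + XOR parity.
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

STRIPE = 8                            # toy stripe size
d1 = bytes([0xEE] * STRIPE)           # file1 data (data stripe 1)
d2 = bytes([0x00] * STRIPE)           # data stripe 2, unused so far
parity = xor(d1, d2)                  # P = d1 ^ d2

d1_disk = bytes([0xFF] * STRIPE)      # corrupt data stripe 1 on disk

# Before any new write, file1 is still recoverable from d2 and P:
assert xor(d2, parity) == d1

# A sub-stripe write into data stripe 2 triggers RMW, which reads the
# *corrupted* d1 from disk to compute the new parity:
d2 = bytes([0xEE] * STRIPE)           # file2 data
parity = xor(d1_disk, d2)             # destructive RMW

# Recovery now reproduces the corruption instead of file1's data:
assert xor(d2, parity) == d1_disk
assert xor(d2, parity) != d1
```

The same sequence of states is what the xfs_io reproducer walks through on real devices.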
There are several different directions to solve it:
- A way to add CSUM for P/Q stripes
In theory this should be the easiest way implementation wise.
We can easily know if a P/Q stripe is correct, then before doing
RMW, we verify the result of P/Q.
If the result doesn't match, we know some data stripe(s) are
corrupted, then rebuild the data first before write.
Unfortunately, this needs an on-disk format change.
- Full stripe verification before writes
This means that before we submit sub-stripe writes, we use some scrub-like
method to verify all data stripes first.
Then we can do recovery if needed, then do the writes.
Unfortunately, scrub-like checks have quite a few limitations.
Regular scrub only works on RO block groups, so the extent tree and csum
tree are consistent.
But for RAID56 writes we have no such luxury; I'm not 100% sure
this can even pass stress tests.
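The first direction can be sketched minimally as follows, assuming a hypothetical stored checksum for the parity stripe (crc32 here purely for illustration; storing such a csum is exactly the on-disk format change mentioned above):

```python
import zlib

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d1, d2 = bytes([0xEE] * 8), bytes([0x00] * 8)
p = xor(d1, d2)
stored_p_csum = zlib.crc32(p)        # hypothetical csum stored for P

# Later, before an RMW: recompute P from the on-disk data stripes.
d1_disk = bytes([0xFF] * 8)          # data stripe 1 corrupted on disk
if zlib.crc32(xor(d1_disk, d2)) != stored_p_csum:
    # Mismatch: some data stripe (or P itself) is corrupted, so a
    # data-csum based rebuild must run before generating new P/Q.
    needs_recovery = True
else:
    needs_recovery = False

assert needs_recovery
```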
Thanks,
Qu
>
> I think I may use qemu and gdb to get a basic understanding of this feature.
>
> Thanks,
> Flint
>
> On 8/15/22 04:54, Qu Wenruo wrote:
>> scrub level checks at RAID56 substripe write time.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: some help for improvement in btrfs
2022-08-16 5:38 ` some help for improvement in btrfs Qu Wenruo
@ 2022-09-06 8:02 ` hmsjwzb
2022-09-06 8:37 ` Qu Wenruo
2022-09-06 8:59 ` delete whole file system Kengo.M
0 siblings, 2 replies; 10+ messages in thread
From: hmsjwzb @ 2022-09-06 8:02 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs@vger.kernel.org
Hi Qu,
Thank you for providing this interesting case. I used qemu and gdb to debug it and summarize my findings as follows.
[test case]
mkfs.btrfs -f -m raid5 -d raid5 -b 1G /dev/vda /dev/vdb /dev/vdc
mount /dev/vda /mnt
xfs_io -f -c "pwrite -S 0xee 0 64k" /mnt/file1
sync
After the above commands, the devices look like this.
119865344
vda | | stripe 0:Data for file1 | |
98893824
vdb | | stripe 1:redundant Data for parity | |
98893824
vdc | | stripe 2:parity | |
We can see that stripe 1 is not used by the btrfs filesystem.
umount /dev/vda
xfs_io -f -c "pwrite -S 0xff 119865344 64k" /dev/vda
119865344
vda | | stripe 0:Data for file1(Corrupted) | |
98893824
vdb | | stripe 1:redundant Data for parity | |
98893824
vdc | | stripe 2:parity | |
This command erases the data for file1 in stripe 0. I think it simulates
data loss in hardware.
mount /dev/vda /mnt
If we issue a read request for file1 now, the data for file1 can be recovered
by the raid5 mechanism.
xfs_io -f -c "pwrite -S 0xee 0 64k" -c sync /mnt/file2
119865344
vda | | stripe 0:Data for file1(Corrupted) | |
98893824
vdb | | stripe 1:Data for file2 | |
98893824
vdc | | stripe 2:parity(Recomputed with Corrupted data) | |
After the above command, stripe 1 on vdb is used for file2. The parity
is recomputed from the corrupted data on vda and the data of file2, so the data
for file1 is lost forever.
cat /mnt/file1 > /dev/null
This command will read the corrupted data in stripe 0, and the btrfs csum
machinery will detect the mismatch and print warnings.
umount /mnt
[some fix proposal]
1. Can we do a parity check before every write operation? If the parity check fails,
we recover the data first and then do the write operation. We could do this
check before raid56_rmw_stripe.
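One caveat with a bare parity check, shown in a hedged toy sketch (XOR parity, illustrative values): with single parity, a mismatch proves *that* the full stripe is inconsistent but not *which* stripe is bad, so per-block data checksums are still needed to decide what to rebuild.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d1, d2 = bytes([0xEE] * 8), bytes([0x11] * 8)
p = xor(d1, d2)
bad = bytes([0xFF] * 8)

# Corrupting either data stripe produces the same symptom: P mismatch.
case_d1_bad = xor(bad, d2) != p
case_d2_bad = xor(d1, bad) != p
assert case_d1_bad and case_d2_bad   # indistinguishable by parity alone
```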
[question]
I have noticed this patch.
[PATCH PoC 0/9] btrfs: scrub: introduce a new family of ioctl, scrub_fs
Hi Qu,
Is some part of this patch series aimed at solving this problem?
Thanks,
Flint
On 8/16/22 01:38, Qu Wenruo wrote:
>
>
> On 2022/8/16 10:47, hmsjwzb wrote:
>> Hi Qu,
>>
>> Sorry for interrupting you so many times.
>>
>> As for
>> scrub level checks at RAID56 substripe write time.
>>
>> Is this feature available in latest linux-next branch?
>
> Nope, no one is working on that, thus no patches at all.
>
>> Or do I need to get the patches from the mailing list?
>> What is the core function of this feature?
>
> The following small script would explain it pretty well:
>
> mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
> mount $dev1 $mnt
>
> xfs_io -f -c "pwrite -S 0xee 0 64K" $mnt/file1
> sync
> umount $mnt
>
> # Corrupt data stripe 1 of the full stripe of the above 64K write
> xfs_io -f -c "pwrite -S 0xff 119865344 64K" $dev1
>
> mount $dev1 $mnt
>
> # Do a new write into data stripe 2,
> # We will trigger a RMW, which will use on-disk (corrupted) data to
> # generate new P/Q.
> xfs_io -f -c "pwrite -S 0xee 0 64K" -c sync $mnt/file2
>
> # Now we can no longer read file1, as its data is corrupted, and
> # the above write generated new P/Q using the corrupted data stripe 1,
> # preventing us from recovering data stripe 1.
> cat $mnt/file1 > /dev/null
> umount $mnt
>
> The above script is the best way to demonstrate the "destructive RMW".
> Although this is not btrfs specific (other RAID56 implementations are also affected),
> it's definitely a real problem.
>
> There are several different directions to solve it:
>
> - A way to add CSUM for P/Q stripes
> In theory this should be the easiest way implementation wise.
> We can easily know if a P/Q stripe is correct, then before doing
> RMW, we verify the result of P/Q.
> If the result doesn't match, we know some data stripe(s) are
> corrupted, then rebuild the data first before write.
>
> Unfortunately, this needs an on-disk format change.
>
> - Full stripe verification before writes
> This means that before we submit sub-stripe writes, we use some scrub-like
> method to verify all data stripes first.
> Then we can do recovery if needed, then do the writes.
>
> Unfortunately, scrub-like checks have quite a few limitations.
> Regular scrub only works on RO block groups, so the extent tree and csum
> tree are consistent.
> But for RAID56 writes we have no such luxury; I'm not 100% sure
> this can even pass stress tests.
>
> Thanks,
> Qu
>
>>
>> I think I may use qemu and gdb to get a basic understanding of this feature.
>>
>> Thanks,
>> Flint
>>
>> On 8/15/22 04:54, Qu Wenruo wrote:
>>> scrub level checks at RAID56 substripe write time.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: some help for improvement in btrfs
2022-09-06 8:02 ` hmsjwzb
@ 2022-09-06 8:37 ` Qu Wenruo
2022-09-07 8:24 ` hmsjwzb
2022-09-06 8:59 ` delete whole file system Kengo.M
1 sibling, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2022-09-06 8:37 UTC (permalink / raw)
To: hmsjwzb, linux-btrfs@vger.kernel.org
On 2022/9/6 16:02, hmsjwzb wrote:
> Hi Qu,
>
> Thank you for providing this interesting case. I used qemu and gdb to debug it and summarize my findings as follows.
>
> [test case]
>
> mkfs.btrfs -f -m raid5 -d raid5 -b 1G /dev/vda /dev/vdb /dev/vdc
>
> mount /dev/vda /mnt
>
> xfs_io -f -c "pwrite -S 0xee 0 64k" /mnt/file1
>
> sync
>
> After the above commands, the devices look like this.
> 119865344
> vda | | stripe 0:Data for file1 | |
> 98893824
> vdb | | stripe 1:redundant Data for parity | |
Note, that is not redundant data but unused space; it can be garbage.
But we still need this unused space to calculate the parity anyway.
> 98893824
> vdc | | stripe 2:parity | |
>
> We can see that stripe 1 is not used by the btrfs filesystem.
>
> umount /dev/vda
>
> xfs_io -f -c "pwrite -S 0xff 119865344 64k" /dev/vda
>
> 119865344
> vda | | stripe 0:Data for file1(Corrupted) | |
> 98893824
> vdb | | stripe 1:redundant Data for parity | |
> 98893824
> vdc | | stripe 2:parity | |
>
> This command erases the data for file1 in stripe 0. I think it simulates
> data loss in hardware.
Yep, that's correct.
>
> mount /dev/vda /mnt
>
> If we issue a read request for file1 now, the data for file1 can be recovered
> by raid5 mechanism.
>
> xfs_io -f -c "pwrite -S 0xee 0 64k" -c sync /mnt/file2
>
> 119865344
> vda | | stripe 0:Data for file1(Corrupted) | |
> 98893824
> vdb | | stripe 1:Data for file2 | |
> 98893824
> vdc | | stripe 2:parity(Recomputed with Corrupted data) | |
>
> After the above command, stripe 1 on vdb is used for file2. The parity
> is recomputed from the corrupted data on vda and the data of file2, so the data
> for file1 is lost forever.
Exactly, that's the destructive RMW idea.
>
> cat /mnt/file1 > /dev/null
>
> This command will read the corrupted data in stripe 0, and the btrfs csum
> machinery will detect the mismatch and print warnings.
>
> umount /mnt
>
> [some fix proposal]
>
> 1. Can we do a parity check before every write operation? If the parity check fails,
> we recover the data first and then do the write operation. We could do this
> check before raid56_rmw_stripe.
That's the idea to fix the destructive RMW, i.e. when doing a sub-stripe
write, we need to:
1. Collect the data needed to do the recovery
For data, that should be all the csums inside the full stripe.
For metadata, although the csum is inlined, we still need to find out
which range has metadata, and this can be a little tricky.
2. Read out all data and parity stripes, including the range we're writing into
Currently we skip the range we're going to write, but since we may
need to do a recovery, we need the full stripe anyway.
3. Do full stripe verification before doing RMW.
That's the core recovery; thankfully we should have very similar
code existing already.
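The three steps could be sketched roughly like this in plain Python, with XOR parity and crc32 standing in for the btrfs csum tree; function and variable names are made up, and the metadata/unused-space subtleties above are glossed over:

```python
import zlib

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def substripe_write(data, parity, expected_csums, target, new_data):
    """Sub-stripe write with pre-RMW verification (hypothetical flow).

    data:           on-disk data stripes (full stripe read, step 2)
    expected_csums: per-stripe csums collected beforehand (step 1)

    Every data stripe is verified before the RMW (step 3); a bad one is
    rebuilt from the remaining stripes plus parity.
    """
    for i, stripe in enumerate(data):
        if zlib.crc32(stripe) != expected_csums[i]:
            rebuilt = parity
            for j, other in enumerate(data):
                if j != i:
                    rebuilt = xor(rebuilt, other)
            data[i] = rebuilt                 # recover before writing
    data[target] = new_data                   # now the RMW is safe
    new_parity = data[0]
    for stripe in data[1:]:
        new_parity = xor(new_parity, stripe)
    return data, new_parity
```

Replaying the reproducer through this path, the corrupted stripe is rebuilt before the new parity is generated, so the old file's data stays recoverable.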
>
> [question]
> I have noticed this patch.
>
> [PATCH PoC 0/9] btrfs: scrub: introduce a new family of ioctl, scrub_fs
> Hi Qu,
> Is some part of this patch series aimed at solving this problem?
Nope, that's just to improve scrub for RAID56, nothing related to this
destructive RMW thing at all.
Thanks,
Qu
>
> Thanks,
> Flint
>
>
> On 8/16/22 01:38, Qu Wenruo wrote:
>>
>>
>> On 2022/8/16 10:47, hmsjwzb wrote:
>>> Hi Qu,
>>>
>>> Sorry for interrupting you so many times.
>>>
>>> As for
>>> scrub level checks at RAID56 substripe write time.
>>>
>>> Is this feature available in latest linux-next branch?
>>
>> Nope, no one is working on that, thus no patches at all.
>>
>>> Or do I need to get the patches from the mailing list?
>>> What is the core function of this feature?
>>
>> The following small script would explain it pretty well:
>>
>> mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
>> mount $dev1 $mnt
>>
>> xfs_io -f -c "pwrite -S 0xee 0 64K" $mnt/file1
>> sync
>> umount $mnt
>>
>> # Corrupt data stripe 1 of the full stripe of the above 64K write
>> xfs_io -f -c "pwrite -S 0xff 119865344 64K" $dev1
>>
>> mount $dev1 $mnt
>>
>> # Do a new write into data stripe 2,
>> # We will trigger a RMW, which will use on-disk (corrupted) data to
>> # generate new P/Q.
>> xfs_io -f -c "pwrite -S 0xee 0 64K" -c sync $mnt/file2
>>
>> # Now we can no longer read file1, as its data is corrupted, and
>> # the above write generated new P/Q using the corrupted data stripe 1,
>> # preventing us from recovering data stripe 1.
>> cat $mnt/file1 > /dev/null
>> umount $mnt
>>
>> The above script is the best way to demonstrate the "destructive RMW".
>> Although this is not btrfs specific (other RAID56 implementations are also affected),
>> it's definitely a real problem.
>>
>> There are several different directions to solve it:
>>
>> - A way to add CSUM for P/Q stripes
>> In theory this should be the easiest way implementation wise.
>> We can easily know if a P/Q stripe is correct, then before doing
>> RMW, we verify the result of P/Q.
>> If the result doesn't match, we know some data stripe(s) are
>> corrupted, then rebuild the data first before write.
>>
>> Unfortunately, this needs an on-disk format change.
>>
>> - Full stripe verification before writes
>> This means that before we submit sub-stripe writes, we use some scrub-like
>> method to verify all data stripes first.
>> Then we can do recovery if needed, then do the writes.
>>
>> Unfortunately, scrub-like checks have quite a few limitations.
>> Regular scrub only works on RO block groups, so the extent tree and csum
>> tree are consistent.
>> But for RAID56 writes we have no such luxury; I'm not 100% sure
>> this can even pass stress tests.
>>
>> Thanks,
>> Qu
>>
>>>
>>> I think I may use qemu and gdb to get a basic understanding of this feature.
>>>
>>> Thanks,
>>> Flint
>>>
>>> On 8/15/22 04:54, Qu Wenruo wrote:
>>>> scrub level checks at RAID56 substripe write time.
^ permalink raw reply [flat|nested] 10+ messages in thread
* delete whole file system
2022-09-06 8:02 ` hmsjwzb
2022-09-06 8:37 ` Qu Wenruo
@ 2022-09-06 8:59 ` Kengo.M
2022-09-06 9:12 ` Hugo Mills
1 sibling, 1 reply; 10+ messages in thread
From: Kengo.M @ 2022-09-06 8:59 UTC (permalink / raw)
To: linux-btrfs
Hi folks
I made a raid5 file system with btrfs like below:
sudo mkfs.btrfs -L raid5.btrfs -d raid5 -m raid1 -f /dev/sda /dev/sdb
/dev/sdc /dev/sdd /dev/sde
And mounted it at /mnt/raid5.btrfs.
sudo btrfs filesystem show
Label: 'raid5.btrf' uuid: 23a34a45-8f5e-40f5-8cda-xxxxxxxxxxxx
Total devices 5 FS bytes used 128.00KiB
devid 1 size 2.73TiB used 1.13GiB path /dev/sda
devid 2 size 2.73TiB used 1.13GiB path /dev/sdb
devid 3 size 2.73TiB used 1.13GiB path /dev/sdc
devid 4 size 2.73TiB used 1.13GiB path /dev/sdd
devid 5 size 2.73TiB used 1.13GiB path /dev/sde
So, I want to delete this file system.
btrfs device delete /dev/sda /mnt/raid5.btrfs
But that deletes /dev/sda only.
Please, someone tell me how to delete this whole file system.
BRD
Kengo.m
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: delete whole file system
2022-09-06 8:59 ` delete whole file system Kengo.M
@ 2022-09-06 9:12 ` Hugo Mills
2022-09-06 10:28 ` Kengo.M
0 siblings, 1 reply; 10+ messages in thread
From: Hugo Mills @ 2022-09-06 9:12 UTC (permalink / raw)
To: Kengo.M; +Cc: linux-btrfs
On Tue, Sep 06, 2022 at 05:59:21PM +0900, Kengo.M wrote:
> Hi folks
>
> I made raid5 file system by btrfs like below
>
> sudo mkfs.btrfs -L raid5.btrfs -d raid5 -m raid1 -f /dev/sda /dev/sdb
> /dev/sdc /dev/sdd /dev/sde
>
> And mount /mnt/raid5.btrfs
>
> sudo btrfs filesystem show
>
> Label: 'raid5.btrf' uuid: 23a34a45-8f5e-40f5-8cda-xxxxxxxxxxxx
> Total devices 5 FS bytes used 128.00KiB
> devid 1 size 2.73TiB used 1.13GiB path /dev/sda
> devid 2 size 2.73TiB used 1.13GiB path /dev/sdb
> devid 3 size 2.73TiB used 1.13GiB path /dev/sdc
> devid 4 size 2.73TiB used 1.13GiB path /dev/sdd
> devid 5 size 2.73TiB used 1.13GiB path /dev/sde
>
>
> So,I want to delete this file system.
>
> btrfs device delete /dev/sda /mnt/raid5.btrfs
>
> But delete /dev/sda only.
>
> Please Someone tell me how to delete whole this file system.
>
> BRD
>
> Kengo.m
"btrfs dev delete" removes a device from the FS, leaving the rest of
it intact. To destroy the filesystem completely, wipe the start of
each device. You can do that with any tool that will write
data (dd if=/dev/zero is a popular one here), but there's a generic
tool called wipefs that will do it for any filesystem with minimal
writes to the device and a good level of recoverability if you get it
wrong.
Hugo.
--
Hugo Mills | Klytus, I'm bored. What plaything can you offer me
hugo@... carfax.org.uk | today?
http://carfax.org.uk/ |
PGP: E2AB1DE4 | Ming the Merciless, Flash Gordon
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: delete whole file system
2022-09-06 9:12 ` Hugo Mills
@ 2022-09-06 10:28 ` Kengo.M
0 siblings, 0 replies; 10+ messages in thread
From: Kengo.M @ 2022-09-06 10:28 UTC (permalink / raw)
To: Hugo Mills; +Cc: linux-btrfs
Hi Hugo
Thanks for your information.
I know wipefs,
but it's too primitive.
I will try it later.
Thanks again
kengo
At 10:12 +0100 2022.09.06, Hugo Mills wrote:
>On Tue, Sep 06, 2022 at 05:59:21PM +0900, Kengo.M wrote:
>> Hi folks
>>
>> I made a raid5 file system with btrfs like below:
>>
>> sudo mkfs.btrfs -L raid5.btrfs -d raid5 -m raid1 -f /dev/sda /dev/sdb
>> /dev/sdc /dev/sdd /dev/sde
>>
>> And mount /mnt/raid5.btrfs
>>
>> sudo btrfs filesystem show
>>
>> Label: 'raid5.btrf' uuid: 23a34a45-8f5e-40f5-8cda-xxxxxxxxxxxx
>> Total devices 5 FS bytes used 128.00KiB
>> devid 1 size 2.73TiB used 1.13GiB path /dev/sda
>> devid 2 size 2.73TiB used 1.13GiB path /dev/sdb
>> devid 3 size 2.73TiB used 1.13GiB path /dev/sdc
>> devid 4 size 2.73TiB used 1.13GiB path /dev/sdd
>> devid 5 size 2.73TiB used 1.13GiB path /dev/sde
>>
>>
>> So, I want to delete this file system.
>>
>> btrfs device delete /dev/sda /mnt/raid5.btrfs
>>
>> But that deletes /dev/sda only.
>>
>> Please, someone tell me how to delete this whole file system.
>>
>> BRD
>>
>> Kengo.m
>
> "btrfs dev delete" removes a device from the FS, leaving the rest of
>it intact. To destroy the filesystem completely, wipe the start of
>each device. You can do that with any tool that will write
>data (dd if=/dev/zero is a popular one here), but there's a generic
>tool called wipefs that will do it for any filesystem with minimal
>writes to the device and a good level of recoverability if you get it
>wrong.
>
> Hugo.
>
>--
>Hugo Mills | Klytus, I'm bored. What plaything can you offer me
>hugo@... carfax.org.uk | today?
>http://carfax.org.uk/ |
>PGP: E2AB1DE4 | Ming the Merciless, Flash Gordon
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: some help for improvement in btrfs
2022-09-06 8:37 ` Qu Wenruo
@ 2022-09-07 8:24 ` hmsjwzb
2022-09-07 8:46 ` Qu Wenruo
0 siblings, 1 reply; 10+ messages in thread
From: hmsjwzb @ 2022-09-07 8:24 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs@vger.kernel.org
On 9/6/22 04:37, Qu Wenruo wrote:
>
>
> On 2022/9/6 16:02, hmsjwzb wrote:
>> Hi Qu,
>>
>> Thank you for providing this interesting case. I used qemu and gdb to debug it and summarize my findings as follows.
>>
>> [test case]
>>
>> mkfs.btrfs -f -m raid5 -d raid5 -b 1G /dev/vda /dev/vdb /dev/vdc
>>
>> mount /dev/vda /mnt
>>
>> xfs_io -f -c "pwrite -S 0xee 0 64k" /mnt/file1
>>
>> sync
>>
>> After the above commands, the devices look like this.
>> 119865344
>> vda | | stripe 0:Data for file1 | |
>> 98893824
>> vdb | | stripe 1:redundant Data for parity | |
>
> Note, that is not redundant data but unused space; it can be garbage.
>
> But we still need this unused space to calculate the parity anyway.
>
>> 98893824
>> vdc | | stripe 2:parity | |
>>
>> We can see that stripe 1 is not used by the btrfs filesystem.
>>
>> umount /dev/vda
>>
>> xfs_io -f -c "pwrite -S 0xff 119865344 64k" /dev/vda
>>
>> 119865344
>> vda | | stripe 0:Data for file1(Corrupted) | |
>> 98893824
>> vdb | | stripe 1:redundant Data for parity | |
>> 98893824
>> vdc | | stripe 2:parity | |
>>
>> This command erases the data for file1 in stripe 0. I think it simulates
>> data loss in hardware.
>
> Yep, that's correct.
>
>>
>> mount /dev/vda /mnt
>>
>> If we issue a read request for file1 now, the data for file1 can be recovered
>> by raid5 mechanism.
>>
>> xfs_io -f -c "pwrite -S 0xee 0 64k" -c sync /mnt/file2
>>
>> 119865344
>> vda | | stripe 0:Data for file1(Corrupted) | |
>> 98893824
>> vdb | | stripe 1:Data for file2 | |
>> 98893824
>> vdc | | stripe 2:parity(Recomputed with Corrupted data) | |
>>
>> After the above command, stripe 1 on vdb is used for file2. The parity
>> is recomputed from the corrupted data on vda and the data of file2, so the data
>> for file1 is lost forever.
>
> Exactly, that's the destructive RMW idea.
>
>>
>> cat /mnt/file1 > /dev/null
>>
>> This command will read the corrupted data in stripe 0, and the btrfs csum
>> machinery will detect the mismatch and print warnings.
>>
>> umount /mnt
>>
>> [some fix proposal]
>>
>> 1. Can we do a parity check before every write operation? If the parity check fails,
>> we recover the data first and then do the write operation. We could do this
>> check before raid56_rmw_stripe.
>
> That's the idea to fix the destructive RMW, i.e. when doing a sub-stripe
> write, we need to:
>
> 1. Collect the data needed to do the recovery
> For data, that should be all the csums inside the full stripe.
> For metadata, although the csum is inlined, we still need to find out
> which range has metadata, and this can be a little tricky.
Hi Qu,
As for a fix, I think the following method doesn't need an on-disk format change.
For Data:
When we write to disk, we do checksum for the writing data.
dev1 |...|DDUUUUUUUUUU|...|
dev2 |...|UUUUUUUUUUUU|...|
dev3 |...|PPPPPPPPPPPP|...|
D: data  P: parity  U/blank: unused
The checksum block will look like the following.
dev1 |...|CC |...|
dev2 |...| |...|
dev3 |...| |...|
C:Checksum
So if data is corrupted in the area we wrote, we can detect it. But if the corruption happens in
an unused block, we can do nothing about it.
The checksums of the blocks marked with C will be calculated and inserted into the checksum tree.
But what if we checksum the full stripe?
|<--stripe1->|
dev1 |...|CCCCCCCCCCCC|...|
|<--stripe2->|
dev2 |...|CCCCCCCCCCCC|...|
dev3 |...| |...|
So in the RMW process, we can do the following:
1. Calculate the checksums of the blocks in stripe1 and compare them with the checksums in the checksum tree.
If they mismatch, stripe1 is corrupted; otherwise stripe1 is good.
2. We can do the same for stripe2, so we can tell whether stripe2 is corrupted or good.
3. With this, if the parity check fails, we know which stripe is corrupted and can use the good
data to recover the bad.
Thanks,
Flint
> 2. Read out all data and parity stripes, including the range we're writing into
> Currently we skip the range we're going to write, but since we may
> need to do a recovery, we need the full stripe anyway.
>
> 3. Do full stripe verification before doing RMW.
> That's the core recovery; thankfully we should have very similar
> code existing already.
>
>>
>> [question]
>> I have noticed this patch.
>>
>> [PATCH PoC 0/9] btrfs: scrub: introduce a new family of ioctl, scrub_fs
>> Hi Qu,
>> Is some part of this patch series aimed at solving this problem?
>
> Nope, that's just to improve scrub for RAID56, nothing related to this
> destructive RMW thing at all.
>
> Thanks,
> Qu
>>
>> Thanks,
>> Flint
>>
>>
>> On 8/16/22 01:38, Qu Wenruo wrote:
>>>
>>>
>>> On 2022/8/16 10:47, hmsjwzb wrote:
>>>> Hi Qu,
>>>>
>>>> Sorry for interrupting you so many times.
>>>>
>>>> As for
>>>> scrub level checks at RAID56 substripe write time.
>>>>
>>>> Is this feature available in latest linux-next branch?
>>>
>>> Nope, no one is working on that, thus no patches at all.
>>>
>>>> Or do I need to get the patches from the mailing list?
>>>> What is the core function of this feature?
>>>
>>> The following small script would explain it pretty well:
>>>
>>> mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
>>> mount $dev1 $mnt
>>>
>>> xfs_io -f -c "pwrite -S 0xee 0 64K" $mnt/file1
>>> sync
>>> umount $mnt
>>>
>>> # Corrupt data stripe 1 of the full stripe of the above 64K write
>>> xfs_io -f -c "pwrite -S 0xff 119865344 64K" $dev1
>>>
>>> mount $dev1 $mnt
>>>
>>> # Do a new write into data stripe 2,
>>> # We will trigger a RMW, which will use on-disk (corrupted) data to
>>> # generate new P/Q.
>>> xfs_io -f -c "pwrite -S 0xee 0 64K" -c sync $mnt/file2
>>>
>>> # Now we can no longer read file1, as its data is corrupted, and
>>> # the above write generated new P/Q using the corrupted data stripe 1,
>>> # preventing us from recovering data stripe 1.
>>> cat $mnt/file1 > /dev/null
>>> umount $mnt
>>>
>>> The above script is the best way to demonstrate the "destructive RMW".
>>> Although this is not btrfs specific (other RAID56 implementations are also affected),
>>> it's definitely a real problem.
>>>
>>> There are several different directions to solve it:
>>>
>>> - A way to add CSUM for P/Q stripes
>>> In theory this should be the easiest way implementation wise.
>>> We can easily know if a P/Q stripe is correct, then before doing
>>> RMW, we verify the result of P/Q.
>>> If the result doesn't match, we know some data stripe(s) are
>>> corrupted, then rebuild the data first before write.
>>>
>>> Unfortunately, this needs an on-disk format change.
>>>
>>> - Full stripe verification before writes
>>> This means that before we submit sub-stripe writes, we use some scrub-like
>>> method to verify all data stripes first.
>>> Then we can do recovery if needed, then do the writes.
>>>
>>> Unfortunately, scrub-like checks have quite a few limitations.
>>> Regular scrub only works on RO block groups, so the extent tree and csum
>>> tree are consistent.
>>> But for RAID56 writes we have no such luxury; I'm not 100% sure
>>> this can even pass stress tests.
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>> I think I may use qemu and gdb to get a basic understanding of this feature.
>>>>
>>>> Thanks,
>>>> Flint
>>>>
>>>> On 8/15/22 04:54, Qu Wenruo wrote:
>>>>> scrub level checks at RAID56 substripe write time.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: some help for improvement in btrfs
2022-09-07 8:24 ` hmsjwzb
@ 2022-09-07 8:46 ` Qu Wenruo
2022-09-08 9:21 ` hmsjwzb
0 siblings, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2022-09-07 8:46 UTC (permalink / raw)
To: hmsjwzb, linux-btrfs@vger.kernel.org
On 2022/9/7 16:24, hmsjwzb wrote:
>
[...]
>> That's the idea to fix the destructive RMW, i.e. when doing a sub-stripe
>> write, we need to:
>>
>> 1. Collect the data needed to do the recovery
>> For data, that should be all the csums inside the full stripe.
>> For metadata, although the csum is inlined, we still need to find out
>> which range has metadata, and this can be a little tricky.
>
> Hi Qu,
>
> As for a fix, I think the following method doesn't need an on-disk format change.
I never mentioned that we need an on-disk format change.
I just mentioned some pitfalls you need to look out for.
>
> For Data:
> When we write to disk, we do checksum for the writing data.
>
> dev1 |...|DDUUUUUUUUUU|...|
> dev2 |...|UUUUUUUUUUUU|...|
> dev3 |...|PPPPPPPPPPPP|...|
>
> D: data  P: parity  U/blank: unused
>
> The checksum block will look like the following.
> dev1 |...|CC |...|
> dev2 |...| |...|
> dev3 |...| |...|
>
> C:Checksum
>
> So if data is corrupted in the area we wrote, we can detect it. But if the corruption happens in
> an unused block, we can do nothing about it.
We don't need to bother with corruption in unused space at all.
We only need to ensure our data and parity are correct.
So if the data sectors are fine, a mismatched parity is not a
problem: we just do the regular RMW, and the correct parity will be
re-calculated and written back to that disk.
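This point can be seen directly in a toy XOR-parity model (illustrative values only): a stale or garbage parity is harmless when the data stripes pass their csums, because the regular RMW rewrites parity from the verified data anyway.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d1, d2 = bytes([0xEE] * 8), bytes([0x11] * 8)
stale_p = bytes([0x55] * 8)          # garbage parity on disk

assert xor(d1, d2) != stale_p        # parity mismatches, data is fine

# The regular RMW recomputes parity from the (csum-verified) data,
# so the full stripe heals its parity as a side effect:
new_p = xor(d1, d2)
assert xor(new_p, d2) == d1          # recovery works again
```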
> The checksums of the blocks marked with C will be calculated and inserted into the checksum tree.
Why insert? There is no insert needed at all.
Furthermore, data csum insertion happens way later; the RAID56 layer should
not bother inserting csums.
If you're talking about writing the first two sectors, they should not
have a csum at all, as data COW ensures we can only write data into unused
space.
Please make it clear what you really want to do, using W for
blocks to write, U for unused, and C for old data which has a csum.
(Don't bother with the NODATASUM case for now.)
Thanks,
Qu
>
> But what if we checksum the full stripe?
> |<--stripe1->|
> dev1 |...|CCCCCCCCCCCC|...|
> |<--stripe2->|
> dev2 |...|CCCCCCCCCCCC|...|
> dev3 |...| |...|
>
> So in the RMW process, we can do the following:
> 1. Calculate the checksums of the blocks in stripe1 and compare them with the checksums in the checksum tree.
> If they mismatch, stripe1 is corrupted; otherwise stripe1 is good.
> 2. We can do the same for stripe2, so we can tell whether stripe2 is corrupted or good.
> 3. With this, if the parity check fails, we know which stripe is corrupted and can use the good
> data to recover the bad.
>
> Thanks,
> Flint
>
>> 2. Read out all data and parity stripes, including the range we're writing into
>> Currently we skip the range we're going to write, but since we may
>> need to do a recovery, we need the full stripe anyway.
>>
>> 3. Do full stripe verification before doing RMW.
>> That's the core recovery; thankfully we should have very similar
>> code existing already.
>>
>>>
>>> [question]
>>> I have noticed this patch.
>>>
>>> [PATCH PoC 0/9] btrfs: scrub: introduce a new family of ioctl, scrub_fs
>>> Hi Qu,
>>> Is some part of this patch series aimed at solving this problem?
>>
>> Nope, that's just to improve scrub for RAID56, nothing related to this
>> destructive RMW thing at all.
>>
>> Thanks,
>> Qu
>>>
>>> Thanks,
>>> Flint
>>>
>>>
>>> On 8/16/22 01:38, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2022/8/16 10:47, hmsjwzb wrote:
>>>>> Hi Qu,
>>>>>
>>>>> Sorry for interrupting you so many times.
>>>>>
>>>>> As for
>>>>> scrub level checks at RAID56 substripe write time.
>>>>>
>>>>> Is this feature available in latest linux-next branch?
>>>>
>>>> Nope, no one is working on that, thus no patches at all.
>>>>
>>>>> Or do I need to get the patches from the mailing list?
>>>>> What is the core function of this feature?
>>>>
>>>> The following small script would explain it pretty well:
>>>>
>>>> mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
>>>> mount $dev1 $mnt
>>>>
>>>> xfs_io -f -c "pwrite -S 0xee 0 64K" $mnt/file1
>>>> sync
>>>> umount $mnt
>>>>
>>>> # Corrupt data stripe 1 of the full stripe of the above 64K write
>>>> xfs_io -f -c "pwrite -S 0xff 119865344 64K" $dev1
>>>>
>>>> mount $dev1 $mnt
>>>>
>>>> # Do a new write into data stripe 2,
>>>> # We will trigger a RMW, which will use on-disk (corrupted) data to
>>>> # generate new P/Q.
>>>> xfs_io -f -c "pwrite -S 0xee 0 64K" -c sync $mnt/file2
>>>>
>>>> # Now we can no longer read file1, as its data is corrupted, and
>>>> # the above write generated new P/Q using the corrupted data stripe 1,
>>>> # preventing us from recovering data stripe 1.
>>>> cat $mnt/file1 > /dev/null
>>>> umount $mnt
>>>>
>>>> The above script is the best way to demonstrate the "destructive RMW".
>>>> Although this is not btrfs specific (other RAID56 implementations are also affected),
>>>> it's definitely a real problem.
>>>>
>>>> There are several different directions to solve it:
>>>>
>>>> - A way to add CSUM for P/Q stripes
>>>> In theory this should be the easiest way implementation wise.
>>>> We can easily know if a P/Q stripe is correct, then before doing
>>>> RMW, we verify the result of P/Q.
>>>> If the result doesn't match, we know some data stripe(s) are
>>>> corrupted, then rebuild the data first before write.
>>>>
>>>> Unfortunately, this needs an on-disk format change.
>>>>
>>>> - Full stripe verification before writes
>>>> This means that before we submit sub-stripe writes, we use some scrub-like
>>>> method to verify all data stripes first.
>>>> Then we can do recovery if needed, then do the writes.
>>>>
>>>> Unfortunately, scrub-like checks have quite a few limitations.
>>>> Regular scrub only works on RO block groups, so the extent tree and csum
>>>> tree are consistent.
>>>> But for RAID56 writes we have no such luxury; I'm not 100% sure
>>>> this can even pass stress tests.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> I think I may use qemu and gdb to get a basic understanding of this feature.
>>>>>
>>>>> Thanks,
>>>>> Flint
>>>>>
>>>>> On 8/15/22 04:54, Qu Wenruo wrote:
>>>>>> scrub level checks at RAID56 substripe write time.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: some help for improvement in btrfs
2022-09-07 8:46 ` Qu Wenruo
@ 2022-09-08 9:21 ` hmsjwzb
2022-09-08 9:28 ` Qu Wenruo
0 siblings, 1 reply; 10+ messages in thread
From: hmsjwzb @ 2022-09-08 9:21 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs@vger.kernel.org
Hi Qu,
Thanks for your reply. Sorry for the ambiguity and inaccuracy in my last mail.
This email is intended to express my idea in detail.
[Main idea]
Calculate the checksum of all blocks in a full stripe. Use these checksums to
tell which stripe is corrupted.
Let's go back to the destructive RMW.
[test case]
mkfs.btrfs -f -m raid5 -d raid5 -b 1G /dev/vda /dev/vdb /dev/vdc
mount /dev/vda /mnt
xfs_io -f -c "pwrite -S 0xee 0 64k" /mnt/file1
sync
umount /dev/vda
xfs_io -f -c "pwrite -S 0xff 119865344 64k" /dev/vda
mount /dev/vda /mnt
xfs_io -f -c "pwrite -S 0xee 0 64k" -c sync /mnt/file2
At this point, the layout of devices is as follows.
|<---stripe1---->|
vda |...|CCCCCCCCCCCCCCCC|...|
|<---stripe2---->|
vdb |...|UUUUUUUUUUUUUUUU|...|
|<---parity----->|
vdc |...|PPPPPPPPPPPPPPPP|...|
C:corrupted U:unused P:parity
Before the data of file2 is written to stripe2 of vdb, we can still
recover the data of stripe1 from stripe2 and the parity.
But here is the problem:
how can we know which stripe is corrupted?
We need a checksum for the whole stripe.
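This window can be modeled with plain XOR parity (a toy sketch, not btrfs code; the stripe size and byte patterns are illustrative stand-ins for the 64K stripes in the test case):

```python
# Toy model of the 3-device RAID5 full stripe above: two data stripes
# plus one XOR parity stripe. All sizes and values are illustrative.
STRIPE = 16

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

stripe1 = b"\xee" * STRIPE        # file1 data on vda
stripe2 = b"\x00" * STRIPE        # unused space on vdb
parity = xor(stripe1, stripe2)    # consistent parity on vdc

stripe1 = b"\xff" * STRIPE        # silent corruption via the raw-device write

# At this point recovery is still possible: stripe2 XOR parity gives file1 back.
assert xor(stripe2, parity) == b"\xee" * STRIPE

# Destructive RMW: the file2 write recomputes parity from the *corrupted*
# on-disk stripe1, so the redundancy now encodes the corruption itself.
stripe2 = b"\xee" * STRIPE
parity = xor(stripe1, stripe2)
assert xor(stripe2, parity) == b"\xff" * STRIPE   # file1 is unrecoverable
```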
But if we calculate the checksums of all blocks of a stripe, we can also tell which stripe is corrupted.
As for our test case, here is my plan.
1. If we use some part or all of stripe1, then we calculate the checksums of stripe1 and stripe2.
When the write request for file2 comes, we can calculate the checksums of stripe1 and stripe2 block by block
in the RMW process. If some block in stripe1 mismatches, then stripe1 is corrupted; if some block in stripe2 mismatches,
then stripe2 is corrupted.
In this case, we know stripe1 is corrupted, so we can use stripe2 and the parity to recover stripe1.
In my opinion, the checksums stored in the csum tree are indexed by byte number (bytenr). So we can calculate the checksum of stripe2
during the writing process and store it in the csum tree in the later end_io process.
cat /mnt/file1 > /dev/null
umount /mnt
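The per-stripe check described above can be sketched as follows (a hypothetical model only: crc32 stands in for the btrfs csum, and block/stripe sizes are scaled down; none of the names are kernel functions):

```python
# Per-block checksum verification of both data stripes in the RMW path.
import zlib

BLOCK = 64
BLOCKS_PER_STRIPE = 4

def block_csums(stripe: bytes) -> list[int]:
    """Per-block checksums, as the plan computes them during RMW."""
    return [zlib.crc32(stripe[i * BLOCK:(i + 1) * BLOCK])
            for i in range(BLOCKS_PER_STRIPE)]

def corrupted_stripes(stripes: dict, stored: dict) -> list[str]:
    """Compare each stripe's per-block csums against the stored ones."""
    return [name for name, data in stripes.items()
            if block_csums(data) != stored[name]]

size = BLOCK * BLOCKS_PER_STRIPE
good1, unused2 = b"\xee" * size, b"\x00" * size
stored = {"stripe1": block_csums(good1), "stripe2": block_csums(unused2)}

# The raw-device write corrupts stripe1; the RMW path can now identify it
# and rebuild it from stripe2 and parity before generating new parity.
on_disk = {"stripe1": b"\xff" * size, "stripe2": unused2}
assert corrupted_stripes(on_disk, stored) == ["stripe1"]
```

In this scheme, stripe1 would then be rebuilt from stripe2 and the parity before the new parity is computed and written.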
Thanks,
Flint
On 9/7/22 04:46, Qu Wenruo wrote:
>
>
> On 2022/9/7 16:24, hmsjwzb wrote:
>>
> [...]
>>> That's the idea to fix the destructive RMW, aka when doing sub-stripe
>>> write, we need to:
>>>
>>> 1. Collect needed data to do the recovery
>>> For data, it should be all csum inside the full stripe.
>>> For metadata, although the csum is inlined, we still need to find out
>>> which range has metadata, and this can be a little tricky.
>>
>> Hi Qu,
>>
>> As for the fix solution, I think the following method doesn't need an on-disk format change.
>
> I never mentioned we need an on-disk format change.
>
> I just mentioned some pitfalls you need to look after.
>
>>
>> For Data:
>> When we write to disk, we do checksum for the writing data.
>>
>> dev1 |...|DDUUUUUUUUUU|...|
>> dev2 |...|UUUUUUUUUUUU|...|
>> dev3 |...|PPPPPPPPPPPP|...|
>>
>> D:data Space:unused P:parity U:unused
>>
>> The checksum block will look like the following.
>> dev1 |...|CC |...|
>> dev2 |...| |...|
>> dev3 |...| |...|
>>
>> C:Checksum
>>
>> So if data is corrupted in the area we write, we can know it. But when the corruption happened in
>> the unused block. We can do nothing about it.
>
> We don't need to bother corruption happened in unused space at all.
>
> We only need to ensure our data and parity is correct.
>
> So if the data sectors are fine, although parity mismatches, it's not a
> problem, we just do the regular RMW, and correct parity will be
> re-calculated and write back to that disk.
>
>> The checksum of blocks marked with C will be calculated and inserted into checksum tree.
>
> Why insert? There is no insert needed at all.
>
> And furthermore, data csum insert happens way later, RAID56 layer should
> not bother to insert the csum.
>
> If you're talking about writing the first two sectors, they should not
> have csum at all, as data COW ensured we can only write data into unused
> space.
>
> Please make it clear what you're really wanting to do, using W for
> blocks to write, U for unused, C for old data which has csum.
>
> (Don't bother NODATASUM case for now).
>
> Thanks,
> Qu
>
>>
>> But what if we do the checksum of the full stripe.
>> |<--stripe1->|
>> dev1 |...|CCCCCCCCCCCC|...|
>> |<--stripe2->|
>> dev2 |...|CCCCCCCCCCCC|...|
>> dev3 |...| |...|
>>
>> So in the rmw process, we can do the following operation.
>> 1.Calculate the checksum of blocks in stripe1 and compare them with the checksum in checksum tree.
>> If mismatch then stripe1 is corrupted, otherwise stripe1 is good.
>> 2.We can do the same process for stripe2. So we can tell whether stripe2 is corrupted or good.
>> 3.In this condition, if parity check failed, we can know which stripe is corrupted and use the good
>> data to recover the bad.
>>
>> Thanks,
>> Flint
>>
>>> 2. Read out all data and stripe, including the range we're writing into
>>> Currently we skip the range we're going to write, but since we may
>>> need to do a recovery, we need the full stripe anyway.
>>>
>>> 3. Do full stripe verification before doing RMW.
>>> That's the core recovery, thankfully we should have very similar
>>> code existing already.
>>>
>>>>
>>>> [question]
>>>> I have noticed this patch.
>>>>
>>>> [PATCH PoC 0/9] btrfs: scrub: introduce a new family of ioctl, scrub_fs
>>>> Hi Qu,
>>>> Does some part of this patch aim to solve this problem?
>>>
>>> Nope, that's just to improve scrub for RAID56, nothing related to this
>>> destructive RMW thing at all.
>>>
>>> Thanks,
>>> Qu
>>>>
>>>> Thanks,
>>>> Flint
>>>>
>>>>
>>>> On 8/16/22 01:38, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2022/8/16 10:47, hmsjwzb wrote:
>>>>>> Hi Qu,
>>>>>>
>>>>>> Sorry for interrupting you so many times.
>>>>>>
>>>>>> As for
>>>>>> scrub level checks at RAID56 substripe write time.
>>>>>>
>>>>>> Is this feature available in latest linux-next branch?
>>>>>
>>>>> Nope, no one is working on that, thus no patches at all.
>>>>>
>>>>>> Or do I need to get the patches from the mailing list?
>>>>>> What is the core function of this feature?
>>>>>
>>>>> The following small script would explain it pretty well:
>>>>>
>>>>> mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
>>>>> mount $dev1 $mnt
>>>>>
>>>>> xfs_io -f -c "pwrite -S 0xee 0 64K" $mnt/file1
>>>>> sync
>>>>> umount $mnt
>>>>>
>>>>> # Corrupt data stripe 1 of full stripe of above 64K write
>>>>> xfs_io -f -c "pwrite -S 0xff 119865344 64K" $dev1
>>>>>
>>>>> mount $dev1 $mnt
>>>>>
>>>>> # Do a new write into data stripe 2,
>>>>> # We will trigger a RMW, which will use on-disk (corrupted) data to
>>>>> # generate new P/Q.
>>>>> xfs_io -f -c "pwrite -S 0xee 0 64K" -c sync $mnt/file2
>>>>>
>>>>> # Now we can no longer read file1, as its data is corrupted, and
>>>>> # above write generated new P/Q using corrupted data stripe 1,
>>>>> # preventing us to recover the data stripe 1.
>>>>> cat $mnt/file1 > /dev/null
>>>>> umount $mnt
>>>>>
>>>>> The above script is the best way to demonstrate the "destructive RMW".
>>>>> Although this is not btrfs specific (other RAID56 implementations are also affected),
>>>>> it's definitely a real problem.
>>>>>
>>>>> There are several different directions to solve it:
>>>>>
>>>>> - A way to add CSUM for P/Q stripes
>>>>> In theory this should be the easiest way implementation wise.
>>>>> We can easily know if a P/Q stripe is correct, then before doing
>>>>> RMW, we verify the result of P/Q.
>>>>> If the result doesn't match, we know some data stripe(s) are
>>>>> corrupted, then rebuild the data first before write.
>>>>>
>>>>> Unfortunately, this needs an on-disk format change.
>>>>>
>>>>> - Full stripe verification before writes
>>>>> This means, before we submit sub-stripe writes, we use some scrub like
>>>>> method to verify all data stripes first.
>>>>> Then we can do recovery if needed, then do writes.
>>>>>
>>>>> Unfortunately, scrub-like checks have quite some limitations.
>>>>> Regular scrub only works on RO block groups, thus the extent tree and csum
>>>>> tree are consistent.
>>>>> But for RAID56 writes, we have no such luxury; I'm not 100% sure if
>>>>> this can even pass stress tests.
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>>>
>>>>>> I think I may use qemu and gdb to get a basic understanding of this feature.
>>>>>>
>>>>>> Thanks,
>>>>>> Flint
>>>>>>
>>>>>> On 8/15/22 04:54, Qu Wenruo wrote:
>>>>>>> scrub level checks at RAID56 substripe write time.
* Re: some help for improvement in btrfs
2022-09-08 9:21 ` hmsjwzb
@ 2022-09-08 9:28 ` Qu Wenruo
0 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2022-09-08 9:28 UTC (permalink / raw)
To: hmsjwzb, linux-btrfs@vger.kernel.org
On 2022/9/8 17:21, hmsjwzb wrote:
> Hi Qu,
>
> Thanks for your reply. Sorry for the ambiguity and inaccuracy in my last mail.
> This email is intended to express my idea in detail.
>
> [Main idea]
>
> Calculate the checksum of all blocks in a full stripe. Use these checksums to
> tell which stripe is corrupted.
That means an on-disk format change first.
>
>
> Let's go back to the destructive RMW.
>
> [test case]
>
> mkfs.btrfs -f -m raid5 -d raid5 -b 1G /dev/vda /dev/vdb /dev/vdc
> mount /dev/vda /mnt
> xfs_io -f -c "pwrite -S 0xee 0 64k" /mnt/file1
> sync
> umount /dev/vda
> xfs_io -f -c "pwrite -S 0xff 119865344 64k" /dev/vda
> mount /dev/vda /mnt
> xfs_io -f -c "pwrite -S 0xee 0 64k" -c sync /mnt/file2
>
> At this point, the layout of devices is as follows.
>
> |<---stripe1---->|
> vda |...|CCCCCCCCCCCCCCCC|...|
>
> |<---stripe2---->|
> vdb |...|UUUUUUUUUUUUUUUU|...|
>
> |<---parity----->|
> vdc |...|PPPPPPPPPPPPPPPP|...|
>
> C:corrupted U:unused P:parity
>
> Before the data of file2 written to the stripe2 of vdb, we can still
> recover the data of stripe1 by stripe2 and parity.
> But Here is the problem.
>
> How can we know which stripe is corrupted?
> We need checksum for whole stripe.
That's unnecessary.
If we read that file1, we know it's corrupted and will trigger recovery.
So before writing the 2nd data stripe, we can search for the checksums of the
full stripe (which would only find the csums for data stripe 1),
and read out all stripes (data stripe 1, data stripe 2, parity stripe).
Then we can compare the checksums against what we found (even though the csums
only cover data stripe 1).
For the ranges where we have csums, we can verify whether they match.
If not, we try recovery using data stripe 2 and parity, and check again.
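That verify-then-recover flow can be sketched roughly as follows, for the test case's layout where only data stripe 1 has stored csums (crc32 and all names here are illustrative stand-ins, not kernel code):

```python
# Verify data stripe 1 against its stored csum before a sub-stripe write;
# rebuild it from data stripe 2 and XOR parity on mismatch.
import zlib

def prepare_rmw(stripe1: bytes, stripe2: bytes, parity: bytes,
                stored_csum: int) -> bytes:
    """Return a verified (or rebuilt) data stripe 1 before the RMW."""
    if zlib.crc32(stripe1) == stored_csum:
        return stripe1                 # on-disk data matches: plain RMW is safe
    # Mismatch: try recovery using data stripe 2 and parity, then re-check.
    rebuilt = bytes(a ^ b for a, b in zip(stripe2, parity))
    if zlib.crc32(rebuilt) != stored_csum:
        raise IOError("unrecoverable: more than one stripe is bad")
    return rebuilt

good = b"\xee" * 64
unused = b"\x00" * 64
parity = bytes(a ^ b for a, b in zip(good, unused))

# The corrupted data stripe 1 is detected and rebuilt before the write,
# so the new parity is generated from correct data.
assert prepare_rmw(b"\xff" * 64, unused, parity, zlib.crc32(good)) == good
```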
>
>
>
> But If we calculate the checksum of all blocks of a stripe, we can also tell which stripe is corrupted.
> As for our test case, Here is my plan.
I see no obvious benefit.
You need an on-disk format change first, and then you also need extra
metadata updates to handle the full stripe update.
Note that parity is updated more frequently, so the metadata update
is not a small thing.
>
> 1. If we use some part or all of stripe1, then we calculate the checksum of stripe1 and stripe2.
>
> When the write request of file2 comes, we can calculate the checksum of stripe1 and stripe2 block by block
> in the RMW process. If some block in stripe1 mismatch, then stripe1 is corrupted. If some block in stripe2 mismatch
> then stripe2 is corrupted.
>
> In this case, we know stripe1 is corrupted, so we can use stripe2 and parity to recover stripe1.
>
> In my opinion, the checksum stored in checksum tree is by byte number. So we can calculate the checksum of stripe2
> during the writing process and store it to csum tree in later end_io process.
>
> cat /mnt/file1 > /dev/null
> umount /mnt
>
> Thanks,
> Flint
>
> On 9/7/22 04:46, Qu Wenruo wrote:
>>
>>
>> On 2022/9/7 16:24, hmsjwzb wrote:
>>>
>> [...]
>>>> That's the idea to fix the destructive RMW, aka when doing sub-stripe
>>>> write, we need to:
>>>>
>>>> 1. Collect needed data to do the recovery
>>>> For data, it should be all csum inside the full stripe.
>>>> For metadata, although the csum is inlined, we still need to find out
>>>> which range has metadata, and this can be a little tricky.
>>>
>>> Hi Qu,
>>>
>>> As for the fix solution, I think the following method doesn't need an on-disk format change.
>>
>> I never mentioned we need an on-disk format change.
>>
>> I just mentioned some pitfalls you need to look after.
>>
>>>
>>> For Data:
>>> When we write to disk, we do checksum for the writing data.
>>>
>>> dev1 |...|DDUUUUUUUUUU|...|
>>> dev2 |...|UUUUUUUUUUUU|...|
>>> dev3 |...|PPPPPPPPPPPP|...|
>>>
>>> D:data Space:unused P:parity U:unused
>>>
>>> The checksum block will look like the following.
>>> dev1 |...|CC |...|
>>> dev2 |...| |...|
>>> dev3 |...| |...|
>>>
>>> C:Checksum
>>>
>>> So if data is corrupted in the area we write, we can know it. But when the corruption happened in
>>> the unused block. We can do nothing about it.
>>
>> We don't need to bother corruption happened in unused space at all.
>>
>> We only need to ensure our data and parity is correct.
>>
>> So if the data sectors are fine, although parity mismatches, it's not a
>> problem, we just do the regular RMW, and correct parity will be
>> re-calculated and write back to that disk.
>>
>>> The checksum of blocks marked with C will be calculated and inserted into checksum tree.
>>
>> Why insert? There is no insert needed at all.
>>
>> And furthermore, data csum insert happens way later, RAID56 layer should
>> not bother to insert the csum.
>>
>> If you're talking about writing the first two sectors, they should not
>> have csum at all, as data COW ensured we can only write data into unused
>> space.
>>
>> Please make it clear what you're really wanting to do, using W for
>> blocks to write, U for unused, C for old data which has csum.
>>
>> (Don't bother NODATASUM case for now).
>>
>> Thanks,
>> Qu
>>
>>>
>>> But what if we do the checksum of the full stripe.
>>> |<--stripe1->|
>>> dev1 |...|CCCCCCCCCCCC|...|
>>> |<--stripe2->|
>>> dev2 |...|CCCCCCCCCCCC|...|
>>> dev3 |...| |...|
>>>
>>> So in the rmw process, we can do the following operation.
>>> 1.Calculate the checksum of blocks in stripe1 and compare them with the checksum in checksum tree.
>>> If mismatch then stripe1 is corrupted, otherwise stripe1 is good.
>>> 2.We can do the same process for stripe2. So we can tell whether stripe2 is corrupted or good.
>>> 3.In this condition, if parity check failed, we can know which stripe is corrupted and use the good
>>> data to recover the bad.
>>>
>>> Thanks,
>>> Flint
>>>
>>>> 2. Read out all data and stripe, including the range we're writing into
>>>> Currently we skip the range we're going to write, but since we may
>>>> need to do a recovery, we need the full stripe anyway.
>>>>
>>>> 3. Do full stripe verification before doing RMW.
>>>> That's the core recovery, thankfully we should have very similar
>>>> code existing already.
>>>>
>>>>>
>>>>> [question]
>>>>> I have noticed this patch.
>>>>>
>>>>> [PATCH PoC 0/9] btrfs: scrub: introduce a new family of ioctl, scrub_fs
>>>>> Hi Qu,
>>>>> Does some part of this patch aim to solve this problem?
>>>>
>>>> Nope, that's just to improve scrub for RAID56, nothing related to this
>>>> destructive RMW thing at all.
>>>>
>>>> Thanks,
>>>> Qu
>>>>>
>>>>> Thanks,
>>>>> Flint
>>>>>
>>>>>
>>>>> On 8/16/22 01:38, Qu Wenruo wrote:
>>>>>>
>>>>>>
>>>>>> On 2022/8/16 10:47, hmsjwzb wrote:
>>>>>>> Hi Qu,
>>>>>>>
>>>>>>> Sorry for interrupting you so many times.
>>>>>>>
>>>>>>> As for
>>>>>>> scrub level checks at RAID56 substripe write time.
>>>>>>>
>>>>>>> Is this feature available in latest linux-next branch?
>>>>>>
>>>>>> Nope, no one is working on that, thus no patches at all.
>>>>>>
>>>>>>> Or do I need to get the patches from the mailing list?
>>>>>>> What is the core function of this feature?
>>>>>>
>>>>>> The following small script would explain it pretty well:
>>>>>>
>>>>>> mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
>>>>>> mount $dev1 $mnt
>>>>>>
>>>>>> xfs_io -f -c "pwrite -S 0xee 0 64K" $mnt/file1
>>>>>> sync
>>>>>> umount $mnt
>>>>>>
>>>>>> # Corrupt data stripe 1 of full stripe of above 64K write
>>>>>> xfs_io -f -c "pwrite -S 0xff 119865344 64K" $dev1
>>>>>>
>>>>>> mount $dev1 $mnt
>>>>>>
>>>>>> # Do a new write into data stripe 2,
>>>>>> # We will trigger a RMW, which will use on-disk (corrupted) data to
>>>>>> # generate new P/Q.
>>>>>> xfs_io -f -c "pwrite -S 0xee 0 64K" -c sync $mnt/file2
>>>>>>
>>>>>> # Now we can no longer read file1, as its data is corrupted, and
>>>>>> # above write generated new P/Q using corrupted data stripe 1,
>>>>>> # preventing us to recover the data stripe 1.
>>>>>> cat $mnt/file1 > /dev/null
>>>>>> umount $mnt
>>>>>>
>>>>>> The above script is the best way to demonstrate the "destructive RMW".
>>>>>> Although this is not btrfs specific (other RAID56 implementations are also affected),
>>>>>> it's definitely a real problem.
>>>>>>
>>>>>> There are several different directions to solve it:
>>>>>>
>>>>>> - A way to add CSUM for P/Q stripes
>>>>>> In theory this should be the easiest way implementation wise.
>>>>>> We can easily know if a P/Q stripe is correct, then before doing
>>>>>> RMW, we verify the result of P/Q.
>>>>>> If the result doesn't match, we know some data stripe(s) are
>>>>>> corrupted, then rebuild the data first before write.
>>>>>>
>>>>>> Unfortunately, this needs an on-disk format change.
>>>>>>
>>>>>> - Full stripe verification before writes
>>>>>> This means, before we submit sub-stripe writes, we use some scrub like
>>>>>> method to verify all data stripes first.
>>>>>> Then we can do recovery if needed, then do writes.
>>>>>>
>>>>>> Unfortunately, scrub-like checks have quite some limitations.
>>>>>> Regular scrub only works on RO block groups, thus the extent tree and csum
>>>>>> tree are consistent.
>>>>>> But for RAID56 writes, we have no such luxury; I'm not 100% sure if
>>>>>> this can even pass stress tests.
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>
>>>>>>>
>>>>>>> I think I may use qemu and gdb to get a basic understanding of this feature.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Flint
>>>>>>>
>>>>>>> On 8/15/22 04:54, Qu Wenruo wrote:
>>>>>>>> scrub level checks at RAID56 substripe write time.
end of thread, other threads:[~2022-09-08 9:28 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <fb056073-5bd6-6143-9699-4a5af1bd496d@zoho.com>
[not found] ` <655f97cc-64e6-9f57-5394-58f9c3b83a6f@gmx.com>
[not found] ` <40b209eb-9048-da0c-e776-5e143ab38571@zoho.com>
[not found] ` <72a78cc0-4524-47e7-803c-7d094b8713ee@gmx.com>
[not found] ` <00984321-3006-764d-c29e-1304f89652ae@zoho.com>
[not found] ` <18300547-1811-e9da-252e-f9476dca078c@gmx.com>
[not found] ` <4691b710-3d71-bd26-d00a-66cc398f57c5@zoho.com>
2022-08-16 5:38 ` some help for improvement in btrfs Qu Wenruo
2022-09-06 8:02 ` hmsjwzb
2022-09-06 8:37 ` Qu Wenruo
2022-09-07 8:24 ` hmsjwzb
2022-09-07 8:46 ` Qu Wenruo
2022-09-08 9:21 ` hmsjwzb
2022-09-08 9:28 ` Qu Wenruo
2022-09-06 8:59 ` delete whole file system Kengo.M
2022-09-06 9:12 ` Hugo Mills
2022-09-06 10:28 ` Kengo.M