From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: dsterba@suse.cz, Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH RFC 5/5] btrfs: raid56: do full stripe data checksum verification before RMW
Date: Wed, 26 Oct 2022 07:31:38 +0800
Message-ID: <21b3a262-7e66-179c-dbc1-b7fab5477734@gmx.com>
In-Reply-To: <20221025142140.GN5824@twin.jikos.cz>



On 2022/10/25 22:21, David Sterba wrote:
> On Fri, Oct 14, 2022 at 03:17:13PM +0800, Qu Wenruo wrote:
>> [BUG]
>> For the following small script, btrfs will be unable to recover the
>> content of file1:
>>
>>    mkfs.btrfs -f -m raid1 -d raid5 -b 1G $dev1 $dev2 $dev3
>>
>>    mount $dev1 $mnt
>>    xfs_io -f -c "pwrite -S 0xff 0 64k" -c sync $mnt/file1
>>    md5sum $mnt/file1
>>    umount $mnt
>>
>>    # Corrupt the above 64K data stripe.
>>    xfs_io -f -c "pwrite -S 0x00 323026944 64K" -c sync $dev3
>>    mount $dev1 $mnt
>>
>>    # Write a new 64K, which should be in the other data stripe
>>    # And this is a sub-stripe write, which will cause RMW
>>    xfs_io -f -c "pwrite 0 64k" -c sync $mnt/file2
>>    md5sum $mnt/file1
>>    umount $mnt
>>
>> The second md5sum above would not match the first one.
>>
>> [CAUSE]
>> There is a long-existing problem for raid56 (not limited to btrfs
>> raid56): if we already have some corrupted on-disk data and then
>> trigger a sub-stripe write (which needs an RMW cycle), it can cause
>> further damage to the P/Q stripe.
>>
>>    Disk 1: data 1 |0x000000000000| <- Corrupted
>>    Disk 2: data 2 |0x000000000000|
>>    Disk 3: parity |0xffffffffffff|
>>
>> In the above case, data 1 is already corrupted; the original data
>> should be 64KiB of 0xff.
>>
>> At this stage, if we read data 1 and it has a data checksum, we can
>> still recover it through the regular RAID56 recovery path.
>>
>> But if we now decide to write some data into data 2, we need to go
>> through RMW.
>> Say we want to write 64KiB of '0x00' into data 2; we then read the
>> on-disk data of data 1 and calculate the new parity, resulting in the
>> following layout:
>>
>>    Disk 1: data 1 |0x000000000000| <- Corrupted
>>    Disk 2: data 2 |0x000000000000| <- New '0x00' writes
>>    Disk 3: parity |0x000000000000| <- New parity
>>
>> But since the new parity is calculated using the *corrupted* data 1,
>> we can no longer recover the correct data of data 1.
>> Thus the corruption is there forever.
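>>
>> For illustration, a minimal user-space sketch of the same RMW cycle
>> with plain XOR parity (hypothetical buffers, not the btrfs code):
>>
>>    #include <stddef.h>
>>    #include <stdint.h>
>>    #include <string.h>
>>
>>    #define STRIPE_LEN (64 * 1024)
>>
>>    /* Sub-stripe write: write new data 2, reuse on-disk data 1. */
>>    static void rmw_write_data2(const uint8_t *data1_ondisk,
>>                                uint8_t *data2, uint8_t *parity,
>>                                const uint8_t *new_data2)
>>    {
>>            memcpy(data2, new_data2, STRIPE_LEN);
>>            /* The new parity is the XOR of whatever is on disk for data 1. */
>>            for (size_t i = 0; i < STRIPE_LEN; i++)
>>                    parity[i] = data1_ondisk[i] ^ data2[i];
>>    }
>>
>> With data1_ondisk already corrupted, the corruption is baked into the
>> new parity, so a later rebuild (data 2 XOR parity) returns the same
>> corrupted data 1 bytes.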
>>
>> [FIX]
>> To solve the above problem, this patch does full stripe data checksum
>> verification at RMW time.
>>
>> This involves the following changes:
>>
>> - For raid56_rmw_stripe() we need to read the full stripe
>>    Unlike the old behavior, which only reads out the missing part, we
>>    now have to read out the full stripe (including data and P/Q), so
>>    that we can do recovery later.
>>
>>    All the reads go directly into the rbio stripe sectors.
>>
>>    This does not affect the cached case, as a cached rbio always has
>>    all its data sectors cached. And since cached sectors are already
>>    the result of an RMW (which implies the same data csum check), they
>>    should always be correct.
>>
>> - Do data checksum verification for the full stripe
>>    The target is only the rbio stripe sectors.
>>
>>    The checksums are already filled in before we reach finish_rmw().
>>    Currently we only do the verification if there is no missing device.
>>
>>    The checksum verification does the following work (a rough sketch
>>    follows this list):
>>
>>    * Verify the data checksums of the sectors in a vertical stripe
>>      which have a data checksum.
>>
>>      Sectors which don't have a csum are considered fine.
>>
>>    * Check if we need to repair the vertical stripe
>>
>>      If there is no checksum mismatch, we can continue to the next
>>      vertical stripe.
>>      If there are more corrupted sectors than the tolerance allows, we
>>      error out, marking all the bios as failed, to avoid further
>>      corruption.
>>
>>    * Repair the vertical stripe
>>
>>    * Recheck the content of the failed sectors
>>
>>      If the repaired sectors still can not pass the checksum
>>      verification, we error out.
>>
>> This is a much stricter workflow, meaning we will now not write a full
>> stripe if it cannot pass the full stripe data verification.
>>
>> Obviously this will have a performance impact, as we are doing more
>> work during the RMW cycle:
>>
>> - Fetch the data checksum
>> - Do checksum verification for all data stripes
>> - Do recovery if needed
>> - Do checksum verification again
>>
>> But for full stripe writes we won't do it at all, thus for a fully
>> optimized RAID56 workload (always full stripe writes), there should be
>> no extra overhead.
>>
>> To me, the extra overhead looks reasonable, as data consistency is way
>> more important than performance.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/raid56.c | 279 +++++++++++++++++++++++++++++++++++++++++-----
>>   1 file changed, 251 insertions(+), 28 deletions(-)
>>
>> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
>> index 8f7e25001a2b..e51c07bd4c7b 100644
>> --- a/fs/btrfs/raid56.c
>> +++ b/fs/btrfs/raid56.c
>> @@ -1203,6 +1203,150 @@ static void bio_get_trace_info(struct btrfs_raid_bio *rbio, struct bio *bio,
>>   	trace_info->stripe_nr = -1;
>>   }
>>
>> +static void assign_one_failed(int failed, int *faila, int *failb)
>> +{
>> +	if (*faila < 0)
>> +		*faila = failed;
>> +	else if (*failb < 0)
>> +		*failb = failed;
>> +}
>> +
>> +static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
>> +			    int extra_faila, int extra_failb,
>> +			    void **pointers, void **unmap_array);
>> +
>> +static int rmw_validate_data_csums(struct btrfs_raid_bio *rbio)
>> +{
>> +	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
>> +	int sector_nr;
>> +	int tolerance;
>> +	void **pointers = NULL;
>> +	void **unmap_array = NULL;
>> +	int ret = 0;
>> +
>> +	/* No checksum, no need to check. */
>> +	if (!rbio->csum_buf || !rbio->csum_bitmap)
>> +		return 0;
>> +
>> +	/* No datacsum in the range. */
>> +	if (bitmap_empty(rbio->csum_bitmap,
>> +			 rbio->nr_data * rbio->stripe_nsectors))
>> +		return 0;
>> +
>> +	pointers = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS);
>> +	unmap_array = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS);
>> +	if (!pointers || !unmap_array) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	if (rbio->bioc->map_type & BTRFS_BLOCK_GROUP_RAID5)
>> +		tolerance = 1;
>> +	else
>> +		tolerance = 2;
>
> Please avoid the "if (...map & TYPE)" checks for values that we have
> in the raid attribute table, in this case it's tolerated_failures.

In fact, we already have one: it's rbio->bioc->max_errors.

I'll go that way instead.
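
Something like this (untested, just a sketch):

	/* The tolerance we already carry in the bioc: */
	tolerance = rbio->bioc->max_errors;

	/* Or from the raid attribute table you mentioned: */
	tolerance = btrfs_raid_array[btrfs_bg_flags_to_raid_index(
				rbio->bioc->map_type)].tolerated_failures;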

Thanks,
Qu
>
>> +
>> +	for (sector_nr = 0; sector_nr < rbio->stripe_nsectors; sector_nr++) {
>> +		int stripe_nr;
>> +		int found_error = 0;
>> [...]
>
>> +static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
>> +			    int extra_faila, int extra_failb,
>> +			    void **pointers, void **unmap_array)
>>   {
>>   	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
>>   	struct sector_ptr *sector;
>>   	const u32 sectorsize = fs_info->sectorsize;
>> -	const int faila = rbio->faila;
>> -	const int failb = rbio->failb;
>> +	int faila = -1;
>> +	int failb = -1;
>> +	int failed = 0;
>> +	int tolerance;
>>   	int stripe_nr;
>>
>> +	if (rbio->bioc->map_type & BTRFS_BLOCK_GROUP_RAID5)
>> +		tolerance = 1;
>> +	else
>> +		tolerance = 2;
>
> And here.
>
>> +
>> +	failed = calculate_failab(rbio, extra_faila, extra_failb, &faila, &failb);
>> +
>> +	if (failed > tolerance)
>> +		return -EIO;
>> +
>>   	/*
>>   	 * Now we just use bitmap to mark the horizontal stripes in
>>   	 * which we have data when doing parity scrub.
>>   	 */
>>   	if (rbio->operation == BTRFS_RBIO_PARITY_SCRUB &&
>>   	    !test_bit(sector_nr, &rbio->dbitmap))
>> -		return;
>> +		return 0;
>>
>>   	/*
>>   	 * Setup our array of pointers with sectors from each stripe
