From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles
Date: Mon, 14 Nov 2022 09:59:47 +0800 [thread overview]
Message-ID: <5fe30279-3816-ecfb-25e2-188ba49e6cec@suse.com> (raw)
In-Reply-To: <cover.1668384746.git.wqu@suse.com>
On 2022/11/14 08:26, Qu Wenruo wrote:
> [DESTRUCTIVE RMW]
> Btrfs (and generic) RAID56 is inherently vulnerable to destructive RMW.
>
> Such a destructive RMW can happen when one of the data stripes is
> corrupted and a sub-stripe write then lands on another data stripe, like
> the following:
>
>
> Disk 1: data 1 |0x000000000000| <- Corrupted
> Disk 2: data 2 |0x000000000000|
> Disk 3: parity |0xffffffffffff|
>
> In the above case, if we write data into disk 2, we will get something
> like this:
>
> Disk 1: data 1 |0x000000000000| <- Corrupted
> Disk 2: data 2 |0x000000000000| <- New '0x00' writes
> Disk 3: parity |0x000000000000| <- New parity.
>
> Since the new parity is calculated from the new writes and the
> corrupted data, we have lost the chance to recover data stripe 1, and
> that corruption is there forever.
>
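> To make the failure mode concrete, below is a tiny stand-alone sketch
> (plain user-space C, not btrfs code; just one-byte "stripes" and XOR
> parity) showing how recomputing parity from a silently corrupted stripe
> makes the corruption permanent:
>
>   #include <stdint.h>
>   #include <stdio.h>
>
>   int main(void)
>   {
>       uint8_t d1 = 0xaa;          /* data stripe 1, original content */
>       uint8_t d2 = 0x55;          /* data stripe 2, original content */
>       uint8_t p  = d1 ^ d2;       /* parity written at the same time */
>
>       d1 = 0x00;                  /* disk 1 gets silently corrupted */
>
>       /* Sub-stripe write to data 2: the old RMW reads d1 from disk
>        * (already corrupted) and recomputes parity from it. */
>       d2 = 0x00;                  /* new write into data stripe 2 */
>       p  = d1 ^ d2;               /* parity now encodes the corruption */
>
>       /* Any later recovery of data 1 from d2 + parity returns garbage. */
>       uint8_t recovered = d2 ^ p;
>       printf("recovered data 1 = 0x%02x (expected 0xaa)\n", recovered);
>       return 0;
>   }
>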
> [SOLUTION]
> This series will close the destructive RMW problem for RAID5 data
> profiles (a stand-alone sketch of the whole flow follows this list) by:
>
> - Read the full stripe before doing sub-stripe writes.
>
> - Also fetch the data csum for the data stripes:
>
> - Verify every data sector against its data checksum
>
> Now even with the above corruption, we know some data sectors fail
> their csum check, and such mismatches are recorded in an error bitmap.
> A read error (e.g. a missing device) and a data csum mismatch are
> treated the same way.
>
> If there is no csum (NODATASUM or metadata), we just trust the data
> unconditionally.
>
> So for the above corrupted case we will have an error bitmap like this:
>
> Disk 1: data 1: |XXXXXXX|
> Disk 2: data 2: | |
>
> - Rebuild just as usual
> Since data stripe 1 has all its sectors marked as errors in the
> bitmap, we rebuild every sector of data stripe 1.
>
> - Verify the repaired sectors against their csum
> If the csum matches, we can continue.
> Otherwise we error out.
>
> - Continue the RMW cycle using the repaired sectors
> Now we have correct data and will re-calculate the correct parity.
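>
> For reference, a stand-alone user-space sketch of this verify/repair
> flow (plain XOR parity over two data stripes, and a trivial per-sector
> sum as a stand-in for the real crc32c; nothing below is taken from the
> actual patches):
>
>   #include <stdint.h>
>   #include <stdio.h>
>   #include <string.h>
>
>   #define NR_SECTORS   4
>   #define SECTOR_SIZE  8
>
>   /* Stand-in for the real data csum (crc32c in btrfs). */
>   static uint8_t csum(const uint8_t *sec)
>   {
>       uint8_t sum = 0;
>       for (int i = 0; i < SECTOR_SIZE; i++)
>           sum += sec[i];
>       return sum;
>   }
>
>   int main(void)
>   {
>       uint8_t data[2][NR_SECTORS][SECTOR_SIZE];
>       uint8_t parity[NR_SECTORS][SECTOR_SIZE];
>       uint8_t expected[2][NR_SECTORS];        /* csums from the csum tree */
>       unsigned long error_bitmap[2] = {0, 0}; /* one bit per data sector */
>
>       /* Build the original full stripe, its csums and its parity. */
>       for (int s = 0; s < NR_SECTORS; s++) {
>           memset(data[0][s], 0xaa, SECTOR_SIZE);
>           memset(data[1][s], 0x55, SECTOR_SIZE);
>           expected[0][s] = csum(data[0][s]);
>           expected[1][s] = csum(data[1][s]);
>           for (int i = 0; i < SECTOR_SIZE; i++)
>               parity[s][i] = data[0][s][i] ^ data[1][s][i];
>       }
>
>       /* Silently corrupt every sector of data stripe 1. */
>       memset(data[0], 0, sizeof(data[0]));
>
>       /* Read the full stripe, verify every data sector against its
>        * csum, and record mismatches in the error bitmap. */
>       for (int d = 0; d < 2; d++)
>           for (int s = 0; s < NR_SECTORS; s++)
>               if (csum(data[d][s]) != expected[d][s])
>                   error_bitmap[d] |= 1UL << s;
>
>       /* Rebuild the bad sectors from the good stripe + parity, then
>        * re-verify them (single failure per position, RAID5 XOR). */
>       for (int s = 0; s < NR_SECTORS; s++) {
>           if (!(error_bitmap[0] & (1UL << s)))
>               continue;
>           for (int i = 0; i < SECTOR_SIZE; i++)
>               data[0][s][i] = data[1][s][i] ^ parity[s][i];
>           if (csum(data[0][s]) != expected[0][s])
>               printf("sector %d is unrepairable\n", s);
>       }
>
>       /* Continue the RMW cycle with the repaired data, so the new
>        * parity is calculated from correct content. */
>       for (int s = 0; s < NR_SECTORS; s++)
>           for (int i = 0; i < SECTOR_SIZE; i++)
>               parity[s][i] = data[0][s][i] ^ data[1][s][i];
>
>       printf("repaired byte = 0x%02x (expected 0xaa)\n", data[0][0][0]);
>       return 0;
>   }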
>
> [TODO]
> - Iterate all RAID6 combinations
> Currently we don't try all RAID6 recovery combinations during the
> repair, thus in RMW RAID6 is treated just like RAID5.
>
> Currently the RAID6 recovery combinations are only exhausted in the
> recovery path, relying on the caller to increase the mirror number.
>
> This can be implemented later for the RMW path (a rough enumeration
> sketch follows this TODO list).
>
> - Write back the repaired sectors
> Currently we don't write back the repaired sectors, thus when the
> corrupted sectors are later read we rely on the recovery path; since
> the new parity is calculated from the recovered sectors, we can still
> get the correct data without any problem.
>
> But if we write back the repaired sectors during RMW, later readers
> can skip the recovery path entirely and save time.
>
> This is just a performance optimization, thus I believe it can be
> implemented later.
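>
> Just to illustrate what iterating the RAID6 combinations could look
> like, a rough enumeration sketch for a full stripe with nr_data data
> stripes is shown below (nothing of this sort is in the series yet, and
> the names are made up):
>
>   #include <stdio.h>
>
>   /* Candidate failed-stripe combinations a future RAID6 RMW repair
>    * could try when the error bitmap alone is not conclusive. */
>   static void enumerate_combinations(int nr_data)
>   {
>       /* One data stripe bad: rebuild from the other data stripes + P. */
>       for (int faila = 0; faila < nr_data; faila++)
>           printf("try: data %d bad, rebuild from P\n", faila);
>
>       /* Two data stripes bad: rebuild from the rest + P + Q. */
>       for (int faila = 0; faila < nr_data; faila++)
>           for (int failb = faila + 1; failb < nr_data; failb++)
>               printf("try: data %d and %d bad, rebuild from P+Q\n",
>                      faila, failb);
>
>       /* One data stripe bad and P untrusted: rebuild from Q. */
>       for (int faila = 0; faila < nr_data; faila++)
>           printf("try: data %d and P bad, rebuild from Q\n", faila);
>   }
>
>   int main(void)
>   {
>       enumerate_combinations(3);   /* e.g. a 5-disk RAID6 */
>       return 0;
>   }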
>
> Qu Wenruo (5):
> btrfs: use btrfs_dev_name() helper to handle missing devices better
Please ignore this one; it has already been sent as an independent
patch.
The remaining 4 are the main patches.
Thanks,
Qu
> btrfs: refactor btrfs_lookup_csums_range()
> btrfs: introduce a bitmap based csum range search function
> btrfs: raid56: prepare data checksums for later RMW data csum
> verification
> btrfs: raid56: do data csum verification during RMW cycle
>
> fs/btrfs/check-integrity.c | 2 +-
> fs/btrfs/dev-replace.c | 15 +--
> fs/btrfs/disk-io.c | 2 +-
> fs/btrfs/extent-tree.c | 2 +-
> fs/btrfs/extent_io.c | 3 +-
> fs/btrfs/file-item.c | 196 ++++++++++++++++++++++++++----
> fs/btrfs/file-item.h | 8 +-
> fs/btrfs/inode.c | 5 +-
> fs/btrfs/ioctl.c | 4 +-
> fs/btrfs/raid56.c | 243 ++++++++++++++++++++++++++++++++-----
> fs/btrfs/raid56.h | 12 ++
> fs/btrfs/relocation.c | 4 +-
> fs/btrfs/scrub.c | 28 ++---
> fs/btrfs/super.c | 2 +-
> fs/btrfs/tree-log.c | 15 ++-
> fs/btrfs/volumes.c | 16 +--
> fs/btrfs/volumes.h | 9 ++
> 17 files changed, 451 insertions(+), 115 deletions(-)
>