From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles
Date: Mon, 14 Nov 2022 08:26:29 +0800
Message-ID: <cover.1668384746.git.wqu@suse.com>
[DESTRUCTIVE RMW]
Btrfs (and generic) RAID56 is always vulnerable to destructive RMW.
Such destructive RMW can happen when one of the data stripes is
corrupted and then a sub-stripe write lands on the other data stripes,
like the following:
Disk 1: data 1 |0x000000000000| <- Corrupted
Disk 2: data 2 |0x000000000000|
Disk 3: parity |0xffffffffffff|
In the above case, if we write data into disk 2, we will get something
like this:
Disk 1: data 1 |0x000000000000| <- Corrupted
Disk 2: data 2 |0x000000000000| <- New '0x00' writes
Disk 3: parity |0x000000000000| <- New parity.
Since the new parity is calculated using the new writes and the
corrupted data, we lose the chance to recover data stripe 1, and that
corruption will be there forever.
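To make the failure mode concrete, here is a tiny user-space demo (not
code from this series) of the same destructive RMW on a single byte per
stripe:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint8_t data1 = 0xff, data2 = 0x00;
            uint8_t parity = data1 ^ data2; /* 0xff: stripe consistent */

            data1 = 0x00;                   /* data 1 silently corrupted */

            /*
             * Sub-stripe write to data 2: RMW recomputes parity from
             * the on-disk stripes, including the corrupted data 1.
             */
            data2 = 0x00;
            parity = data1 ^ data2;         /* now 0x00 */

            /*
             * Recovering data 1 from parity now yields the corrupted
             * value instead of the original 0xff.
             */
            printf("recovered data 1 = 0x%02x\n", parity ^ data2);
            return 0;
    }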
[SOLUTION]
This series closes the destructive RMW problem for RAID5 data
profiles by:
- Read the full stripe before doing sub-stripe writes.
- Also fetch the data csums for the data stripes:
- Verify every data sector against its data checksum
Now even with the above corrupted data, we know some data csums
mismatch, and we will have an error bitmap recording those mismatches.
We treat read errors (e.g. a missing device) and data csum mismatches
the same way.
If there is no csum (NODATASUM or metadata), we just trust the data
unconditionally.
So for the above corrupted case, we will have an error bitmap like this:
Disk 1: data 1: |XXXXXXX|
Disk 2: data 2: |       |
- Rebuild just as usual
Since data 1 has all its sectors marked as errors in the error bitmap,
we rebuild all the sectors of data stripe 1.
- Verify the repaired sectors against their csums
If the csums match, we can continue; otherwise we error out.
- Continue the RMW cycle using the repaired sectors
Now we have correct data and will re-calculate the correct parity.
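Put together, the RMW cycle above looks roughly like the following
control-flow sketch. This is not the code from the series; every
struct, field and helper name below (rbio_sketch, read_full_stripe(),
sector_csum_matches(), ...) is hypothetical and only illustrates the
steps listed above:

    static int rmw_with_csum_verify(struct rbio_sketch *rbio)
    {
            int ret, i;

            /* Step 1: read the full stripe before the sub-stripe write. */
            ret = read_full_stripe(rbio);
            if (ret < 0)
                    return ret;

            /* Step 2: fetch csums, then verify every data sector. */
            lookup_data_csums(rbio);
            for (i = 0; i < rbio->nr_data_sectors; i++) {
                    /* No csum (NODATASUM/metadata): trust the data. */
                    if (rbio->has_csum[i] && !sector_csum_matches(rbio, i))
                            rbio->error_bitmap[i] = true;
            }

            /* Steps 3 and 4: rebuild every bad sector, then re-verify. */
            for (i = 0; i < rbio->nr_data_sectors; i++) {
                    if (!rbio->error_bitmap[i])
                            continue;
                    rebuild_sector(rbio, i);
                    if (rbio->has_csum[i] && !sector_csum_matches(rbio, i))
                            return -EIO;    /* repair failed, error out */
            }

            /* Step 5: parity is now calculated from verified data. */
            return recalc_parity_and_write(rbio);
    }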
[TODO]
- Iterate all RAID6 combinations
Currently we don't try all combinations for RAID6 during the repair,
thus in RMW we treat RAID6 just like RAID5.
The RAID6 recovery combinations are only exhausted in the recovery
path, relying on the caller to increase the mirror number.
This can be implemented later for the RMW path; see the sketch after
this list.
- Write back the repaired sectors
Currently we don't write back the repaired sectors. Thus if we later
read the corrupted sectors, we rely on the recovery path, and since
the new parity was calculated using the recovered sectors, we can
still get the correct data without any problem.
But if we wrote back the repaired sectors during RMW, we could save
the reader from going into the recovery path at all.
This is just a performance optimization, thus I believe it can be
implemented later.
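For the first TODO item, exhausting the RAID6 recovery combinations
inside the RMW path could look roughly like the sketch below. Again,
the names are purely hypothetical: rebuild_two_stripes() rebuilds the
pair of stripes assumed bad, data_csums_all_match() re-runs the csum
verification, and failb == rbio->nr_data stands for treating the P
stripe as the second failure:

    static int raid6_try_all_combinations(struct rbio_sketch *rbio)
    {
            int faila, failb;

            /*
             * Try every pair of stripes assumed bad until the rebuilt
             * data passes its csum verification.
             */
            for (faila = 0; faila < rbio->nr_data; faila++) {
                    for (failb = faila + 1; failb <= rbio->nr_data; failb++) {
                            rebuild_two_stripes(rbio, faila, failb);
                            if (data_csums_all_match(rbio))
                                    return 0;
                    }
            }
            return -EIO;    /* no combination produced valid data */
    }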
Qu Wenruo (5):
btrfs: use btrfs_dev_name() helper to handle missing devices better
btrfs: refactor btrfs_lookup_csums_range()
btrfs: introduce a bitmap based csum range search function
btrfs: raid56: prepare data checksums for later RMW data csum
verification
btrfs: raid56: do data csum verification during RMW cycle
fs/btrfs/check-integrity.c | 2 +-
fs/btrfs/dev-replace.c | 15 +--
fs/btrfs/disk-io.c | 2 +-
fs/btrfs/extent-tree.c | 2 +-
fs/btrfs/extent_io.c | 3 +-
fs/btrfs/file-item.c | 196 ++++++++++++++++++++++++++----
fs/btrfs/file-item.h | 8 +-
fs/btrfs/inode.c | 5 +-
fs/btrfs/ioctl.c | 4 +-
fs/btrfs/raid56.c | 243 ++++++++++++++++++++++++++++++++-----
fs/btrfs/raid56.h | 12 ++
fs/btrfs/relocation.c | 4 +-
fs/btrfs/scrub.c | 28 ++---
fs/btrfs/super.c | 2 +-
fs/btrfs/tree-log.c | 15 ++-
fs/btrfs/volumes.c | 16 +--
fs/btrfs/volumes.h | 9 ++
17 files changed, 451 insertions(+), 115 deletions(-)
--
2.38.1