linux-btrfs.vger.kernel.org archive mirror
* [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles
@ 2022-11-14  0:26 Qu Wenruo
  2022-11-14  0:26 ` [PATCH 1/5] btrfs: use btrfs_dev_name() helper to handle missing devices better Qu Wenruo
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: Qu Wenruo @ 2022-11-14  0:26 UTC (permalink / raw)
  To: linux-btrfs

[DESTRUCTIVE RMW]
Btrfs (and generic) RAID56 is always vulnerable against destructive RMW.

Such a destructive RMW can happen when one of the data stripes is
corrupted, and then a sub-stripe write into the other data stripes
happens, like the following:


  Disk 1: data 1 |0x000000000000| <- Corrupted
  Disk 2: data 2 |0x000000000000|
  Disk 3: parity |0xffffffffffff|

In the above case, if we write data into disk 2, we will get something
like this:

  Disk 1: data 1 |0x000000000000| <- Corrupted
  Disk 2: data 2 |0x000000000000| <- New '0x00' writes
  Disk 3: parity |0x000000000000| <- New parity

Since the new parity is calculated using the new writes and the
corrupted data, we lose the chance to recover data stripe 1, and that
corruption will be there forever.

[SOLUTION]
This series will close the destructive RMW problem for RAID5 data
profiles by:

- Read the full stripe before doing sub-stripe writes.

- Also fetch the data csums for the data stripes

- Verify every data sector against its data checksum

  Now even with the above corruption, we know some data csums
  mismatch, and we will have an error bitmap to record such
  mismatches.
  We treat read errors (like a missing device) and data csum
  mismatches the same way.

  If there is no csum (NODATASUM or metadata), we just trust the data
  unconditionally.

  So we will have an error bitmap for above corrupted case like this:

  Disk 1: data 1: |XXXXXXX|
  Disk 2: data 2: |       |

- Rebuild just as usual
  Since data 1 has all its sectors marked as errors in the error bitmap,
  we rebuild the sectors of data stripe 1.

- Verify the repaired sectors against their csum
  If the csums match, we can continue; otherwise we error out.

- Continue the RMW cycle using the repaired sectors
  Now we have correct data and will re-calculate the correct parity.

[TODO]
- Iterate all RAID6 combinations
  Currently we don't try all combinations of RAID6 during the repair.
  Thus for RAID6 we treat it just like RAID5 in RMW.

  Currently the RAID6 recovery combinations are only exhausted in the
  recovery path, relying on the caller to increase the mirror number.

  Can be implemented later for RMW path.

- Write back the repaired sectors
  Currently we don't write back the repaired sectors, thus if we later
  read the corrupted sectors, we rely on the recovery path. Since the
  new parity was calculated using the recovered sectors, we can still
  get the correct data without any problem.

  But if we write back the repaired sectors during RMW, we can save the
  reader some time without going into recovery path at all.

  This is just a performance optimization, thus I believe it can be
  implemented later.

Qu Wenruo (5):
  btrfs: use btrfs_dev_name() helper to handle missing devices better
  btrfs: refactor btrfs_lookup_csums_range()
  btrfs: introduce a bitmap based csum range search function
  btrfs: raid56: prepare data checksums for later RMW data csum 
    verification
  btrfs: raid56: do data csum verification during RMW cycle

 fs/btrfs/check-integrity.c |   2 +-
 fs/btrfs/dev-replace.c     |  15 +--
 fs/btrfs/disk-io.c         |   2 +-
 fs/btrfs/extent-tree.c     |   2 +-
 fs/btrfs/extent_io.c       |   3 +-
 fs/btrfs/file-item.c       | 196 ++++++++++++++++++++++++++----
 fs/btrfs/file-item.h       |   8 +-
 fs/btrfs/inode.c           |   5 +-
 fs/btrfs/ioctl.c           |   4 +-
 fs/btrfs/raid56.c          | 243 ++++++++++++++++++++++++++++++++-----
 fs/btrfs/raid56.h          |  12 ++
 fs/btrfs/relocation.c      |   4 +-
 fs/btrfs/scrub.c           |  28 ++---
 fs/btrfs/super.c           |   2 +-
 fs/btrfs/tree-log.c        |  15 ++-
 fs/btrfs/volumes.c         |  16 +--
 fs/btrfs/volumes.h         |   9 ++
 17 files changed, 451 insertions(+), 115 deletions(-)

-- 
2.38.1




Thread overview: 14+ messages
2022-11-14  0:26 [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles Qu Wenruo
2022-11-14  0:26 ` [PATCH 1/5] btrfs: use btrfs_dev_name() helper to handle missing devices better Qu Wenruo
2022-11-14 16:06   ` Goffredo Baroncelli
2022-11-14 22:40     ` Qu Wenruo
2022-11-14 23:54       ` David Sterba
2022-11-14  0:26 ` [PATCH 2/5] btrfs: refactor btrfs_lookup_csums_range() Qu Wenruo
2022-11-14  0:26 ` [PATCH 3/5] btrfs: introduce a bitmap based csum range search function Qu Wenruo
2022-11-14  0:26 ` [PATCH 4/5] btrfs: raid56: prepare data checksums for later RMW data csum verification Qu Wenruo
2022-11-15  6:40   ` Dan Carpenter
2022-11-15 13:18     ` David Sterba
2022-11-14  0:26 ` [PATCH 5/5] btrfs: raid56: do data csum verification during RMW cycle Qu Wenruo
2022-11-14  1:59 ` [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles Qu Wenruo
2022-11-15 13:24 ` David Sterba
2022-11-16  3:21   ` Qu Wenruo
