* [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles
@ 2022-11-14 0:26 Qu Wenruo
2022-11-14 0:26 ` [PATCH 1/5] btrfs: use btrfs_dev_name() helper to handle missing devices better Qu Wenruo
` (6 more replies)
0 siblings, 7 replies; 14+ messages in thread
From: Qu Wenruo @ 2022-11-14 0:26 UTC (permalink / raw)
To: linux-btrfs
[DESTRUCTIVE RMW]
Btrfs RAID56 (like any generic RAID56) is vulnerable to destructive RMW.
Such destructive RMW happens when one of the data stripes is corrupted,
and then a sub-stripe write lands on another data stripe, like the
following:
Disk 1: data 1 |0x000000000000| <- Corrupted
Disk 2: data 2 |0x000000000000|
Disk 3: parity |0xffffffffffff|
In the above case, if we write data into disk 2, we will get something
like this:
Disk 1: data 1 |0x000000000000| <- Corrupted
Disk 2: data 2 |0x000000000000| <- New '0x00' writes
Disk 3: parity |0x000000000000| <- New parity
Since the new parity is calculated from the new write and the corrupted
data, we lose the chance to recover data stripe 1, and that corruption
is there forever.
[SOLUTION]
This series will close the destructive RMW problem for RAID5 data
profiles by:
- Read the full stripe before doing sub-stripe writes.
- Also fetch the data csum for the data stripes:
- Verify every data sector against their data checksum
Now even with the above corruption, we know some data csums mismatch,
and we record such mismatches in an error bitmap.
A read error (e.g. a missing device) and a data csum mismatch are
treated the same.
If there is no csum (NODATASUM or metadata), we just trust the data
unconditionally.
So the error bitmap for the above corrupted case looks like this:
Disk 1: data 1: |XXXXXXX|
Disk 2: data 2: | |
- Rebuild just as usual
Since data 1 has all its sectors marked error in the error bitmap,
we rebuild the sectors of data stripe 1.
- Verify the repaired sectors against their csum
If the csums match, we can continue; otherwise we error out.
- Continue the RMW cycle using the repaired sectors
Now we have correct data and will re-calculate the correct parity.
[TODO]
- Iterate all RAID6 combinations
Currently we don't try all RAID6 recovery combinations during the
repair; in the RMW path, RAID6 is treated just like RAID5.
The RAID6 recovery combinations are only exhausted in the recovery
path, relying on the caller to increase the mirror number.
This can be implemented later for the RMW path.
- Write back the repaired sectors
Currently we don't write back the repaired sectors, so a later read of
the corrupted sectors still relies on the recovery path; since the new
parity is calculated from the recovered sectors, we can still get the
correct data without any problem.
But if we write back the repaired sectors during RMW, readers can skip
the recovery path entirely.
This is just a performance optimization, thus I believe it can be
implemented later.
Qu Wenruo (5):
btrfs: use btrfs_dev_name() helper to handle missing devices better
btrfs: refactor btrfs_lookup_csums_range()
btrfs: introduce a bitmap based csum range search function
btrfs: raid56: prepare data checksums for later RMW data csum
verification
btrfs: raid56: do data csum verification during RMW cycle
fs/btrfs/check-integrity.c | 2 +-
fs/btrfs/dev-replace.c | 15 +--
fs/btrfs/disk-io.c | 2 +-
fs/btrfs/extent-tree.c | 2 +-
fs/btrfs/extent_io.c | 3 +-
fs/btrfs/file-item.c | 196 ++++++++++++++++++++++++++----
fs/btrfs/file-item.h | 8 +-
fs/btrfs/inode.c | 5 +-
fs/btrfs/ioctl.c | 4 +-
fs/btrfs/raid56.c | 243 ++++++++++++++++++++++++++++++++-----
fs/btrfs/raid56.h | 12 ++
fs/btrfs/relocation.c | 4 +-
fs/btrfs/scrub.c | 28 ++---
fs/btrfs/super.c | 2 +-
fs/btrfs/tree-log.c | 15 ++-
fs/btrfs/volumes.c | 16 +--
fs/btrfs/volumes.h | 9 ++
17 files changed, 451 insertions(+), 115 deletions(-)
--
2.38.1
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/5] btrfs: use btrfs_dev_name() helper to handle missing devices better
2022-11-14 0:26 [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles Qu Wenruo
@ 2022-11-14 0:26 ` Qu Wenruo
2022-11-14 16:06 ` Goffredo Baroncelli
2022-11-14 0:26 ` [PATCH 2/5] btrfs: refactor btrfs_lookup_csums_range() Qu Wenruo
` (5 subsequent siblings)
6 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2022-11-14 0:26 UTC (permalink / raw)
To: linux-btrfs
[BUG]
If dev-replace fails to reconstruct its data/metadata, the kernel
message is incorrect for the missing device:
BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev (efault)
Note the "dev (efault)" in the second line, while the first line
properly reports "<missing disk>".
[CAUSE]
Although dev-replace uses btrfs_dev_name(), the heavy lifting is still
done by scrub (scrub is reused by both dev-replace and regular scrub).
Unfortunately the scrub code never uses the btrfs_dev_name() helper, as
it's only declared locally inside dev-replace.c.
[FIX]
Fix the output by:
- Move the btrfs_dev_name() helper to volumes.h
- Use btrfs_dev_name() to replace open-coded rcu_str_deref() calls
Only zoned code is not touched, as I'm not familiar with degraded
zoned code.
Now the output looks pretty sane:
BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev <missing disk>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/check-integrity.c | 2 +-
fs/btrfs/dev-replace.c | 15 +++------------
fs/btrfs/disk-io.c | 2 +-
fs/btrfs/extent-tree.c | 2 +-
fs/btrfs/extent_io.c | 3 +--
fs/btrfs/ioctl.c | 4 ++--
fs/btrfs/scrub.c | 20 +++++++++-----------
fs/btrfs/super.c | 2 +-
fs/btrfs/volumes.c | 16 ++++++++--------
fs/btrfs/volumes.h | 9 +++++++++
10 files changed, 36 insertions(+), 39 deletions(-)
diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 7ff0703ef3e4..82e49d985019 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -757,7 +757,7 @@ static int btrfsic_process_superblock_dev_mirror(
btrfs_info_in_rcu(fs_info,
"new initial S-block (bdev %p, %s) @%llu (%pg/%llu/%d)",
superblock_bdev,
- rcu_str_deref(device->name), dev_bytenr,
+ btrfs_dev_name(device), dev_bytenr,
dev_state->bdev, dev_bytenr,
superblock_mirror_num);
list_add(&superblock_tmp->all_blocks_node,
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 9c4a8649a0f4..78696d331639 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -18,7 +18,6 @@
#include "volumes.h"
#include "async-thread.h"
#include "check-integrity.h"
-#include "rcu-string.h"
#include "dev-replace.h"
#include "sysfs.h"
#include "zoned.h"
@@ -451,14 +450,6 @@ int btrfs_run_dev_replace(struct btrfs_trans_handle *trans)
return ret;
}
-static char* btrfs_dev_name(struct btrfs_device *device)
-{
- if (!device || test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
- return "<missing disk>";
- else
- return rcu_str_deref(device->name);
-}
-
static int mark_block_group_to_copy(struct btrfs_fs_info *fs_info,
struct btrfs_device *src_dev)
{
@@ -674,7 +665,7 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
"dev_replace from %s (devid %llu) to %s started",
btrfs_dev_name(src_device),
src_device->devid,
- rcu_str_deref(tgt_device->name));
+ btrfs_dev_name(tgt_device));
/*
* from now on, the writes to the srcdev are all duplicated to
@@ -933,7 +924,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
"btrfs_scrub_dev(%s, %llu, %s) failed %d",
btrfs_dev_name(src_device),
src_device->devid,
- rcu_str_deref(tgt_device->name), scrub_ret);
+ btrfs_dev_name(tgt_device), scrub_ret);
error:
up_write(&dev_replace->rwsem);
mutex_unlock(&fs_info->chunk_mutex);
@@ -951,7 +942,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
"dev_replace from %s (devid %llu) to %s finished",
btrfs_dev_name(src_device),
src_device->devid,
- rcu_str_deref(tgt_device->name));
+ btrfs_dev_name(tgt_device));
clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &tgt_device->dev_state);
tgt_device->devid = src_device->devid;
src_device->devid = BTRFS_DEV_REPLACE_DEVID;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 559e7f3d727e..91a088210e5a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3944,7 +3944,7 @@ static void btrfs_end_super_write(struct bio *bio)
if (bio->bi_status) {
btrfs_warn_rl_in_rcu(device->fs_info,
"lost page write due to IO error on %s (%d)",
- rcu_str_deref(device->name),
+ btrfs_dev_name(device),
blk_status_to_errno(bio->bi_status));
ClearPageUptodate(page);
SetPageError(page);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 10cc757a602b..17f599027c3d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6048,7 +6048,7 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed)
btrfs_warn_in_rcu(fs_info,
"ignoring attempt to trim beyond device size: offset %llu length %llu device %s device size %llu",
start, end - start + 1,
- rcu_str_deref(device->name),
+ btrfs_dev_name(device),
device->total_bytes);
mutex_unlock(&fs_info->chunk_mutex);
ret = 0;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 704ae7c08867..65ba5c3658cf 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -611,8 +611,7 @@ static int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
btrfs_info_rl_in_rcu(fs_info,
"read error corrected: ino %llu off %llu (dev %s sector %llu)",
- ino, start,
- rcu_str_deref(dev->name), sector);
+ ino, start, btrfs_dev_name(dev), sector);
ret = 0;
out_bio_uninit:
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7a9e697d9931..bed74a3ff574 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1228,7 +1228,7 @@ static noinline int btrfs_ioctl_resize(struct file *file,
if (ret == 0 && new_size != old_size)
btrfs_info_in_rcu(fs_info,
"resize device %s (devid %llu) from %llu to %llu",
- rcu_str_deref(device->name), device->devid,
+ btrfs_dev_name(device), device->devid,
old_size, new_size);
out_finish:
btrfs_exclop_finish(fs_info);
@@ -2860,7 +2860,7 @@ static long btrfs_ioctl_dev_info(struct btrfs_fs_info *fs_info,
di_args->total_bytes = btrfs_device_get_total_bytes(dev);
memcpy(di_args->uuid, dev->uuid, sizeof(di_args->uuid));
if (dev->name) {
- strncpy(di_args->path, rcu_str_deref(dev->name),
+ strncpy(di_args->path, btrfs_dev_name(dev),
sizeof(di_args->path) - 1);
di_args->path[sizeof(di_args->path) - 1] = 0;
} else {
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 288a97992aa4..0ce5b773867f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -17,7 +17,6 @@
#include "extent_io.h"
#include "dev-replace.h"
#include "check-integrity.h"
-#include "rcu-string.h"
#include "raid56.h"
#include "block-group.h"
#include "zoned.h"
@@ -877,7 +876,7 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
btrfs_warn_in_rcu(fs_info,
"%s at logical %llu on dev %s, physical %llu, root %llu, inode %llu, offset %llu, length %u, links %u (path: %s)",
swarn->errstr, swarn->logical,
- rcu_str_deref(swarn->dev->name),
+ btrfs_dev_name(swarn->dev),
swarn->physical,
root, inum, offset,
fs_info->sectorsize, nlink,
@@ -891,7 +890,7 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
btrfs_warn_in_rcu(fs_info,
"%s at logical %llu on dev %s, physical %llu, root %llu, inode %llu, offset %llu: path resolving failed with ret=%d",
swarn->errstr, swarn->logical,
- rcu_str_deref(swarn->dev->name),
+ btrfs_dev_name(swarn->dev),
swarn->physical,
root, inum, offset, ret);
@@ -922,8 +921,7 @@ static void scrub_print_warning(const char *errstr, struct scrub_block *sblock)
/* Super block error, no need to search extent tree. */
if (sblock->sectors[0]->flags & BTRFS_EXTENT_FLAG_SUPER) {
btrfs_warn_in_rcu(fs_info, "%s on device %s, physical %llu",
- errstr, rcu_str_deref(dev->name),
- sblock->physical);
+ errstr, btrfs_dev_name(dev), sblock->physical);
return;
}
path = btrfs_alloc_path();
@@ -954,7 +952,7 @@ static void scrub_print_warning(const char *errstr, struct scrub_block *sblock)
btrfs_warn_in_rcu(fs_info,
"%s at logical %llu on dev %s, physical %llu: metadata %s (level %d) in tree %llu",
errstr, swarn.logical,
- rcu_str_deref(dev->name),
+ btrfs_dev_name(dev),
swarn.physical,
ref_level ? "node" : "leaf",
ret < 0 ? -1 : ref_level,
@@ -1377,7 +1375,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
spin_unlock(&sctx->stat_lock);
btrfs_err_rl_in_rcu(fs_info,
"fixed up error at logical %llu on dev %s",
- logical, rcu_str_deref(dev->name));
+ logical, btrfs_dev_name(dev));
}
} else {
did_not_correct_error:
@@ -1386,7 +1384,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
spin_unlock(&sctx->stat_lock);
btrfs_err_rl_in_rcu(fs_info,
"unable to fixup (regular) error at logical %llu on dev %s",
- logical, rcu_str_deref(dev->name));
+ logical, btrfs_dev_name(dev));
}
out:
@@ -2332,14 +2330,14 @@ static void scrub_missing_raid56_worker(struct work_struct *work)
spin_unlock(&sctx->stat_lock);
btrfs_err_rl_in_rcu(fs_info,
"IO error rebuilding logical %llu for dev %s",
- logical, rcu_str_deref(dev->name));
+ logical, btrfs_dev_name(dev));
} else if (sblock->header_error || sblock->checksum_error) {
spin_lock(&sctx->stat_lock);
sctx->stat.uncorrectable_errors++;
spin_unlock(&sctx->stat_lock);
btrfs_err_rl_in_rcu(fs_info,
"failed to rebuild valid logical %llu for dev %s",
- logical, rcu_str_deref(dev->name));
+ logical, btrfs_dev_name(dev));
} else {
scrub_write_block_to_dev_replace(sblock);
}
@@ -4303,7 +4301,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
btrfs_err_in_rcu(fs_info,
"scrub on devid %llu: filesystem on %s is not writable",
- devid, rcu_str_deref(dev->name));
+ devid, btrfs_dev_name(dev));
ret = -EROFS;
goto out;
}
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d54bfec8e506..ea83dd9f735a 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2336,7 +2336,7 @@ static int btrfs_show_devname(struct seq_file *m, struct dentry *root)
* the end of RCU grace period.
*/
rcu_read_lock();
- seq_escape(m, rcu_str_deref(fs_info->fs_devices->latest_dev->name), " \t\n\\");
+ seq_escape(m, btrfs_dev_name(fs_info->fs_devices->latest_dev), " \t\n\\");
rcu_read_unlock();
return 0;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 49d6e79d4343..2d0950a17f48 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -941,7 +941,7 @@ static noinline struct btrfs_device *device_list_add(const char *path,
}
btrfs_info_in_rcu(NULL,
"devid %llu device path %s changed to %s scanned by %s (%d)",
- devid, rcu_str_deref(device->name),
+ devid, btrfs_dev_name(device),
path, current->comm,
task_pid_nr(current));
}
@@ -2101,7 +2101,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
if (btrfs_pinned_by_swapfile(fs_info, device)) {
btrfs_warn_in_rcu(fs_info,
"cannot remove device %s (devid %llu) due to active swapfile",
- rcu_str_deref(device->name), device->devid);
+ btrfs_dev_name(device), device->devid);
return -ETXTBSY;
}
@@ -6827,7 +6827,7 @@ static void btrfs_submit_dev_bio(struct btrfs_device *dev, struct bio *bio)
btrfs_debug_in_rcu(dev->fs_info,
"%s: rw %d 0x%x, sector=%llu, dev=%lu (%s id %llu), size=%u",
__func__, bio_op(bio), bio->bi_opf, bio->bi_iter.bi_sector,
- (unsigned long)dev->bdev->bd_dev, rcu_str_deref(dev->name),
+ (unsigned long)dev->bdev->bd_dev, btrfs_dev_name(dev),
dev->devid, bio->bi_iter.bi_size);
btrfsic_check_bio(bio);
@@ -7908,7 +7908,7 @@ static int update_dev_stat_item(struct btrfs_trans_handle *trans,
if (ret < 0) {
btrfs_warn_in_rcu(fs_info,
"error %d while searching for dev_stats item for device %s",
- ret, rcu_str_deref(device->name));
+ ret, btrfs_dev_name(device));
goto out;
}
@@ -7919,7 +7919,7 @@ static int update_dev_stat_item(struct btrfs_trans_handle *trans,
if (ret != 0) {
btrfs_warn_in_rcu(fs_info,
"delete too small dev_stats item for device %s failed %d",
- rcu_str_deref(device->name), ret);
+ btrfs_dev_name(device), ret);
goto out;
}
ret = 1;
@@ -7933,7 +7933,7 @@ static int update_dev_stat_item(struct btrfs_trans_handle *trans,
if (ret < 0) {
btrfs_warn_in_rcu(fs_info,
"insert dev_stats item for device %s failed %d",
- rcu_str_deref(device->name), ret);
+ btrfs_dev_name(device), ret);
goto out;
}
}
@@ -7998,7 +7998,7 @@ void btrfs_dev_stat_inc_and_print(struct btrfs_device *dev, int index)
return;
btrfs_err_rl_in_rcu(dev->fs_info,
"bdev %s errs: wr %u, rd %u, flush %u, corrupt %u, gen %u",
- rcu_str_deref(dev->name),
+ btrfs_dev_name(dev),
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_WRITE_ERRS),
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_READ_ERRS),
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_FLUSH_ERRS),
@@ -8018,7 +8018,7 @@ static void btrfs_dev_stat_print_on_load(struct btrfs_device *dev)
btrfs_info_in_rcu(dev->fs_info,
"bdev %s errs: wr %u, rd %u, flush %u, corrupt %u, gen %u",
- rcu_str_deref(dev->name),
+ btrfs_dev_name(dev),
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_WRITE_ERRS),
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_READ_ERRS),
btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_FLUSH_ERRS),
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index efa6a3d48cd8..db6b69c96c0d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -12,6 +12,7 @@
#include "async-thread.h"
#include "messages.h"
#include "disk-io.h"
+#include "rcu-string.h"
#define BTRFS_MAX_DATA_CHUNK_SIZE (10ULL * SZ_1G)
@@ -770,6 +771,14 @@ static inline void btrfs_dev_stat_set(struct btrfs_device *dev,
atomic_inc(&dev->dev_stats_ccnt);
}
+static inline char* btrfs_dev_name(struct btrfs_device *device)
+{
+ if (!device || test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
+ return "<missing disk>";
+ else
+ return rcu_str_deref(device->name);
+}
+
void btrfs_commit_device_sizes(struct btrfs_transaction *trans);
struct list_head * __attribute_const__ btrfs_get_fs_uuids(void);
--
2.38.1
* [PATCH 2/5] btrfs: refactor btrfs_lookup_csums_range()
2022-11-14 0:26 [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles Qu Wenruo
2022-11-14 0:26 ` [PATCH 1/5] btrfs: use btrfs_dev_name() helper to handle missing devices better Qu Wenruo
@ 2022-11-14 0:26 ` Qu Wenruo
2022-11-14 0:26 ` [PATCH 3/5] btrfs: introduce a bitmap based csum range search function Qu Wenruo
` (4 subsequent siblings)
6 siblings, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2022-11-14 0:26 UTC (permalink / raw)
To: linux-btrfs
The refactor involves the following parts:
- Introduce bytes_to_csum_size() and csum_size_to_bytes() helpers
We have quite a few open-coded calculations, some of which are even
split into two assignments just to fit the 80-character limit.
- Remove the @csum_size parameter from max_ordered_sum_bytes()
Csum size can be fetched from @fs_info.
And we will use the csum_size_to_bytes() helper anyway.
- Add a comment explaining how we handle the first search result
- Use newly introduced helpers to cleanup btrfs_lookup_csums_range()
- Move variables declaration to the minimal scope
- Never mix number of sectors with bytes
There are several locations doing things like:
size = min_t(size_t, csum_end - start,
max_ordered_sum_bytes(fs_info));
...
size >>= fs_info->sectorsize_bits
Or
offset = (start - key.offset) >> fs_info->sectorsize_bits;
offset *= csum_size;
Make sure those variables only ever represent bytes inside the
function, by using the above helpers.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/file-item.c | 69 ++++++++++++++++++++++++++++++--------------
1 file changed, 48 insertions(+), 21 deletions(-)
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 456f71b42a9c..652a9bc3b676 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -126,12 +126,26 @@ int btrfs_inode_clear_file_extent_range(struct btrfs_inode *inode, u64 start,
start + len - 1, EXTENT_DIRTY, NULL);
}
-static inline u32 max_ordered_sum_bytes(struct btrfs_fs_info *fs_info,
- u16 csum_size)
+static size_t bytes_to_csum_size(struct btrfs_fs_info *fs_info, u32 bytes)
{
- u32 ncsums = (PAGE_SIZE - sizeof(struct btrfs_ordered_sum)) / csum_size;
+ ASSERT(IS_ALIGNED(bytes, fs_info->sectorsize));
- return ncsums * fs_info->sectorsize;
+ return (bytes >> fs_info->sectorsize_bits) * fs_info->csum_size;
+}
+
+static size_t csum_size_to_bytes(struct btrfs_fs_info *fs_info, u32 csum_size)
+{
+ ASSERT(IS_ALIGNED(csum_size, fs_info->csum_size));
+
+ return (csum_size / fs_info->csum_size) << fs_info->sectorsize_bits;
+}
+
+static inline u32 max_ordered_sum_bytes(struct btrfs_fs_info *fs_info)
+{
+ int max_csum_size = round_down(PAGE_SIZE - sizeof(struct btrfs_ordered_sum),
+ fs_info->csum_size);
+
+ return csum_size_to_bytes(fs_info, max_csum_size);
}
/*
@@ -140,9 +154,8 @@ static inline u32 max_ordered_sum_bytes(struct btrfs_fs_info *fs_info,
*/
static int btrfs_ordered_sum_size(struct btrfs_fs_info *fs_info, unsigned long bytes)
{
- int num_sectors = (int)DIV_ROUND_UP(bytes, fs_info->sectorsize);
-
- return sizeof(struct btrfs_ordered_sum) + num_sectors * fs_info->csum_size;
+ return sizeof(struct btrfs_ordered_sum) +
+ bytes_to_csum_size(fs_info, bytes);
}
int btrfs_insert_hole_extent(struct btrfs_trans_handle *trans,
@@ -526,11 +539,7 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
struct btrfs_ordered_sum *sums;
struct btrfs_csum_item *item;
LIST_HEAD(tmplist);
- unsigned long offset;
int ret;
- size_t size;
- u64 csum_end;
- const u32 csum_size = fs_info->csum_size;
ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
IS_ALIGNED(end + 1, fs_info->sectorsize));
@@ -556,16 +565,33 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
if (ret > 0 && path->slots[0] > 0) {
leaf = path->nodes[0];
btrfs_item_key_to_cpu(leaf, &key, path->slots[0] - 1);
+
+ /*
+ * There are two cases we can hit here for the previous
+ * csum item.
+ *
+ * |<- search range ->|
+ * |<- csum item ->|
+ *
+ * Or
+ * |<- search range ->|
+ * |<- csum item ->|
+ *
+ * Check if the previous csum item covers the leading part
+ * of the search range.
+ * If so we have to start from previous csum item.
+ */
if (key.objectid == BTRFS_EXTENT_CSUM_OBJECTID &&
key.type == BTRFS_EXTENT_CSUM_KEY) {
- offset = (start - key.offset) >> fs_info->sectorsize_bits;
- if (offset * csum_size <
+ if (bytes_to_csum_size(fs_info, start - key.offset) <
btrfs_item_size(leaf, path->slots[0] - 1))
path->slots[0]--;
}
}
while (start <= end) {
+ u64 csum_end;
+
leaf = path->nodes[0];
if (path->slots[0] >= btrfs_header_nritems(leaf)) {
ret = btrfs_next_leaf(root, path);
@@ -585,8 +611,8 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
if (key.offset > start)
start = key.offset;
- size = btrfs_item_size(leaf, path->slots[0]);
- csum_end = key.offset + (size / csum_size) * fs_info->sectorsize;
+ csum_end = key.offset + csum_size_to_bytes(fs_info,
+ btrfs_item_size(leaf, path->slots[0]));
if (csum_end <= start) {
path->slots[0]++;
continue;
@@ -596,8 +622,11 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
item = btrfs_item_ptr(path->nodes[0], path->slots[0],
struct btrfs_csum_item);
while (start < csum_end) {
+ unsigned long offset;
+ size_t size;
+
size = min_t(size_t, csum_end - start,
- max_ordered_sum_bytes(fs_info, csum_size));
+ max_ordered_sum_bytes(fs_info));
sums = kzalloc(btrfs_ordered_sum_size(fs_info, size),
GFP_NOFS);
if (!sums) {
@@ -608,16 +637,14 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
sums->bytenr = start;
sums->len = (int)size;
- offset = (start - key.offset) >> fs_info->sectorsize_bits;
- offset *= csum_size;
- size >>= fs_info->sectorsize_bits;
+ offset = bytes_to_csum_size(fs_info, start - key.offset);
read_extent_buffer(path->nodes[0],
sums->sums,
((unsigned long)item) + offset,
- csum_size * size);
+ bytes_to_csum_size(fs_info, size));
- start += fs_info->sectorsize * size;
+ start += size;
list_add_tail(&sums->list, &tmplist);
}
path->slots[0]++;
--
2.38.1
* [PATCH 3/5] btrfs: introduce a bitmap based csum range search function
2022-11-14 0:26 [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles Qu Wenruo
2022-11-14 0:26 ` [PATCH 1/5] btrfs: use btrfs_dev_name() helper to handle missing devices better Qu Wenruo
2022-11-14 0:26 ` [PATCH 2/5] btrfs: refactor btrfs_lookup_csums_range() Qu Wenruo
@ 2022-11-14 0:26 ` Qu Wenruo
2022-11-14 0:26 ` [PATCH 4/5] btrfs: raid56: prepare data checksums for later RMW data csum verification Qu Wenruo
` (3 subsequent siblings)
6 siblings, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2022-11-14 0:26 UTC (permalink / raw)
To: linux-btrfs
Although we have an existing function, btrfs_lookup_csums_range(), to
find all data checksums for a range, it's based on a btrfs_ordered_sum
list.
For the incoming RAID56 data checksum verification at RMW time, we
don't want to waste time allocating temporary memory.
So this patch introduces a new helper, btrfs_lookup_csums_bitmap().
The new helper returns its result in a bitmap, which is a perfect fit
for the later RAID56 usage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/file-item.c | 127 +++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/file-item.h | 8 ++-
fs/btrfs/inode.c | 5 +-
fs/btrfs/relocation.c | 4 +-
fs/btrfs/scrub.c | 8 +--
fs/btrfs/tree-log.c | 15 +++--
6 files changed, 144 insertions(+), 23 deletions(-)
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 652a9bc3b676..0e4019a5d139 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -528,9 +528,9 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst
return ret;
}
-int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
- struct list_head *list, int search_commit,
- bool nowait)
+int btrfs_lookup_csums_list(struct btrfs_root *root, u64 start, u64 end,
+ struct list_head *list, int search_commit,
+ bool nowait)
{
struct btrfs_fs_info *fs_info = root->fs_info;
struct btrfs_key key;
@@ -662,6 +662,127 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
return ret;
}
+/*
+ * Do the same work as btrfs_lookup_csums_list(), the difference is in how
+ * we return the result.
+ *
+ * This version will set the corresponding bits in @csum_bitmap to represent
+ * there is a csum found.
+ * Each bit represents a sector. Thus caller should ensure @csum_buf passed
+ * in is large enough to contain all csums.
+ */
+int btrfs_lookup_csums_bitmap(struct btrfs_root *root, u64 start, u64 end,
+ u8 *csum_buf, unsigned long *csum_bitmap)
+{
+ struct btrfs_fs_info *fs_info = root->fs_info;
+ struct btrfs_key key;
+ struct btrfs_path *path;
+ struct extent_buffer *leaf;
+ struct btrfs_csum_item *item;
+ const u64 orig_start = start;
+ int ret;
+
+ ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
+ IS_ALIGNED(end + 1, fs_info->sectorsize));
+
+ path = btrfs_alloc_path();
+ if (!path)
+ return -ENOMEM;
+
+ key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
+ key.offset = start;
+ key.type = BTRFS_EXTENT_CSUM_KEY;
+
+ ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+ if (ret < 0)
+ goto fail;
+ if (ret > 0 && path->slots[0] > 0) {
+ leaf = path->nodes[0];
+ btrfs_item_key_to_cpu(leaf, &key, path->slots[0] - 1);
+
+ /*
+ * There are two cases we can hit here for the previous
+ * csum item.
+ *
+ * |<- search range ->|
+ * |<- csum item ->|
+ *
+ * Or
+ * |<- search range ->|
+ * |<- csum item ->|
+ *
+ * Check if the previous csum item covers the leading part
+ * of the search range.
+ * If so we have to start from previous csum item.
+ */
+ if (key.objectid == BTRFS_EXTENT_CSUM_OBJECTID &&
+ key.type == BTRFS_EXTENT_CSUM_KEY) {
+ if (bytes_to_csum_size(fs_info, start - key.offset) <
+ btrfs_item_size(leaf, path->slots[0] - 1))
+ path->slots[0]--;
+ }
+ }
+
+ while (start <= end) {
+ u64 csum_end;
+
+ leaf = path->nodes[0];
+ if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+ ret = btrfs_next_leaf(root, path);
+ if (ret < 0)
+ goto fail;
+ if (ret > 0)
+ break;
+ leaf = path->nodes[0];
+ }
+
+ btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+ if (key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
+ key.type != BTRFS_EXTENT_CSUM_KEY ||
+ key.offset > end)
+ break;
+
+ if (key.offset > start)
+ start = key.offset;
+
+ csum_end = key.offset + csum_size_to_bytes(fs_info,
+ btrfs_item_size(leaf, path->slots[0]));
+ if (csum_end <= start) {
+ path->slots[0]++;
+ continue;
+ }
+
+ csum_end = min(csum_end, end + 1);
+ item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+ struct btrfs_csum_item);
+ while (start < csum_end) {
+ unsigned long offset;
+ size_t size;
+ u8 *csum_dest = csum_buf + bytes_to_csum_size(fs_info,
+ start - orig_start);
+
+ size = min_t(size_t, csum_end - start, end + 1 - start);
+
+ offset = bytes_to_csum_size(fs_info, start - key.offset);
+
+ read_extent_buffer(path->nodes[0], csum_dest,
+ ((unsigned long)item) + offset,
+ bytes_to_csum_size(fs_info, size));
+
+ bitmap_set(csum_bitmap,
+ (start - orig_start) >> fs_info->sectorsize_bits,
+ size >> fs_info->sectorsize_bits);
+
+ start += size;
+ }
+ path->slots[0]++;
+ }
+ ret = 0;
+fail:
+ btrfs_free_path(path);
+ return ret;
+}
+
/*
* Calculate checksums of the data contained inside a bio.
*
diff --git a/fs/btrfs/file-item.h b/fs/btrfs/file-item.h
index ba12711cf933..95b371208b30 100644
--- a/fs/btrfs/file-item.h
+++ b/fs/btrfs/file-item.h
@@ -18,9 +18,11 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
struct btrfs_ordered_sum *sums);
blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
u64 offset, bool one_ordered);
-int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
- struct list_head *list, int search_commit,
- bool nowait);
+int btrfs_lookup_csums_list(struct btrfs_root *root, u64 start, u64 end,
+ struct list_head *list, int search_commit,
+ bool nowait);
+int btrfs_lookup_csums_bitmap(struct btrfs_root *root, u64 start, u64 end,
+ u8 *csum_buf, unsigned long *csum_bitmap);
void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
const struct btrfs_path *path,
struct btrfs_file_extent_item *fi,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 83898bca39d5..4248e6cabbdc 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1709,9 +1709,8 @@ static noinline int csum_exist_in_range(struct btrfs_fs_info *fs_info,
int ret;
LIST_HEAD(list);
- ret = btrfs_lookup_csums_range(csum_root, bytenr,
- bytenr + num_bytes - 1, &list, 0,
- nowait);
+ ret = btrfs_lookup_csums_list(csum_root, bytenr, bytenr + num_bytes - 1,
+ &list, 0, nowait);
if (ret == 0 && list_empty(&list))
return 0;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 56c8afa6f6d2..aa80e51bc8ca 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4357,8 +4357,8 @@ int btrfs_reloc_clone_csums(struct btrfs_inode *inode, u64 file_pos, u64 len)
disk_bytenr = file_pos + inode->index_cnt;
csum_root = btrfs_csum_root(fs_info, disk_bytenr);
- ret = btrfs_lookup_csums_range(csum_root, disk_bytenr,
- disk_bytenr + len - 1, &list, 0, false);
+ ret = btrfs_lookup_csums_list(csum_root, disk_bytenr,
+ disk_bytenr + len - 1, &list, 0, false);
if (ret)
goto out;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 0ce5b773867f..52b346795f66 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3238,9 +3238,9 @@ static int scrub_raid56_data_stripe_for_parity(struct scrub_ctx *sctx,
extent_dev = bioc->stripes[0].dev;
btrfs_put_bioc(bioc);
- ret = btrfs_lookup_csums_range(csum_root, extent_start,
- extent_start + extent_size - 1,
- &sctx->csum_list, 1, false);
+ ret = btrfs_lookup_csums_list(csum_root, extent_start,
+ extent_start + extent_size - 1,
+ &sctx->csum_list, 1, false);
if (ret) {
scrub_parity_mark_sectors_error(sparity, extent_start,
extent_size);
@@ -3464,7 +3464,7 @@ static int scrub_simple_mirror(struct scrub_ctx *sctx,
cur_logical;
if (extent_flags & BTRFS_EXTENT_FLAG_DATA) {
- ret = btrfs_lookup_csums_range(csum_root, cur_logical,
+ ret = btrfs_lookup_csums_list(csum_root, cur_logical,
cur_logical + scrub_len - 1,
&sctx->csum_list, 1, false);
if (ret)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index b2528d5694fb..c9a8cbe837fe 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -826,7 +826,7 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
btrfs_file_extent_num_bytes(eb, item);
}
- ret = btrfs_lookup_csums_range(root->log_root,
+ ret = btrfs_lookup_csums_list(root->log_root,
csum_start, csum_end - 1,
&ordered_sums, 0, false);
if (ret)
@@ -4392,9 +4392,9 @@ static noinline int copy_items(struct btrfs_trans_handle *trans,
csum_root = btrfs_csum_root(trans->fs_info, disk_bytenr);
disk_bytenr += extent_offset;
- ret = btrfs_lookup_csums_range(csum_root, disk_bytenr,
- disk_bytenr + extent_num_bytes - 1,
- &ordered_sums, 0, false);
+ ret = btrfs_lookup_csums_list(csum_root, disk_bytenr,
+ disk_bytenr + extent_num_bytes - 1,
+ &ordered_sums, 0, false);
if (ret)
goto out;
@@ -4587,10 +4587,9 @@ static int log_extent_csums(struct btrfs_trans_handle *trans,
/* block start is already adjusted for the file extent offset. */
csum_root = btrfs_csum_root(trans->fs_info, em->block_start);
- ret = btrfs_lookup_csums_range(csum_root,
- em->block_start + csum_offset,
- em->block_start + csum_offset +
- csum_len - 1, &ordered_sums, 0, false);
+ ret = btrfs_lookup_csums_list(csum_root, em->block_start + csum_offset,
+ em->block_start + csum_offset +
+ csum_len - 1, &ordered_sums, 0, false);
if (ret)
return ret;
--
2.38.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 4/5] btrfs: raid56: prepare data checksums for later RMW data csum verification
2022-11-14 0:26 [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles Qu Wenruo
` (2 preceding siblings ...)
2022-11-14 0:26 ` [PATCH 3/5] btrfs: introduce a bitmap based csum range search function Qu Wenruo
@ 2022-11-14 0:26 ` Qu Wenruo
2022-11-15 6:40 ` Dan Carpenter
2022-11-14 0:26 ` [PATCH 5/5] btrfs: raid56: do data csum verification during RMW cycle Qu Wenruo
` (2 subsequent siblings)
6 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2022-11-14 0:26 UTC (permalink / raw)
To: linux-btrfs
This is for later data verification at RMW time.
This patch will try to allocate the needed memory for a locked rbio if
the rbio is exclusively for data (we don't want to handle mixed block
groups yet).
The memory will be released when the rbio is finished.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/raid56.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/raid56.h | 12 ++++++++
2 files changed, 86 insertions(+)
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 11be5d0a7eab..242d3611d2dd 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -20,6 +20,7 @@
#include "volumes.h"
#include "raid56.h"
#include "async-thread.h"
+#include "file-item.h"
/* set when additional merges to this rbio are not allowed */
#define RBIO_RMW_LOCKED_BIT 1
@@ -835,6 +836,11 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
struct bio *cur = bio_list_get(&rbio->bio_list);
struct bio *extra;
+ kfree(rbio->csum_buf);
+ bitmap_free(rbio->csum_bitmap);
+ rbio->csum_buf = NULL;
+ rbio->csum_bitmap = NULL;
+
/*
* Clear the data bitmap, as the rbio may be cached for later usage.
 * Do this before unlock_stripe() so there will be no new bio
@@ -2048,6 +2054,67 @@ void raid56_parity_recover(struct bio *bio, struct btrfs_io_context *bioc,
start_async_work(rbio, recover_rbio_work);
}
+static void fill_data_csums(struct btrfs_raid_bio *rbio)
+{
+ struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
+ struct btrfs_root *csum_root = btrfs_csum_root(fs_info,
+ rbio->bioc->raid_map[0]);
+ const u64 start = rbio->bioc->raid_map[0];
+ const u32 len = (rbio->nr_data * rbio->stripe_nsectors) <<
+ fs_info->sectorsize_bits;
+ int ret;
+
+ /* The rbio should not have its csum buffer initialized. */
+ ASSERT(!rbio->csum_buf && !rbio->csum_bitmap);
+
+ /*
+ * Skip the csum search if:
+ *
+ * - The rbio doesn't belong to data block groups
+ * Then we are doing IO for tree blocks, no need to
+ * search csums.
+ *
+ * - The rbio belongs to mixed block groups
+ * This is to avoid deadlocks: we're already holding the
+ * full stripe lock, so if we trigger a metadata read that
+ * needs to do raid56 recovery, we will deadlock.
+ */
+ if (!(rbio->bioc->map_type & BTRFS_BLOCK_GROUP_DATA) ||
+ rbio->bioc->map_type & BTRFS_BLOCK_GROUP_METADATA)
+ return;
+
+ rbio->csum_buf = kzalloc(rbio->nr_data * rbio->stripe_nsectors *
+ fs_info->csum_size, GFP_NOFS);
+ rbio->csum_bitmap = bitmap_zalloc(rbio->nr_data * rbio->stripe_nsectors,
+ GFP_NOFS);
+ if (!rbio->csum_buf || !rbio->csum_bitmap)
+ goto error;
+
+ ret = btrfs_lookup_csums_bitmap(csum_root, start, start + len - 1,
+ rbio->csum_buf, rbio->csum_bitmap);
+ if (ret < 0)
+ goto error;
+ if (bitmap_empty(rbio->csum_bitmap, len >> fs_info->sectorsize_bits))
+ goto no_csum;
+ return;
+
+error:
+ /*
+ * We failed to allocate memory or grab the csum, but it's
+ * not fatal, we can still continue. However it's better to
+ * warn users that RMW is no longer safe for this particular
+ * sub-stripe write.
+ */
+ btrfs_warn_rl(fs_info,
+"sub-stripe write for full stripe %llu is not safe, failed to get csum: %d",
+ rbio->bioc->raid_map[0], ret);
+no_csum:
+ kfree(rbio->csum_buf);
+ bitmap_free(rbio->csum_bitmap);
+ rbio->csum_buf = NULL;
+ rbio->csum_bitmap = NULL;
+}
+
static int rmw_read_and_wait(struct btrfs_raid_bio *rbio)
{
struct bio_list bio_list;
@@ -2056,6 +2123,13 @@ static int rmw_read_and_wait(struct btrfs_raid_bio *rbio)
bio_list_init(&bio_list);
+ /*
+ * Fill the data csum we need for data verification.
+ * We need to fill the csum_bitmap/csum_buf first, as our endio
+ * function will try to verify the data sectors.
+ */
+ fill_data_csums(rbio);
+
ret = rmw_assemble_read_bios(rbio, &bio_list);
if (ret < 0)
goto out;
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index a2e653e93fd8..951acd52cd4f 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -129,6 +129,18 @@ struct btrfs_raid_bio {
* Thus making it much harder to iterate.
*/
unsigned long *error_bitmap;
+
+ /*
+ * Checksum buffer if the rbio is for data.
+ * The buffer should cover all data sectors (excluding P/Q sectors).
+ */
+ u8 *csum_buf;
+
+ /*
+ * Each bit represents if the corresponding sector has data csum found.
+ * Should only cover data sectors (excluding P/Q sectors).
+ */
+ unsigned long *csum_bitmap;
};
/*
--
2.38.1
* [PATCH 5/5] btrfs: raid56: do data csum verification during RMW cycle
2022-11-14 0:26 [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles Qu Wenruo
` (3 preceding siblings ...)
2022-11-14 0:26 ` [PATCH 4/5] btrfs: raid56: prepare data checksums for later RMW data csum verification Qu Wenruo
@ 2022-11-14 0:26 ` Qu Wenruo
2022-11-14 1:59 ` [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles Qu Wenruo
2022-11-15 13:24 ` David Sterba
6 siblings, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2022-11-14 0:26 UTC (permalink / raw)
To: linux-btrfs
[BUG]
For the following small script, btrfs will be unable to recover the
content of file1:
mkfs.btrfs -f -m raid1 -d raid5 -b 1G $dev1 $dev2 $dev3
mount $dev1 $mnt
xfs_io -f -c "pwrite -S 0xff 0 64k" -c sync $mnt/file1
md5sum $mnt/file1
umount $mnt
# Corrupt the above 64K data stripe.
xfs_io -f -c "pwrite -S 0x00 323026944 64K" -c sync $dev3
mount $dev1 $mnt
# Write a new 64K, which should be in the other data stripe
# And this is a sub-stripe write, which will cause RMW
xfs_io -f -c "pwrite 0 64k" -c sync $mnt/file2
md5sum $mnt/file1
umount $mnt
Above md5sum would fail.
[CAUSE]
There is a long-standing problem for raid56 (not limited to btrfs
raid56): if we already have corrupted on-disk data and then trigger a
sub-stripe write (which needs an RMW cycle), it can cause further
damage to the P/Q stripe.
Disk 1: data 1 |0x000000000000| <- Corrupted
Disk 2: data 2 |0x000000000000|
Disk 3: parity |0xffffffffffff|
In the above case, data 1 is already corrupted; the original data should
be 64KiB of 0xff.
At this stage, if we read data 1 and it has a data checksum, we can
still recover it via the regular RAID56 recovery path.
But if we now decide to write some data into data 2, we need to go
through RMW.
Say we want to write 64KiB of '0x00' into data 2. We read the on-disk
data of data 1 and calculate the new parity, resulting in the
following layout:
Disk 1: data 1 |0x000000000000| <- Corrupted
Disk 2: data 2 |0x000000000000| <- New '0x00' writes
Disk 3: parity |0x000000000000| <- New parity.
But since the new parity is calculated using the *corrupted* data 1, we
can no longer recover the correct content of data 1.
Thus the corruption is there forever.
[FIX]
To solve the above problem, this patch will do a full stripe data checksum
verification at RMW time.
This involves the following changes:
- Always read the full stripe (including data/P/Q) when doing RMW
Before we only read the missing data sectors, but since we may do a
data csum verification and recovery, we need to read everything out.
Please note that if we have a cached rbio, we don't need to read
anything and can treat it the same as a full stripe write,
as only a stripe whose csums all match can be cached.
- Verify the data csum during read.
The target is only the rbio stripe sectors, and only if the rbio
already has csum_buf/csum_bitmap filled.
Sectors which cannot pass csum verification will have their bits
set in error_bitmap.
- Always call recover_sectors() after we read out all the sectors
Since error_bitmap will be updated during read, recover_sectors()
can easily find all the bad sectors and try to recover them (if still
under tolerance).
And since recover_sectors() is already migrated to use error_bitmap,
it can skip vertical stripes which don't have any error.
- Verify the repaired sectors against their csum in recover_vertical()
- Rename rmw_read_and_wait() to rmw_read_wait_recover()
Since we will always recover the sectors, the old name is no longer
accurate.
Furthermore since recovery is already done in rmw_read_wait_recover(),
we no longer needs to call recovery_sectors() inside rmw_rbio().
Obviously this will have a performance impact, as we are doing more
work during the RMW cycle:
- Fetch the data checksum
- Do checksum verification for all data stripes
- Do checksum verification again after repair
But for full stripe writes or cached rbios we won't have the overhead at
all, thus for a fully optimized RAID56 workload (always full stripe
writes), there should be no extra overhead.
To me, the extra overhead looks reasonable, as data consistency is way
more important than performance.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/raid56.c | 169 +++++++++++++++++++++++++++++++++++++---------
1 file changed, 137 insertions(+), 32 deletions(-)
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 242d3611d2dd..b7317e0cc616 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -21,6 +21,7 @@
#include "raid56.h"
#include "async-thread.h"
#include "file-item.h"
+#include "btrfs_inode.h"
/* set when additional merges to this rbio are not allowed */
#define RBIO_RMW_LOCKED_BIT 1
@@ -1433,14 +1434,56 @@ static void rbio_update_error_bitmap(struct btrfs_raid_bio *rbio, struct bio *bi
bio_size >> rbio->bioc->fs_info->sectorsize_bits);
}
+/* Verify the data sectors at read time. */
+static void verify_bio_data_sectors(struct btrfs_raid_bio *rbio,
+ struct bio *bio)
+{
+ struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
+ int total_sector_nr = get_bio_sector_nr(rbio, bio);
+ struct bio_vec *bvec;
+ struct bvec_iter_all iter_all;
+
+ /* No data csum for the whole stripe, no need to verify. */
+ if (!rbio->csum_bitmap || !rbio->csum_buf)
+ return;
+
+ /* P/Q stripes, they have no data csum to verify against. */
+ if (total_sector_nr >= rbio->nr_data * rbio->stripe_nsectors)
+ return;
+
+ bio_for_each_segment_all(bvec, bio, iter_all) {
+ int bv_offset;
+
+ for (bv_offset = bvec->bv_offset;
+ bv_offset < bvec->bv_offset + bvec->bv_len;
+ bv_offset += fs_info->sectorsize, total_sector_nr++) {
+ u8 csum_buf[BTRFS_CSUM_SIZE];
+ u8 *expected_csum = rbio->csum_buf +
+ total_sector_nr * fs_info->csum_size;
+ int ret;
+
+ /* No csum for this sector, skip to the next sector. */
+ if (!test_bit(total_sector_nr, rbio->csum_bitmap))
+ continue;
+
+ ret = btrfs_check_sector_csum(fs_info, bvec->bv_page,
+ bv_offset, csum_buf, expected_csum);
+ if (ret < 0)
+ set_bit(total_sector_nr, rbio->error_bitmap);
+ }
+ }
+}
+
static void raid_wait_read_end_io(struct bio *bio)
{
struct btrfs_raid_bio *rbio = bio->bi_private;
- if (bio->bi_status)
+ if (bio->bi_status) {
rbio_update_error_bitmap(rbio, bio);
- else
+ } else {
set_bio_pages_uptodate(rbio, bio);
+ verify_bio_data_sectors(rbio, bio);
+ }
bio_put(bio);
if (atomic_dec_and_test(&rbio->stripes_pending))
@@ -1469,37 +1512,25 @@ static void submit_read_bios(struct btrfs_raid_bio *rbio,
static int rmw_assemble_read_bios(struct btrfs_raid_bio *rbio,
struct bio_list *bio_list)
{
- const int nr_data_sectors = rbio->stripe_nsectors * rbio->nr_data;
struct bio *bio;
int total_sector_nr;
int ret = 0;
ASSERT(bio_list_size(bio_list) == 0);
- /* Build a list of bios to read all the missing data sectors. */
- for (total_sector_nr = 0; total_sector_nr < nr_data_sectors;
+ /*
+ * Build a list of bios to read all sectors (including data and P/Q).
+ *
+ * This behavior is to compensate for the later csum verification
+ * and recovery.
+ */
+ for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
total_sector_nr++) {
struct sector_ptr *sector;
int stripe = total_sector_nr / rbio->stripe_nsectors;
int sectornr = total_sector_nr % rbio->stripe_nsectors;
- /*
- * We want to find all the sectors missing from the rbio and
- * read them from the disk. If sector_in_rbio() finds a page
- * in the bio list we don't need to read it off the stripe.
- */
- sector = sector_in_rbio(rbio, stripe, sectornr, 1);
- if (sector)
- continue;
-
sector = rbio_stripe_sector(rbio, stripe, sectornr);
- /*
- * The bio cache may have handed us an uptodate page. If so,
- * use it.
- */
- if (sector->uptodate)
- continue;
-
ret = rbio_add_io_sector(rbio, bio_list, sector,
stripe, sectornr, REQ_OP_READ);
if (ret)
@@ -1670,6 +1701,42 @@ void raid56_parity_write(struct bio *bio, struct btrfs_io_context *bioc)
bio_endio(bio);
}
+static int verify_one_sector(struct btrfs_raid_bio *rbio,
+ int stripe_nr, int sector_nr)
+{
+ struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
+ struct sector_ptr *sector;
+ u8 csum_buf[BTRFS_CSUM_SIZE];
+ u8 *csum_expected;
+ int ret;
+
+ if (!rbio->csum_bitmap || !rbio->csum_buf)
+ return 0;
+
+ /* No way to verify P/Q as they are not covered by data csum. */
+ if (stripe_nr >= rbio->nr_data)
+ return 0;
+ /*
+ * If we're rebuilding a read, we have to use pages from the
+ * bio list if possible.
+ */
+ if ((rbio->operation == BTRFS_RBIO_READ_REBUILD ||
+ rbio->operation == BTRFS_RBIO_REBUILD_MISSING)) {
+ sector = sector_in_rbio(rbio, stripe_nr, sector_nr, 0);
+ } else {
+ sector = rbio_stripe_sector(rbio, stripe_nr, sector_nr);
+ }
+
+ ASSERT(sector->page);
+
+ csum_expected = rbio->csum_buf +
+ (stripe_nr * rbio->stripe_nsectors + sector_nr) *
+ fs_info->csum_size;
+ ret = btrfs_check_sector_csum(fs_info, sector->page, sector->pgoff,
+ csum_buf, csum_expected);
+ return ret;
+}
+
/*
* Recover a vertical stripe specified by @sector_nr.
* @*pointers are the pre-allocated pointers by the caller, so we don't
@@ -1685,6 +1752,7 @@ static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
int faila;
int failb;
int stripe_nr;
+ int ret = 0;
/*
* Now we just use bitmap to mark the horizontal stripes in
@@ -1805,12 +1873,23 @@ static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
* uptodate.
* Especially if we determine to cache the rbio, we need to
* have at least all data sectors uptodate.
+ *
+ * If possible, also check if the repaired sector matches its data
+ * checksum.
*/
if (faila >= 0) {
+ ret = verify_one_sector(rbio, faila, sector_nr);
+ if (ret < 0)
+ goto cleanup;
+
sector = rbio_stripe_sector(rbio, faila, sector_nr);
sector->uptodate = 1;
}
if (failb >= 0) {
+ ret = verify_one_sector(rbio, failb, sector_nr);
+ if (ret < 0)
+ goto cleanup;
+
sector = rbio_stripe_sector(rbio, failb, sector_nr);
sector->uptodate = 1;
}
@@ -1818,7 +1897,7 @@ static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
cleanup:
for (stripe_nr = rbio->real_stripes - 1; stripe_nr >= 0; stripe_nr--)
kunmap_local(unmap_array[stripe_nr]);
- return 0;
+ return ret;
}
static int recover_sectors(struct btrfs_raid_bio *rbio)
@@ -2115,7 +2194,7 @@ static void fill_data_csums(struct btrfs_raid_bio *rbio)
rbio->csum_bitmap = NULL;
}
-static int rmw_read_and_wait(struct btrfs_raid_bio *rbio)
+static int rmw_read_wait_recover(struct btrfs_raid_bio *rbio)
{
struct bio_list bio_list;
struct bio *bio;
@@ -2136,6 +2215,12 @@ static int rmw_read_and_wait(struct btrfs_raid_bio *rbio)
submit_read_bios(rbio, &bio_list);
wait_event(rbio->io_wait, atomic_read(&rbio->stripes_pending) == 0);
+
+ /*
+ * We may or may not have corrupted sectors (including missing
+ * devices and csum mismatches), just let recover_sectors() handle them all.
+ */
+ ret = recover_sectors(rbio);
return ret;
out:
while ((bio = bio_list_pop(&bio_list)))
@@ -2175,6 +2260,28 @@ static void submit_write_bios(struct btrfs_raid_bio *rbio,
}
}
+/*
+ * Determine if we need to read any sector from the disk.
+ * Should only be used in the RMW path, to skip cached rbios.
+ */
+static bool need_read_stripe_sectors(struct btrfs_raid_bio *rbio)
+{
+ int i;
+
+ for (i = 0; i < rbio->nr_data * rbio->stripe_nsectors; i++) {
+ struct sector_ptr *sector = &rbio->stripe_sectors[i];
+
+ /*
+ * We have a sector which doesn't have a page nor is uptodate,
+ * thus this rbio cannot be a cached one, as a cached rbio must
+ * have all its data sectors present and uptodate.
+ */
+ if (!sector->page || !sector->uptodate)
+ return true;
+ }
+ return false;
+}
+
static int rmw_rbio(struct btrfs_raid_bio *rbio)
{
struct bio_list bio_list;
@@ -2189,9 +2296,13 @@ static int rmw_rbio(struct btrfs_raid_bio *rbio)
if (ret < 0)
return ret;
- /* Full stripe write, can write the full stripe right now. */
- if (rbio_is_full(rbio))
+ /*
+ * Either full stripe write, or we have every data sector already
+ * cached, can go to write path immediately.
+ */
+ if (rbio_is_full(rbio) || !need_read_stripe_sectors(rbio))
goto write;
+
/*
* Now we're doing sub-stripe write, also need all data stripes to do
* the full RMW.
@@ -2202,16 +2313,10 @@ static int rmw_rbio(struct btrfs_raid_bio *rbio)
index_rbio_pages(rbio);
- ret = rmw_read_and_wait(rbio);
+ ret = rmw_read_wait_recover(rbio);
if (ret < 0)
return ret;
- /* We have read errors, try recovery path. */
- if (!bitmap_empty(rbio->error_bitmap, rbio->nr_sectors)) {
- ret = recover_rbio(rbio);
- if (ret < 0)
- return ret;
- }
write:
/*
* At this stage we're not allowed to add any new bios to the
--
2.38.1
* Re: [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles
2022-11-14 0:26 [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles Qu Wenruo
` (4 preceding siblings ...)
2022-11-14 0:26 ` [PATCH 5/5] btrfs: raid56: do data csum verification during RMW cycle Qu Wenruo
@ 2022-11-14 1:59 ` Qu Wenruo
2022-11-15 13:24 ` David Sterba
6 siblings, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2022-11-14 1:59 UTC (permalink / raw)
To: linux-btrfs
On 2022/11/14 08:26, Qu Wenruo wrote:
> [DESTRUCTIVE RMW]
> Btrfs (and generic) RAID56 is always vulnerable to destructive RMW.
>
> Such destructive RMW can happen when one of the data stripes is
> corrupted, then a sub-stripe write into other data stripes happens, like
> the following:
>
>
> Disk 1: data 1 |0x000000000000| <- Corrupted
> Disk 2: data 2 |0x000000000000|
> Disk 3: parity |0xffffffffffff|
>
> In the above case, if we write data into disk 2, we will get something like
> this:
>
> Disk 1: data 1 |0x000000000000| <- Corrupted
> Disk 2: data 2 |0x000000000000| <- New '0x00' writes
> Disk 3: parity |0x000000000000| <- New parity.
>
> Since the new parity is calculated using the new writes and the
> corrupted data, we lose the chance to recover data stripe 1, and that
> corruption will be there forever.
>
> [SOLUTION]
> This series will close the destructive RMW problem for RAID5 data
> profiles by:
>
> - Read the full stripe before doing sub-stripe writes.
>
> - Also fetch the data csum for the data stripes:
>
> - Verify every data sector against their data checksum
>
> Now even with the above corrupted data, we know there are some data csum
> mismatches, and we will have an error bitmap to record them.
> We treat read error (like missing device) and data csum mismatch the
> same.
>
> If there is no csum (NODATASUM or metadata), we just trust the data
> unconditionally.
>
> So we will have an error bitmap for above corrupted case like this:
>
> Disk 1: data 1: |XXXXXXX|
> Disk 2: data 2: | |
>
> - Rebuild just as usual
> Since data 1 has all its sectors marked error in the error bitmap,
> we rebuild the sectors of data stripe 1.
>
> - Verify the repaired sectors against their csum
> If csum matches, we can continue.
> Or we error out.
>
> - Continue the RMW cycle using the repaired sectors
> Now we have correct data and will re-calculate the correct parity.
>
> [TODO]
> - Iterate all RAID6 combinations
> Currently we don't try all combinations of RAID6 during the repair.
> Thus for RAID6 we treat it just like RAID5 in RMW.
>
> Currently the RAID6 recovery combination is only exhausted during
> recovery path, relying on the caller to increase the mirror number.
>
> Can be implemented later for RMW path.
>
> - Write back the repaired sectors
> Currently we don't write back the repaired sectors, thus if we read
> the corrupted sectors, we rely on the recover path, and since the new
> parity is calculated using the recovered sectors, we can get the
> correct data without any problem.
>
> But if we write back the repaired sectors during RMW, we can save the
> reader some time without going into recovery path at all.
>
> This is just a performance optimization, thus I believe it can be
> implemented later.
>
> Qu Wenruo (5):
> btrfs: use btrfs_dev_name() helper to handle missing devices better
Please ignore this one, this patch has already been sent as an
independent patch.
The remaining 4 are the main patches.
Thanks,
Qu
> btrfs: refactor btrfs_lookup_csums_range()
> btrfs: introduce a bitmap based csum range search function
> btrfs: raid56: prepare data checksums for later RMW data csum
> verification
> btrfs: raid56: do data csum verification during RMW cycle
>
> fs/btrfs/check-integrity.c | 2 +-
> fs/btrfs/dev-replace.c | 15 +--
> fs/btrfs/disk-io.c | 2 +-
> fs/btrfs/extent-tree.c | 2 +-
> fs/btrfs/extent_io.c | 3 +-
> fs/btrfs/file-item.c | 196 ++++++++++++++++++++++++++----
> fs/btrfs/file-item.h | 8 +-
> fs/btrfs/inode.c | 5 +-
> fs/btrfs/ioctl.c | 4 +-
> fs/btrfs/raid56.c | 243 ++++++++++++++++++++++++++++++++-----
> fs/btrfs/raid56.h | 12 ++
> fs/btrfs/relocation.c | 4 +-
> fs/btrfs/scrub.c | 28 ++---
> fs/btrfs/super.c | 2 +-
> fs/btrfs/tree-log.c | 15 ++-
> fs/btrfs/volumes.c | 16 +--
> fs/btrfs/volumes.h | 9 ++
> 17 files changed, 451 insertions(+), 115 deletions(-)
>
* Re: [PATCH 1/5] btrfs: use btrfs_dev_name() helper to handle missing devices better
2022-11-14 0:26 ` [PATCH 1/5] btrfs: use btrfs_dev_name() helper to handle missing devices better Qu Wenruo
@ 2022-11-14 16:06 ` Goffredo Baroncelli
2022-11-14 22:40 ` Qu Wenruo
0 siblings, 1 reply; 14+ messages in thread
From: Goffredo Baroncelli @ 2022-11-14 16:06 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 14/11/2022 01.26, Qu Wenruo wrote:
[...]
> @@ -770,6 +771,14 @@ static inline void btrfs_dev_stat_set(struct btrfs_device *dev,
> atomic_inc(&dev->dev_stats_ccnt);
> }
>
> +static inline char* btrfs_dev_name(struct btrfs_device *device)
Because we are returning a static char*, should we mark the return
values as "const char*" ?
static inline const char* btrfs_dev_name(struct btrfs_device *device)
> +{
> + if (!device || test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
> + return "<missing disk>";
> + else
> + return rcu_str_deref(device->name);
> +}
> +
> void btrfs_commit_device_sizes(struct btrfs_transaction *trans);
>
> struct list_head * __attribute_const__ btrfs_get_fs_uuids(void);
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: [PATCH 1/5] btrfs: use btrfs_dev_name() helper to handle missing devices better
2022-11-14 16:06 ` Goffredo Baroncelli
@ 2022-11-14 22:40 ` Qu Wenruo
2022-11-14 23:54 ` David Sterba
0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2022-11-14 22:40 UTC (permalink / raw)
To: kreijack, Qu Wenruo, linux-btrfs
On 2022/11/15 00:06, Goffredo Baroncelli wrote:
> On 14/11/2022 01.26, Qu Wenruo wrote:
> [...]
>> @@ -770,6 +771,14 @@ static inline void btrfs_dev_stat_set(struct
>> btrfs_device *dev,
>> atomic_inc(&dev->dev_stats_ccnt);
>> }
>> +static inline char* btrfs_dev_name(struct btrfs_device *device)
>
> Because we are returning a static char*, should we mark the return
> values as "const char*" ?
>
> static inline const char* btrfs_dev_name(struct btrfs_device *device)
Right, that would be more accurate.
Not sure if David can fold this change during merging.
Thanks,
Qu
>
>> +{
>> + if (!device || test_bit(BTRFS_DEV_STATE_MISSING,
>> &device->dev_state))
>> + return "<missing disk>";
>> + else
>> + return rcu_str_deref(device->name);
>> +}
>> +
>> void btrfs_commit_device_sizes(struct btrfs_transaction *trans);
>> struct list_head * __attribute_const__ btrfs_get_fs_uuids(void);
>
* Re: [PATCH 1/5] btrfs: use btrfs_dev_name() helper to handle missing devices better
2022-11-14 22:40 ` Qu Wenruo
@ 2022-11-14 23:54 ` David Sterba
0 siblings, 0 replies; 14+ messages in thread
From: David Sterba @ 2022-11-14 23:54 UTC (permalink / raw)
To: Qu Wenruo; +Cc: kreijack, Qu Wenruo, linux-btrfs
On Tue, Nov 15, 2022 at 06:40:30AM +0800, Qu Wenruo wrote:
>
>
> On 2022/11/15 00:06, Goffredo Baroncelli wrote:
> > On 14/11/2022 01.26, Qu Wenruo wrote:
> > [...]
> >> @@ -770,6 +771,14 @@ static inline void btrfs_dev_stat_set(struct
> >> btrfs_device *dev,
> >> atomic_inc(&dev->dev_stats_ccnt);
> >> }
> >> +static inline char* btrfs_dev_name(struct btrfs_device *device)
> >
> > Because we are returning a static char*, should we mark the return
> > values as "const char*" ?
> >
> > static inline const char* btrfs_dev_name(struct btrfs_device *device)
>
> Right, that would be more accurate.
>
> Not sure if David can fold this change during merging.
No problem, updated.
* Re: [PATCH 4/5] btrfs: raid56: prepare data checksums for later RMW data csum verification
2022-11-14 0:26 ` [PATCH 4/5] btrfs: raid56: prepare data checksums for later RMW data csum verification Qu Wenruo
@ 2022-11-15 6:40 ` Dan Carpenter
2022-11-15 13:18 ` David Sterba
0 siblings, 1 reply; 14+ messages in thread
From: Dan Carpenter @ 2022-11-15 6:40 UTC (permalink / raw)
To: oe-kbuild, Qu Wenruo, linux-btrfs; +Cc: lkp, oe-kbuild-all
Hi Qu,
https://git-scm.com/docs/git-format-patch#_base_tree_information
url: https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-raid56-destructive-RMW-fix-for-RAID5-data-profiles/20221114-082910
base: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
patch link: https://lore.kernel.org/r/6620c738ea5f959085bfad0c0c843880f4c4e6e2.1668384746.git.wqu%40suse.com
patch subject: [PATCH 4/5] btrfs: raid56: prepare data checksums for later RMW data csum verification
config: x86_64-randconfig-m001-20221114
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <error27@gmail.com>
New smatch warnings:
fs/btrfs/raid56.c:2108 fill_data_csums() error: uninitialized symbol 'ret'.
Old smatch warnings:
fs/btrfs/raid56.c:1017 get_rbio_veritical_errors() error: we previously assumed 'faila' could be null (see line 1011)
vim +/ret +2108 fs/btrfs/raid56.c
f24e7de3761574 Qu Wenruo 2022-11-14 2057 static void fill_data_csums(struct btrfs_raid_bio *rbio)
f24e7de3761574 Qu Wenruo 2022-11-14 2058 {
f24e7de3761574 Qu Wenruo 2022-11-14 2059 struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
f24e7de3761574 Qu Wenruo 2022-11-14 2060 struct btrfs_root *csum_root = btrfs_csum_root(fs_info,
f24e7de3761574 Qu Wenruo 2022-11-14 2061 rbio->bioc->raid_map[0]);
f24e7de3761574 Qu Wenruo 2022-11-14 2062 const u64 start = rbio->bioc->raid_map[0];
f24e7de3761574 Qu Wenruo 2022-11-14 2063 const u32 len = (rbio->nr_data * rbio->stripe_nsectors) <<
f24e7de3761574 Qu Wenruo 2022-11-14 2064 fs_info->sectorsize_bits;
f24e7de3761574 Qu Wenruo 2022-11-14 2065 int ret;
f24e7de3761574 Qu Wenruo 2022-11-14 2066
f24e7de3761574 Qu Wenruo 2022-11-14 2067 /* The rbio should not has its csum buffer initialized. */
f24e7de3761574 Qu Wenruo 2022-11-14 2068 ASSERT(!rbio->csum_buf && !rbio->csum_bitmap);
f24e7de3761574 Qu Wenruo 2022-11-14 2069
f24e7de3761574 Qu Wenruo 2022-11-14 2070 /*
f24e7de3761574 Qu Wenruo 2022-11-14 2071 * Skip the csum search if:
f24e7de3761574 Qu Wenruo 2022-11-14 2072 *
f24e7de3761574 Qu Wenruo 2022-11-14 2073 * - The rbio doesn't belongs to data block groups
f24e7de3761574 Qu Wenruo 2022-11-14 2074 * Then we are doing IO for tree blocks, no need to
f24e7de3761574 Qu Wenruo 2022-11-14 2075 * search csums.
f24e7de3761574 Qu Wenruo 2022-11-14 2076 *
f24e7de3761574 Qu Wenruo 2022-11-14 2077 * - The rbio belongs to mixed block groups
f24e7de3761574 Qu Wenruo 2022-11-14 2078 * This is to avoid deadlock, as we're already holding
f24e7de3761574 Qu Wenruo 2022-11-14 2079 * the full stripe lock, if we trigger a metadata read, and
f24e7de3761574 Qu Wenruo 2022-11-14 2080 * it needs to do raid56 recovery, we will deadlock.
f24e7de3761574 Qu Wenruo 2022-11-14 2081 */
f24e7de3761574 Qu Wenruo 2022-11-14 2082 if (!(rbio->bioc->map_type & BTRFS_BLOCK_GROUP_DATA) ||
f24e7de3761574 Qu Wenruo 2022-11-14 2083 rbio->bioc->map_type & BTRFS_BLOCK_GROUP_METADATA)
f24e7de3761574 Qu Wenruo 2022-11-14 2084 return;
f24e7de3761574 Qu Wenruo 2022-11-14 2085
f24e7de3761574 Qu Wenruo 2022-11-14 2086 rbio->csum_buf = kzalloc(rbio->nr_data * rbio->stripe_nsectors *
f24e7de3761574 Qu Wenruo 2022-11-14 2087 fs_info->csum_size, GFP_NOFS);
f24e7de3761574 Qu Wenruo 2022-11-14 2088 rbio->csum_bitmap = bitmap_zalloc(rbio->nr_data * rbio->stripe_nsectors,
f24e7de3761574 Qu Wenruo 2022-11-14 2089 GFP_NOFS);
f24e7de3761574 Qu Wenruo 2022-11-14 2090 if (!rbio->csum_buf || !rbio->csum_bitmap)
f24e7de3761574 Qu Wenruo 2022-11-14 2091 goto error;
"ret" uninitialized on this path.
f24e7de3761574 Qu Wenruo 2022-11-14 2092
f24e7de3761574 Qu Wenruo 2022-11-14 2093 ret = btrfs_lookup_csums_bitmap(csum_root, start, start + len - 1,
f24e7de3761574 Qu Wenruo 2022-11-14 2094 rbio->csum_buf, rbio->csum_bitmap);
f24e7de3761574 Qu Wenruo 2022-11-14 2095 if (ret < 0)
f24e7de3761574 Qu Wenruo 2022-11-14 2096 goto error;
f24e7de3761574 Qu Wenruo 2022-11-14 2097 if (bitmap_empty(rbio->csum_bitmap, len >> fs_info->sectorsize_bits))
f24e7de3761574 Qu Wenruo 2022-11-14 2098 goto no_csum;
f24e7de3761574 Qu Wenruo 2022-11-14 2099 return;
f24e7de3761574 Qu Wenruo 2022-11-14 2100
f24e7de3761574 Qu Wenruo 2022-11-14 2101 error:
f24e7de3761574 Qu Wenruo 2022-11-14 2102 /*
f24e7de3761574 Qu Wenruo 2022-11-14 2103 * We failed to allocate memory or look up the csums,
f24e7de3761574 Qu Wenruo 2022-11-14 2104 * but it's not the end of the day; we can still continue.
f24e7de3761574 Qu Wenruo 2022-11-14 2105 * But it's better to warn users that RMW is no longer safe
f24e7de3761574 Qu Wenruo 2022-11-14 2106 * for this particular sub-stripe write.
f24e7de3761574 Qu Wenruo 2022-11-14 2107 */
f24e7de3761574 Qu Wenruo 2022-11-14 @2108 btrfs_warn_rl(fs_info,
f24e7de3761574 Qu Wenruo 2022-11-14 2109 "sub-stripe write for full stripe %llu is not safe, failed to get csum: %d",
f24e7de3761574 Qu Wenruo 2022-11-14 2110 rbio->bioc->raid_map[0], ret);
^^^
f24e7de3761574 Qu Wenruo 2022-11-14 2111 no_csum:
f24e7de3761574 Qu Wenruo 2022-11-14 2112 kfree(rbio->csum_buf);
f24e7de3761574 Qu Wenruo 2022-11-14 2113 bitmap_free(rbio->csum_bitmap);
f24e7de3761574 Qu Wenruo 2022-11-14 2114 rbio->csum_buf = NULL;
f24e7de3761574 Qu Wenruo 2022-11-14 2115 rbio->csum_bitmap = NULL;
f24e7de3761574 Qu Wenruo 2022-11-14 2116 }
--
0-DAY CI Kernel Test Service
https://01.org/lkp
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 4/5] btrfs: raid56: prepare data checksums for later RMW data csum verification
2022-11-15 6:40 ` Dan Carpenter
@ 2022-11-15 13:18 ` David Sterba
0 siblings, 0 replies; 14+ messages in thread
From: David Sterba @ 2022-11-15 13:18 UTC (permalink / raw)
To: Dan Carpenter; +Cc: oe-kbuild, Qu Wenruo, linux-btrfs, lkp, oe-kbuild-all
On Tue, Nov 15, 2022 at 09:40:22AM +0300, Dan Carpenter wrote:
> Hi Qu,
>
> https://git-scm.com/docs/git-format-patch#_base_tree_information
>
> url: https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-raid56-destructive-RMW-fix-for-RAID5-data-profiles/20221114-082910
> base: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
> patch link: https://lore.kernel.org/r/6620c738ea5f959085bfad0c0c843880f4c4e6e2.1668384746.git.wqu%40suse.com
> patch subject: [PATCH 4/5] btrfs: raid56: prepare data checksums for later RMW data csum verification
> config: x86_64-randconfig-m001-20221114
> compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
>
> If you fix the issue, kindly add following tag where applicable
> | Reported-by: kernel test robot <lkp@intel.com>
> | Reported-by: Dan Carpenter <error27@gmail.com>
>
> New smatch warnings:
> fs/btrfs/raid56.c:2108 fill_data_csums() error: uninitialized symbol 'ret'.
>
> Old smatch warnings:
> fs/btrfs/raid56.c:1017 get_rbio_veritical_errors() error: we previously assumed 'faila' could be null (see line 1011)
>
> vim +/ret +2108 fs/btrfs/raid56.c
>
> f24e7de3761574 Qu Wenruo 2022-11-14 2057 static void fill_data_csums(struct btrfs_raid_bio *rbio)
> f24e7de3761574 Qu Wenruo 2022-11-14 2058 {
> f24e7de3761574 Qu Wenruo 2022-11-14 2059 struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
> f24e7de3761574 Qu Wenruo 2022-11-14 2060 struct btrfs_root *csum_root = btrfs_csum_root(fs_info,
> f24e7de3761574 Qu Wenruo 2022-11-14 2061 rbio->bioc->raid_map[0]);
> f24e7de3761574 Qu Wenruo 2022-11-14 2062 const u64 start = rbio->bioc->raid_map[0];
> f24e7de3761574 Qu Wenruo 2022-11-14 2063 const u32 len = (rbio->nr_data * rbio->stripe_nsectors) <<
> f24e7de3761574 Qu Wenruo 2022-11-14 2064 fs_info->sectorsize_bits;
> f24e7de3761574 Qu Wenruo 2022-11-14 2065 int ret;
> f24e7de3761574 Qu Wenruo 2022-11-14 2066
> f24e7de3761574 Qu Wenruo 2022-11-14 2067 /* The rbio should not have its csum buffer initialized. */
> f24e7de3761574 Qu Wenruo 2022-11-14 2068 ASSERT(!rbio->csum_buf && !rbio->csum_bitmap);
> f24e7de3761574 Qu Wenruo 2022-11-14 2069
> f24e7de3761574 Qu Wenruo 2022-11-14 2070 /*
> f24e7de3761574 Qu Wenruo 2022-11-14 2071 * Skip the csum search if:
> f24e7de3761574 Qu Wenruo 2022-11-14 2072 *
> f24e7de3761574 Qu Wenruo 2022-11-14 2073 * - The rbio doesn't belong to data block groups
> f24e7de3761574 Qu Wenruo 2022-11-14 2074 * Then we are doing IO for tree blocks, no need to
> f24e7de3761574 Qu Wenruo 2022-11-14 2075 * search csums.
> f24e7de3761574 Qu Wenruo 2022-11-14 2076 *
> f24e7de3761574 Qu Wenruo 2022-11-14 2077 * - The rbio belongs to mixed block groups
> f24e7de3761574 Qu Wenruo 2022-11-14 2078 * This is to avoid a deadlock: we're already holding
> f24e7de3761574 Qu Wenruo 2022-11-14 2079 * the full stripe lock, so if we trigger a metadata read
> f24e7de3761574 Qu Wenruo 2022-11-14 2080 * that needs to do raid56 recovery, we will deadlock.
> f24e7de3761574 Qu Wenruo 2022-11-14 2081 */
> f24e7de3761574 Qu Wenruo 2022-11-14 2082 if (!(rbio->bioc->map_type & BTRFS_BLOCK_GROUP_DATA) ||
> f24e7de3761574 Qu Wenruo 2022-11-14 2083 rbio->bioc->map_type & BTRFS_BLOCK_GROUP_METADATA)
> f24e7de3761574 Qu Wenruo 2022-11-14 2084 return;
> f24e7de3761574 Qu Wenruo 2022-11-14 2085
> f24e7de3761574 Qu Wenruo 2022-11-14 2086 rbio->csum_buf = kzalloc(rbio->nr_data * rbio->stripe_nsectors *
> f24e7de3761574 Qu Wenruo 2022-11-14 2087 fs_info->csum_size, GFP_NOFS);
> f24e7de3761574 Qu Wenruo 2022-11-14 2088 rbio->csum_bitmap = bitmap_zalloc(rbio->nr_data * rbio->stripe_nsectors,
> f24e7de3761574 Qu Wenruo 2022-11-14 2089 GFP_NOFS);
> f24e7de3761574 Qu Wenruo 2022-11-14 2090 if (!rbio->csum_buf || !rbio->csum_bitmap)
> f24e7de3761574 Qu Wenruo 2022-11-14 2091 goto error;
>
> "ret" uninitialized on this path.
Thanks for the report, now initialized to -ENOMEM.
* Re: [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles
2022-11-14 0:26 [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles Qu Wenruo
` (5 preceding siblings ...)
2022-11-14 1:59 ` [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles Qu Wenruo
@ 2022-11-15 13:24 ` David Sterba
2022-11-16 3:21 ` Qu Wenruo
6 siblings, 1 reply; 14+ messages in thread
From: David Sterba @ 2022-11-15 13:24 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Mon, Nov 14, 2022 at 08:26:29AM +0800, Qu Wenruo wrote:
> [DESTRUCTIVE RMW]
> Btrfs (and generic) RAID56 is always vulnerable to destructive RMW.
>
> Such destructive RMW can happen when one of the data stripes is
> corrupted and then a sub-stripe write into the other data stripes
> happens, like the following:
>
>
> Disk 1: data 1 |0x000000000000| <- Corrupted
> Disk 2: data 2 |0x000000000000|
> Disk 3: parity |0xffffffffffff|
>
> In the above case, if we write data into disk 2, we will get something
> like this:
>
> Disk 1: data 1 |0x000000000000| <- Corrupted
> Disk 2: data 2 |0x000000000000| <- New '0x00' writes
> Disk 3: parity |0x000000000000| <- New parity
>
> Since the new parity is calculated using the new writes and the
> corrupted data, we lose the chance to recover data stripe 1, and that
> corruption will be there forever.
...
> [TODO]
> - Iterate all RAID6 combinations
> Currently we don't try all combinations of RAID6 during the repair.
> Thus for RAID6 we treat it just like RAID5 in RMW.
>
> Currently the RAID6 recovery combinations are only exhausted during
> the recovery path, relying on the caller to increase the mirror number.
>
> This can be implemented later for the RMW path.
>
> - Write back the repaired sectors
> Currently we don't write back the repaired sectors, thus if we read
> the corrupted sectors, we rely on the recovery path, and since the new
> parity is calculated using the recovered sectors, we can get the
> correct data without any problem.
>
> But if we write back the repaired sectors during RMW, we can save the
> reader some time without going into recovery path at all.
>
> This is just a performance optimization, thus I believe it can be
> implemented later.
Even with the TODO items and the potential performance drop due to
mandatory stripe caching, I think this is worth merging at this point,
so the patches are in misc-next now.
Regarding the perf drop, I'll try to get some results besides functional
testing, but if anybody with a decent setup for raid5 can run some
destructive tests, it would be most welcome.
The fallback plan is to revert the last patch if it turns out to be too
problematic.
* Re: [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles
2022-11-15 13:24 ` David Sterba
@ 2022-11-16 3:21 ` Qu Wenruo
0 siblings, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2022-11-16 3:21 UTC (permalink / raw)
To: dsterba, Qu Wenruo; +Cc: linux-btrfs
On 2022/11/15 21:24, David Sterba wrote:
> On Mon, Nov 14, 2022 at 08:26:29AM +0800, Qu Wenruo wrote:
>> [DESTRUCTIVE RMW]
>> Btrfs (and generic) RAID56 is always vulnerable to destructive RMW.
>>
>> Such destructive RMW can happen when one of the data stripes is
>> corrupted and then a sub-stripe write into the other data stripes
>> happens, like the following:
>>
>>
>> Disk 1: data 1 |0x000000000000| <- Corrupted
>> Disk 2: data 2 |0x000000000000|
>> Disk 3: parity |0xffffffffffff|
>>
>> In the above case, if we write data into disk 2, we will get something
>> like this:
>>
>> Disk 1: data 1 |0x000000000000| <- Corrupted
>> Disk 2: data 2 |0x000000000000| <- New '0x00' writes
>> Disk 3: parity |0x000000000000| <- New parity
>>
>> Since the new parity is calculated using the new writes and the
>> corrupted data, we lose the chance to recover data stripe 1, and that
>> corruption will be there forever.
> ...
>> [TODO]
>> - Iterate all RAID6 combinations
>> Currently we don't try all combinations of RAID6 during the repair.
>> Thus for RAID6 we treat it just like RAID5 in RMW.
>>
>> Currently the RAID6 recovery combinations are only exhausted during
>> the recovery path, relying on the caller to increase the mirror number.
>>
>> This can be implemented later for the RMW path.
>>
>> - Write back the repaired sectors
>> Currently we don't write back the repaired sectors, thus if we read
>> the corrupted sectors, we rely on the recovery path, and since the new
>> parity is calculated using the recovered sectors, we can get the
>> correct data without any problem.
>>
>> But if we write back the repaired sectors during RMW, we can save the
>> reader some time without going into recovery path at all.
>>
>> This is just a performance optimization, thus I believe it can be
>> implemented later.
>
> Even with the TODO items and the potential performance drop due to
> mandatory stripe caching, I think this is worth merging at this point,
> so the patches are in misc-next now.
>
> Regarding the perf drop, I'll try to get some results besides functional
> testing, but if anybody with a decent setup for raid5 can run some
> destructive tests, it would be most welcome.
>
> The fallback plan is to revert the last patch if it turns out to be too
> problematic.
If you're concerned about the fallback plan, I'd say we need to drop at
least the last two patches.
Sure, dropping the last one would remove the verification and the extra
reads, but unfortunately we would still fetch the csums for RMW cycles.
Although the csum fetch would cost at most 2 leaves, the cost is still
there.
Thus if needed, I can do more refactoring of the series to separate out
the final patch.
Although I hope we don't need to go down that path.
Thanks,
Qu
Thread overview: 14+ messages (newest: 2022-11-16 3:21 UTC)
2022-11-14 0:26 [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles Qu Wenruo
2022-11-14 0:26 ` [PATCH 1/5] btrfs: use btrfs_dev_name() helper to handle missing devices better Qu Wenruo
2022-11-14 16:06 ` Goffredo Baroncelli
2022-11-14 22:40 ` Qu Wenruo
2022-11-14 23:54 ` David Sterba
2022-11-14 0:26 ` [PATCH 2/5] btrfs: refactor btrfs_lookup_csums_range() Qu Wenruo
2022-11-14 0:26 ` [PATCH 3/5] btrfs: introduce a bitmap based csum range search function Qu Wenruo
2022-11-14 0:26 ` [PATCH 4/5] btrfs: raid56: prepare data checksums for later RMW data csum verification Qu Wenruo
2022-11-15 6:40 ` Dan Carpenter
2022-11-15 13:18 ` David Sterba
2022-11-14 0:26 ` [PATCH 5/5] btrfs: raid56: do data csum verification during RMW cycle Qu Wenruo
2022-11-14 1:59 ` [PATCH 0/5] btrfs: raid56: destructive RMW fix for RAID5 data profiles Qu Wenruo
2022-11-15 13:24 ` David Sterba
2022-11-16 3:21 ` Qu Wenruo