* [RFC PATCH 1/2] btrfs: csum: Introduce partial csum for tree block.
2015-07-10 4:09 [RFC PATCH 0/2] Btrfs partial csum support Qu Wenruo
@ 2015-07-10 4:09 ` Qu Wenruo
2015-07-10 4:09 ` [RFC PATCH 2/2] btrfs: scrub: Add support partial csum Qu Wenruo
2015-07-20 8:12 ` [RFC PATCH 0/2] Btrfs partial csum support Qu Wenruo
2 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2015-07-10 4:09 UTC (permalink / raw)
To: linux-btrfs
Introduce a new partial csum mechanism for tree blocks.
[Old tree block csum]
0 4 8 12 16 20 24 28 32
-------------------------------------------------
|csum | unused, all 0 |
-------------------------------------------------
Csum is the crc32 of the whole tree block data.
[New tree block csum]
-------------------------------------------------
|csum0|csum1|csum2|csum3|csum4|csum5|csum6|csum7|
-------------------------------------------------
Where csum0 is the same as the old one, crc32 of the whole tree block
data.
But csum1~csum7 store the crc32 of each eighth part of the block.
Taking a 16K leafsize as an example:
csum1: crc32 of BTRFS_CSUM_SIZE~4K
csum2: crc32 of 4K~6K
...
csum7: crc32 of 14K~16K
This gives btrfs the ability not only to detect corruption but also to
know where the corruption is, further improving its robustness.
Although the cleanest approach would be to introduce a new csum type and
put each eighth's crc32 into its corresponding slot, the benefit is not
worth breaking backward compatibility.
So keep csum0 as-is and only adjust the range of csum1 to preserve
backward compatibility.
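For illustration only (this is not part of the patch): below is a minimal
userspace sketch of how each of the eight 4-byte csum slots maps onto a
byte range of the tree block, using the same boundaries as
csum_tree_block_part() in the hunk below. The 16K nodesize is just an
example value.

#include <stdio.h>

#define BTRFS_CSUM_SIZE 32	/* size of the on-disk csum area */

/* Byte range covered by a given csum part, mirroring the patch. */
static void part_range(unsigned int nodesize, int part,
		       unsigned int *offset, unsigned int *len)
{
	if (part == 0) {
		/* csum0: the whole block, as on old kernels */
		*offset = BTRFS_CSUM_SIZE;
		*len = nodesize - BTRFS_CSUM_SIZE;
	} else if (part == 1) {
		/* csum1: the first two eighths, minus the csum header */
		*offset = BTRFS_CSUM_SIZE;
		*len = nodesize * 2 / 8 - BTRFS_CSUM_SIZE;
	} else {
		/* csum2..csum7: plain eighths */
		*offset = part * nodesize / 8;
		*len = nodesize / 8;
	}
}

int main(void)
{
	unsigned int offset, len;
	int part;

	for (part = 0; part < 8; part++) {
		part_range(16 * 1024, part, &offset, &len);
		printf("csum%d covers bytes %u..%u\n",
		       part, offset, offset + len);
	}
	return 0;
}

For a 16K leaf this prints csum0: 32..16384, csum1: 32..4096,
csum2: 4096..6144 and so on up to csum7: 14336..16384, matching the
layout described above.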
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
fs/btrfs/disk-io.c | 74 ++++++++++++++++++++++++++++++++++++------------------
1 file changed, 49 insertions(+), 25 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2ef9a4b..b2d8526 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -271,47 +271,75 @@ void btrfs_csum_final(u32 crc, char *result)
}
/*
- * compute the csum for a btree block, and either verify it or write it
- * into the csum field of the block.
+ * Calculate the partial crc32 for each part.
+ *
+ * Part should be in [0, 7].
+ * Part 0 is the old crc32 of the whole leaf/node.
+ * Part 1 is the crc32 of BTRFS_CSUM_SIZE ~ 2/8 of the leaf/node.
+ * Part 2 is the crc32 of 2/8 ~ 3/8 of the leaf/node.
+ * Part 3 is the crc32 of 3/8 ~ 4/8 of the leaf/node, and so on.
*/
-static int csum_tree_block(struct btrfs_fs_info *fs_info,
- struct extent_buffer *buf,
- int verify)
+static int csum_tree_block_part(struct extent_buffer *buf,
+ char *result, int part)
{
- u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);
- char *result = NULL;
+ int offset;
+ int err;
unsigned long len;
unsigned long cur_len;
- unsigned long offset = BTRFS_CSUM_SIZE;
- char *kaddr;
unsigned long map_start;
unsigned long map_len;
- int err;
+ char *kaddr;
u32 crc = ~(u32)0;
- unsigned long inline_result;
- len = buf->len - offset;
+ BUG_ON(part >= 8 || part < 0);
+ BUG_ON(ALIGN(buf->len, 8) != buf->len);
+
+ if (part == 0) {
+ offset = BTRFS_CSUM_SIZE;
+ len = buf->len - offset;
+ } else if (part == 1) {
+ offset = BTRFS_CSUM_SIZE;
+ len = buf->len * 2 / 8 - offset;
+ } else {
+ offset = part * buf->len / 8;
+ len = buf->len / 8;
+ }
+
while (len > 0) {
err = map_private_extent_buffer(buf, offset, 32,
&kaddr, &map_start, &map_len);
if (err)
- return 1;
+ return err;
cur_len = min(len, map_len - (offset - map_start));
crc = btrfs_csum_data(kaddr + offset - map_start,
crc, cur_len);
len -= cur_len;
offset += cur_len;
}
- if (csum_size > sizeof(inline_result)) {
- result = kzalloc(csum_size, GFP_NOFS);
- if (!result)
+ btrfs_csum_final(crc, result + BTRFS_CSUM_SIZE * part / 8);
+ return 0;
+}
+
+/*
+ * compute the csum for a btree block, and either verify it or write it
+ * into the csum field of the block.
+ */
+static int csum_tree_block(struct btrfs_fs_info *fs_info,
+ struct extent_buffer *buf,
+ int verify)
+{
+ u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);
+ char result[BTRFS_CSUM_SIZE] = {0};
+ int err;
+ int index = 0;
+
+ /* get every part csum */
+ for (index = 0; index < 8; index++) {
+ err = csum_tree_block_part(buf, result, index);
+ if (err)
return 1;
- } else {
- result = (char *)&inline_result;
}
- btrfs_csum_final(crc, result);
-
if (verify) {
if (memcmp_extent_buffer(buf, result, 0, csum_size)) {
u32 val;
@@ -324,15 +352,11 @@ static int csum_tree_block(struct btrfs_fs_info *fs_info,
"level %d\n",
fs_info->sb->s_id, buf->start,
val, found, btrfs_header_level(buf));
- if (result != (char *)&inline_result)
- kfree(result);
return 1;
}
} else {
- write_extent_buffer(buf, result, 0, csum_size);
+ write_extent_buffer(buf, result, 0, BTRFS_CSUM_SIZE);
}
- if (result != (char *)&inline_result)
- kfree(result);
return 0;
}
--
2.4.5
* [RFC PATCH 2/2] btrfs: scrub: Add support partial csum
2015-07-10 4:09 [RFC PATCH 0/2] Btrfs partial csum support Qu Wenruo
2015-07-10 4:09 ` [RFC PATCH 1/2] btrfs: csum: Introduce partial csum for tree block Qu Wenruo
@ 2015-07-10 4:09 ` Qu Wenruo
2015-07-20 8:12 ` [RFC PATCH 0/2] Btrfs partial csum support Qu Wenruo
2 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2015-07-10 4:09 UTC (permalink / raw)
To: linux-btrfs; +Cc: Zhao Lei
From: Zhao Lei <zhaolei@cn.fujitsu.com>
Add scrub support for partial csum.
The only challenge is that scrub works in units of a bio (or a page),
while partial csum works in units of 1/8 of the nodesize.
So a new function, scrub_check_node_checksum, and a new tree block csum
check loop are introduced to do the partial csum check while reading the
tree block.
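For illustration only (this is not part of the patch): a rough userspace
sketch of the page-to-csum-part mapping that the recovery loop below
relies on. The 4K page size and 32K nodesize are assumed purely for the
example; they show the case where every csum part covers whole pages.

#include <stdio.h>

int main(void)
{
	const unsigned int pagesize = 4096;	/* assumed sector/page size */
	const unsigned int nodesize = 32768;	/* assumed tree block size */
	unsigned int recover_start = 0;		/* like per_page_recover_start */
	unsigned int page_num;

	for (page_num = 0; page_num < nodesize / pagesize; page_num++) {
		/*
		 * First csum part at the byte offset where the not yet
		 * handled pages begin; part 0 is the whole-block csum,
		 * so clamp to 1.
		 */
		unsigned int start = recover_start * 8 * pagesize / nodesize;
		/* first csum part of the next page */
		unsigned int next = (page_num + 1) * 8 * pagesize / nodesize;

		if (!start)
			start = 1;
		if (!next)
			next = 1;

		/* the pages accumulated so far don't complete a csum part yet */
		if (next == start)
			continue;

		printf("pages %u..%u are verified by csum parts %u..%u\n",
		       recover_start, page_num, start, next - 1);
		recover_start = page_num + 1;
	}
	return 0;
}

With these assumed sizes, pages 0 and 1 are covered together by csum
part 1 (which spans BTRFS_CSUM_SIZE~8K), and each remaining page maps to
exactly one csum part, which is what lets scrub repair a single bad page
from whichever mirror has a matching partial csum.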
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
fs/btrfs/scrub.c | 207 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 206 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ab58115..0610474 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -307,6 +307,7 @@ static void copy_nocow_pages_worker(struct btrfs_work *work);
static void __scrub_blocked_if_needed(struct btrfs_fs_info *fs_info);
static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info);
static void scrub_put_ctx(struct scrub_ctx *sctx);
+static int scrub_check_fsid(u8 fsid[], struct scrub_page *spage);
static void scrub_pending_bio_inc(struct scrub_ctx *sctx)
@@ -878,6 +879,91 @@ static inline void scrub_put_recover(struct scrub_recover *recover)
}
/*
+ * The spage argument should be a page that includes the leaf header.
+ *
+ * Return 0 if this header seems correct,
+ * return 1 otherwise.
+ */
+static int scrub_check_head(struct scrub_page *spage, u8 *csum)
+{
+ void *mapped_buffer;
+ struct btrfs_header *h;
+
+ mapped_buffer = kmap_atomic(spage->page);
+ h = (struct btrfs_header *)mapped_buffer;
+
+ if (spage->logical != btrfs_stack_header_bytenr(h))
+ goto header_err;
+ if (!scrub_check_fsid(h->fsid, spage))
+ goto header_err;
+ if (memcmp(h->chunk_tree_uuid,
+ spage->dev->dev_root->fs_info->chunk_tree_uuid,
+ BTRFS_UUID_SIZE))
+ goto header_err;
+ if (spage->generation != btrfs_stack_header_generation(h))
+ goto header_err;
+
+ if (csum)
+ memcpy(csum, h->csum, sizeof(h->csum));
+
+ kunmap_atomic(mapped_buffer);
+ return 0;
+
+header_err:
+ kunmap_atomic(mapped_buffer);
+ return 1;
+}
+
+/*
+ * Return 1 if the checksum is OK, 0 otherwise.
+ */
+static int scrub_check_node_checksum(struct scrub_block *sblock,
+ int part,
+ u8 *csum)
+{
+ int offset;
+ int len;
+ u32 crc = ~(u32)0;
+
+ if (part == 0) {
+ offset = BTRFS_CSUM_SIZE;
+ len = sblock->sctx->nodesize - BTRFS_CSUM_SIZE;
+ } else if (part == 1) {
+ offset = BTRFS_CSUM_SIZE;
+ len = sblock->sctx->nodesize * 2 / 8 - BTRFS_CSUM_SIZE;
+ } else {
+ offset = part * sblock->sctx->nodesize / 8;
+ len = sblock->sctx->nodesize / 8;
+ }
+
+ while (len > 0) {
+ int page_num = offset / PAGE_SIZE;
+ int page_data_offset = offset - page_num * PAGE_SIZE;
+ int page_data_len = min(len,
+ (int)(PAGE_SIZE - page_data_offset));
+ u8 *mapped_buffer;
+
+ WARN_ON(page_num >= sblock->page_count);
+
+ if (sblock->pagev[page_num]->io_error)
+ return 0;
+
+ mapped_buffer = kmap_atomic(
+ sblock->pagev[page_num]->page);
+
+ crc = btrfs_csum_data(mapped_buffer + page_data_offset, crc,
+ page_data_len);
+
+ offset += page_data_len;
+ len -= page_data_len;
+
+ kunmap_atomic(mapped_buffer);
+ }
+ btrfs_csum_final(crc, (char *)&crc);
+ return (crc == ((u32 *)csum)[part]);
+}
+
+/*
* scrub_handle_errored_block gets called when either verification of the
* pages failed or the bio failed to read, e.g. with EIO. In the latter
* case, this function handles all pages in the bio, even though only one
@@ -905,6 +991,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
int success;
static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
+ u8 node_csum[BTRFS_CSUM_SIZE];
+ int get_right_sum = 0;
+ int per_page_recover_start = 0;
BUG_ON(sblock_to_check->page_count < 1);
fs_info = sctx->dev_root->fs_info;
@@ -1151,11 +1240,125 @@ nodatasum_case:
* area are unreadable.
*/
success = 1;
+
+ /*
+ * Some mirror's header may be broken, so pick a mirror with a
+ * correct header to take the checksum from.
+ */
+ for (mirror_index = 0; mirror_index < BTRFS_MAX_MIRRORS &&
+ sblocks_for_recheck[mirror_index].page_count > 0;
+ mirror_index++) {
+ if (scrub_check_head(sblocks_for_recheck[mirror_index].pagev[0],
+ node_csum) == 0) {
+ get_right_sum = 1;
+ break;
+ }
+ }
+
for (page_num = 0; page_num < sblock_bad->page_count;
page_num++) {
struct scrub_page *page_bad = sblock_bad->pagev[page_num];
struct scrub_block *sblock_other = NULL;
+ if (is_metadata && get_right_sum) {
+ /*
+ * For tree blocks which may support partial csum
+ *
+ * | page | page | page | page | page | page |
+ * | checksum | checksum | checksum |
+ * ^ ^
+ * | |
+ * | page_num
+ * |
+ * per_page_recover_start
+ *
+ * |<- done ->|
+ */
+ int start_csum_part;
+ int next_csum_part;
+ int sub_page_num;
+
+ /*
+ * Don't worry about start_csum_part being rounded in the
+ * calculation, because per_page_recover_start should always
+ * be aligned to a checksum boundary
+ */
+ start_csum_part = per_page_recover_start * 8 *
+ sblock_to_check->sctx->sectorsize /
+ sblock_to_check->sctx->nodesize;
+ start_csum_part = start_csum_part ? : 1;
+ next_csum_part = (page_num + 1) * 8 *
+ sblock_to_check->sctx->sectorsize /
+ sblock_to_check->sctx->nodesize;
+ next_csum_part = next_csum_part ? : 1;
+
+ if (next_csum_part == start_csum_part) {
+ /* this page hasn't wrapped to the next checksum */
+ continue;
+ }
+
+ /*
+ * find which mirror has correct data for the current
+ * csum parts
+ */
+ for (mirror_index = 0;
+ mirror_index < BTRFS_MAX_MIRRORS &&
+ sblocks_for_recheck[mirror_index].page_count > 0;
+ mirror_index++) {
+ int csum_part;
+
+ for (csum_part = start_csum_part;
+ csum_part < next_csum_part; csum_part++) {
+ if (!scrub_check_node_checksum(
+ sblocks_for_recheck +
+ mirror_index, csum_part,
+ node_csum)) {
+ break;
+ }
+ }
+ if (csum_part == next_csum_part) {
+ /*
+ * all parts of this mirror have a correct csum
+ */
+ sblock_other = sblocks_for_recheck +
+ mirror_index;
+ break;
+ }
+ }
+
+ if (sctx->is_dev_replace) {
+ if (!sblock_other)
+ sblock_other = sblock_bad;
+
+ for (sub_page_num = per_page_recover_start;
+ sub_page_num <= page_num; sub_page_num++) {
+ if (scrub_write_page_to_dev_replace(
+ sblock_other,
+ sub_page_num) != 0) {
+ btrfs_dev_replace_stats_inc(
+ &sctx->dev_root->
+ fs_info->dev_replace.
+ num_write_errors);
+ success = 0;
+ }
+ }
+ } else if (sblock_other) {
+ for (sub_page_num = per_page_recover_start;
+ sub_page_num <= page_num; sub_page_num++) {
+ if (!scrub_repair_page_from_good_copy(
+ sblock_bad,
+ sblock_other,
+ sub_page_num, 0))
+ page_bad->io_error = 0;
+ else
+ success = 0;
+ }
+ }
+
+ per_page_recover_start = page_num + 1;
+
+ continue;
+ }
/* skip no-io-error page in scrub */
if (!page_bad->io_error && !sctx->is_dev_replace)
continue;
@@ -1321,6 +1524,7 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
struct btrfs_fs_info *fs_info = sctx->dev_root->fs_info;
u64 length = original_sblock->page_count * PAGE_SIZE;
u64 logical = original_sblock->pagev[0]->logical;
+ u64 generation = original_sblock->pagev[0]->generation;
struct scrub_recover *recover;
struct btrfs_bio *bbio;
u64 sublen;
@@ -1387,7 +1591,7 @@ leave_nomem:
scrub_page_get(page);
sblock->pagev[page_index] = page;
page->logical = logical;
-
+ page->generation = generation;
scrub_stripe_index_and_offset(logical,
bbio->map_type,
bbio->raid_map,
@@ -1839,6 +2043,7 @@ static int scrub_checksum(struct scrub_block *sblock)
WARN_ON(sblock->page_count < 1);
flags = sblock->pagev[0]->flags;
ret = 0;
+
if (flags & BTRFS_EXTENT_FLAG_DATA)
ret = scrub_checksum_data(sblock);
else if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK)
--
2.4.5
* Re: [RFC PATCH 0/2] Btrfs partial csum support
2015-07-10 4:09 [RFC PATCH 0/2] Btrfs partial csum support Qu Wenruo
2015-07-10 4:09 ` [RFC PATCH 1/2] btrfs: csum: Introduce partial csum for tree block Qu Wenruo
2015-07-10 4:09 ` [RFC PATCH 2/2] btrfs: scrub: Add support partial csum Qu Wenruo
@ 2015-07-20 8:12 ` Qu Wenruo
2 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2015-07-20 8:12 UTC (permalink / raw)
To: linux-btrfs
Ping
Any comments?
Thanks,
Qu
Qu Wenruo wrote on 2015/07/10 12:09 +0800:
> This patchset will add partial csum support for btrfs.
>
> Partial csum will take full advantage of the 32-byte csum space inside
> the tree block, while still maintaining backward compatibility with old
> kernels.
>
> The overall idea is like the following on 16K leaf:
> [Old tree block csum]
> 0 4 8 12 16 20 24 28 32
> -------------------------------------------------
> |csum | unused, all 0 |
> -------------------------------------------------
> Csum is the crc32 of the whole tree block data.
>
> [New tree block csum]
> -------------------------------------------------
> |csum0|csum1|csum2|csum3|csum4|csum5|csum6|csum7|
> -------------------------------------------------
> Where csum0 is the same as the old one, crc32 of the whole tree block
> data.
>
> And csum1~csum7 store the crc32 of each eighth part of the block.
> Taking a 16K leafsize as an example:
> csum1: crc32 of BTRFS_CSUM_SIZE~4K
> csum2: crc32 of 4K~6K
> ...
> csum7: crc32 of 14K~16K
>
>
> When nodesize is small, like 4K, partial csum is completely useless.
> But when nodesize grows, like 32K, each partial csum covers just a
> page, making scrub able to judge which page is OK even without reading
> out the whole tree block.
>
> It also adds the possibility of fixing the case where corruption
> happens on every mirror but in different parts.
> Such a case becomes more likely as nodesize goes beyond 16K.
>
> Qu Wenruo (1):
> btrfs: csum: Introduce partial csum for tree block.
>
> Zhao Lei (1):
> btrfs: scrub: Add support partial csum
>
> fs/btrfs/disk-io.c | 74 ++++++++++++-------
> fs/btrfs/scrub.c | 207 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 255 insertions(+), 26 deletions(-)
>