From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Christoph Hellwig <hch@lst.de>, Chris Mason <clm@fb.com>,
Josef Bacik <josef@toxicpanda.com>,
David Sterba <dsterba@suse.com>
Cc: Qu Wenruo <wqu@suse.com>, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 12/15] btrfs: add new read repair infrastructure
Date: Wed, 18 May 2022 07:04:22 +0800 [thread overview]
Message-ID: <e2c4e54c-548b-2221-3bf7-31f6c14a03ee@gmx.com> (raw)
In-Reply-To: <20220517145039.3202184-13-hch@lst.de>
On 2022/5/17 22:50, Christoph Hellwig wrote:
> This adds a new read repair implementation for btrfs. It is synchronous
> in that the end I/O handlers call them, and will get back the results
> instead of potentially getting multiple concurrent calls back into the
> original end I/O handler. The synchronous nature has the following
> advantages:
>
> - there is no need for a per-I/O tree of I/O failures, as everything
> related to the I/O failure can be handled locally
> - not having separate repair end I/O helpers will in the future help
> to reuse the direct I/O bio from iomap for the actual submission and
> thus remove the btrfs_dio_private infrastructure
>
> Because submitting many sector size synchronous I/Os would be very
> slow when multiple sectors (or a whole read) fail, this new code instead
> submits a single read and repair write bio for each contiguous section.
> It uses a clone of the bio to do that and thus does not need to allocate
> any extra bio_vecs. Note that this cloning is open coded instead of
> using the block layer clone helpers as the clone is based on the saved
> iter in the btrfs_bio, and not bio.bi_iter, which at the point that the
> repair code is called has been advanced by the low-level driver.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/btrfs/Makefile | 2 +-
> fs/btrfs/read-repair.c | 211 +++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/read-repair.h | 33 +++++++
> fs/btrfs/super.c | 9 +-
> 4 files changed, 252 insertions(+), 3 deletions(-)
> create mode 100644 fs/btrfs/read-repair.c
> create mode 100644 fs/btrfs/read-repair.h
>
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index 99f9995670ea3..0b2605c750cab 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -31,7 +31,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
> backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
> uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
> block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
> - subpage.o tree-mod-log.o
> + subpage.o tree-mod-log.o read-repair.o
>
> btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
> btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
> diff --git a/fs/btrfs/read-repair.c b/fs/btrfs/read-repair.c
> new file mode 100644
> index 0000000000000..3ac93bfe09e4f
> --- /dev/null
> +++ b/fs/btrfs/read-repair.c
> @@ -0,0 +1,211 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2022 Christoph Hellwig.
> + */
> +#include "ctree.h"
> +#include "volumes.h"
> +#include "read-repair.h"
> +#include "btrfs_inode.h"
> +
> +static struct bio_set read_repair_bioset;
> +
> +static int next_mirror(struct btrfs_read_repair *rr, int cur_mirror)
> +{
> + if (cur_mirror == rr->num_copies)
> + return cur_mirror + 1 - rr->num_copies;
> + return cur_mirror + 1;
> +}
> +
> +static int prev_mirror(struct btrfs_read_repair *rr, int cur_mirror)
> +{
> + if (cur_mirror == 1)
> + return rr->num_copies;
> + return cur_mirror - 1;
> +}
> +
> +/*
> + * Clone a new bio from the src_bbio, using the saved iter in the btrfs_bio
> + * instead of bi_iter.
> + */
> +static struct btrfs_bio *btrfs_repair_bio_clone(struct btrfs_bio *src_bbio,
> + u64 offset, u32 size, unsigned int op)
> +{
> + struct btrfs_bio *bbio;
> + struct bio *bio;
> +
> + bio = bio_alloc_bioset(NULL, 0, op | REQ_SYNC, GFP_NOFS,
> + &read_repair_bioset);
> + bio_set_flag(bio, BIO_CLONED);
Do we need to bother setting the CLONED flag?

Without the CLONED flag, we could simply use bio_for_each_segment_all() in
the endio function without needing bbio->iter, which would save some memory.
> +
> + bio->bi_io_vec = src_bbio->bio.bi_io_vec;
> + bio->bi_iter = src_bbio->iter;
> + bio_advance(bio, offset);
> + bio->bi_iter.bi_size = size;
> +
> + bbio = btrfs_bio(bio);
> + memset(bbio, 0, offsetof(struct btrfs_bio, bio));
> + bbio->iter = bbio->bio.bi_iter;
> + bbio->file_offset = src_bbio->file_offset + offset;
> +
> + return bbio;
> +}
> +
> +static void btrfs_repair_one_mirror(struct btrfs_bio *read_bbio,
> + struct btrfs_bio *failed_bbio, struct inode *inode,
> + int bad_mirror)
> +{
> + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> + struct btrfs_inode *bi = BTRFS_I(inode);
> + u64 logical = read_bbio->iter.bi_sector << SECTOR_SHIFT;
> + u64 file_offset = read_bbio->file_offset;
> + struct btrfs_bio *write_bbio;
> + blk_status_t ret;
> +
> + /*
> + * For zoned file systems repair has to relocate the whole zone.
> + */
> + if (btrfs_repair_one_zone(fs_info, logical))
> + return;
> +
> + /*
> + * For RAID56, we can not just write the bad data back, as any write
> + * will trigger RMW and read back the corrupted on-disk stripe, causing
> + * further damage.
> + *
> + * Perform a special low-level repair that bypasses btrfs_map_bio.
> + */
> + if (btrfs_is_parity_mirror(fs_info, logical, fs_info->sectorsize)) {
> + struct bvec_iter iter;
> + struct bio_vec bv;
> + u32 offset;
> +
> + btrfs_bio_for_each_sector(fs_info, bv, read_bbio, iter, offset)
> + btrfs_repair_io_failure(fs_info, btrfs_ino(bi),
> + file_offset + offset,
> + fs_info->sectorsize,
> + logical + offset,
> + bv.bv_page, bv.bv_offset,
> + bad_mirror);
> + return;
> + }
> +
> + /*
> + * Otherwise just clone the whole bio and write it back to the
> + * previously bad mirror.
> + */
> + write_bbio = btrfs_repair_bio_clone(read_bbio, 0,
> + read_bbio->iter.bi_size, REQ_OP_WRITE);
Do we need to clone the whole bio?

Read repair is already a cold path under most circumstances, and having
more than one corrupted sector in a single read is rarer still.
> + ret = btrfs_map_bio_wait(fs_info, &write_bbio->bio, bad_mirror);
> + bio_put(&write_bbio->bio);
> +
> + btrfs_info_rl(fs_info,
> + "%s: root %lld ino %llu off %llu logical %llu/%u from good mirror %d",
> + ret ? "failed to correct read error" : "read error corrected",
> + bi->root->root_key.objectid, btrfs_ino(bi),
> + file_offset, logical, read_bbio->iter.bi_size, bad_mirror);
> +}
> +
> +static bool btrfs_repair_read_bio(struct btrfs_bio *bbio,
> + struct btrfs_bio *failed_bbio, struct inode *inode,
> + int read_mirror)
> +{
> + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> + u32 start_offset = bbio->file_offset - failed_bbio->file_offset;
> + u8 csum[BTRFS_CSUM_SIZE];
> + struct bvec_iter iter;
> + struct bio_vec bv;
> + u32 offset;
> +
> + if (btrfs_map_bio_wait(fs_info, &bbio->bio, read_mirror))
> + return false;
> +
> + if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
> + return true;
> +
> + btrfs_bio_for_each_sector(fs_info, bv, bbio, iter, offset) {
> + u8 *expected_csum =
> + btrfs_csum_ptr(fs_info, failed_bbio->csum,
> + start_offset + offset);
> +
> + if (btrfs_check_data_sector(fs_info, bv.bv_page, bv.bv_offset,
> + csum, expected_csum))
> + return false;
> + }
> +
> + return true;
> +}
> +
> +bool __btrfs_read_repair_end(struct btrfs_read_repair *rr,
> + struct btrfs_bio *failed_bbio, struct inode *inode,
> + u64 end_offset, repair_endio_t endio)
The reason I avoided "end" in my patchset is that anything named "end"
leads people to think it is related to endio.

In fact, when reviewing this function I initially assumed it was
endio-related, when it is really the equivalent of
btrfs_read_repair_finish().
> +{
> + u8 bad_mirror = failed_bbio->mirror_num;
> + u8 read_mirror = next_mirror(rr, bad_mirror);
> + struct btrfs_bio *read_bbio = NULL;
> + bool uptodate = false;
> +
> + do {
> + if (read_bbio)
> + bio_put(&read_bbio->bio);
> + read_bbio = btrfs_repair_bio_clone(failed_bbio,
> + rr->start_offset, end_offset - rr->start_offset,
> + REQ_OP_READ);
> + if (btrfs_repair_read_bio(read_bbio, failed_bbio, inode,
> + read_mirror)) {
A big NO-NO here.

btrfs_repair_read_bio() only returns true if every sector in the range
matches its csum.

Consider the following case:

Profile RAID1C3, 2 sectors to read, the initial mirror is 1.

Mirror 1: |X|X|
Mirror 2: |X| |
Mirror 3: | |X|

Here we would return -EIO, even though the read can be repaired by taking
the first sector from mirror 3 and the second sector from mirror 2.

That is a behavior regression, and it is why the original repair code and
all of my patchsets check sector by sector, not the full range at once.
Thanks,
Qu
> + do {
> + btrfs_repair_one_mirror(read_bbio, failed_bbio,
> + inode, bad_mirror);
> + } while ((bad_mirror = prev_mirror(rr, bad_mirror)) !=
> + failed_bbio->mirror_num);
> + uptodate = true;
> + break;
> + }
> + bad_mirror = read_mirror;
> + read_mirror = next_mirror(rr, bad_mirror);
> + } while (read_mirror != failed_bbio->mirror_num);
> +
> + if (endio)
> + endio(read_bbio, inode, uptodate);
> + bio_put(&read_bbio->bio);
> +
> + rr->in_use = false;
> + return uptodate;
> +}
> +
> +bool btrfs_read_repair_add(struct btrfs_read_repair *rr,
> + struct btrfs_bio *failed_bbio, struct inode *inode,
> + u64 start_offset)
> +{
> + if (rr->in_use)
> + return true;
> +
> + if (!rr->num_copies) {
> + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +
> + rr->num_copies = btrfs_num_copies(fs_info,
> + failed_bbio->iter.bi_sector << SECTOR_SHIFT,
> + failed_bbio->iter.bi_size);
> + }
> +
> + /*
> + * If there is no other copy of the data to recover from, give up now
> + * and don't even try to build up a larger batch.
> + */
> + if (rr->num_copies < 2)
> + return false;
> +
> + rr->in_use = true;
> + rr->start_offset = start_offset;
> + return true;
> +}
> +
> +void btrfs_read_repair_exit(void)
> +{
> + bioset_exit(&read_repair_bioset);
> +}
> +
> +int __init btrfs_read_repair_init(void)
> +{
> + return bioset_init(&read_repair_bioset, BIO_POOL_SIZE,
> + offsetof(struct btrfs_bio, bio), 0);
> +}
> diff --git a/fs/btrfs/read-repair.h b/fs/btrfs/read-repair.h
> new file mode 100644
> index 0000000000000..e371790af2b3e
> --- /dev/null
> +++ b/fs/btrfs/read-repair.h
> @@ -0,0 +1,33 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef BTRFS_READ_REPAIR_H
> +#define BTRFS_READ_REPAIR_H
> +
> +struct btrfs_read_repair {
> + u64 start_offset;
> + bool in_use;
> + int num_copies;
> +};
> +
> +typedef void (*repair_endio_t)(struct btrfs_bio *repair_bbio,
> + struct inode *inode, bool uptodate);
> +
> +bool btrfs_read_repair_add(struct btrfs_read_repair *rr,
> + struct btrfs_bio *failed_bbio, struct inode *inode,
> + u64 bio_offset);
> +bool __btrfs_read_repair_end(struct btrfs_read_repair *rr,
> + struct btrfs_bio *failed_bbio, struct inode *inode,
> + u64 end_offset, repair_endio_t end_io);
> +static inline bool btrfs_read_repair_end(struct btrfs_read_repair *rr,
> + struct btrfs_bio *failed_bbio, struct inode *inode,
> + u64 end_offset, repair_endio_t endio)
> +{
> + if (!rr->in_use)
> + return true;
> + return __btrfs_read_repair_end(rr, failed_bbio, inode, end_offset,
> + endio);
> +}
> +
> +int __init btrfs_read_repair_init(void);
> +void btrfs_read_repair_exit(void);
> +
> +#endif /* BTRFS_READ_REPAIR_H */
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index b1fdc6a26c76e..b33ad892c3058 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -48,6 +48,7 @@
> #include "block-group.h"
> #include "discard.h"
> #include "qgroup.h"
> +#include "read-repair.h"
> #define CREATE_TRACE_POINTS
> #include <trace/events/btrfs.h>
>
> @@ -2641,10 +2642,12 @@ static int __init init_btrfs_fs(void)
> err = extent_io_init();
> if (err)
> goto free_cachep;
> -
> - err = extent_state_cache_init();
> + err = btrfs_read_repair_init();
> if (err)
> goto free_extent_io;
> + err = extent_state_cache_init();
> + if (err)
> + goto free_read_repair;
>
> err = extent_map_init();
> if (err)
> @@ -2708,6 +2711,8 @@ static int __init init_btrfs_fs(void)
> extent_map_exit();
> free_extent_state_cache:
> extent_state_cache_exit();
> +free_read_repair:
> + btrfs_read_repair_exit();
> free_extent_io:
> extent_io_exit();
> free_cachep: