public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Boris Burkov <boris@bur.io>
To: Christoph Hellwig <hch@lst.de>
Cc: Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>,
	David Sterba <dsterba@suse.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 4/4] btrfs: fix repair of compressed extents
Date: Wed, 29 Jun 2022 17:18:07 -0700	[thread overview]
Message-ID: <YrzrvwOx8/Jgf2Co@zen> (raw)
In-Reply-To: <20220623055338.3833616-5-hch@lst.de>

On Thu, Jun 23, 2022 at 07:53:38AM +0200, Christoph Hellwig wrote:
> Currently the checksum of compressed extents is verified based on the
> compressed data and the lower btrfs_bio, but the actual repair process
> is driven by end_bio_extent_readpage on the upper btrfs_bio for the
> decompressed data.
> 
> This has a bunch of issues, including not being able to properly
> communicate the failed mirror up in case that the I/O submission got
> preempted, a general loss of if an error was an I/O error or a checksum
> verification failure, but most importantly that this design causes
> btrfs_clean_io_failure to eventually write back the uncompressed good
> data onto the disk sectors that are supposed to contain compressed data.
> 
> Fix this by moving the repair to the lower btrfs_bio.  To do so, a fair
> amount of code has to be reshuffled:
> 
>  a) the lower btrfs_bio now needs a valid csum pointer.  The easiest way
>     to archive that is to pass NULL btrfs_lookup_bio_sums and just use
>     the btrfs_bio management of csums.  For a compressed_bio that is
>     split into multiple btrfs_bios this mean additional memory
>     allocations, but the code becomes a lot more regular.
>  b) checksum verifiaction now runs diretly on the lower btrfs_bio instead
>     of the compressed_bio.  This actually nicely simplifies the end I/O
>     processing.
>  c) btrfs_repair_one_sector can't just look up the logical address for
>     the file offset any more, as there is no coresponding relative
>     offsets that apply to the file offset and the logic address for
>     compressed extents.  Instead require that the saved bvec_iter in the
>     btrfs_bio is filled out for all read bios and use that, which again
>     removes a fair amount of code.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/btrfs/compression.c | 171 ++++++++++++++---------------------------
>  fs/btrfs/compression.h |   7 --
>  fs/btrfs/ctree.h       |   2 +
>  fs/btrfs/extent_io.c   |  46 +++--------
>  fs/btrfs/extent_io.h   |   1 -
>  fs/btrfs/inode.c       |   7 ++
>  6 files changed, 77 insertions(+), 157 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index e756da640fd7b..c8b14a5bd89be 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -136,66 +136,14 @@ static int compression_decompress(int type, struct list_head *ws,
>  
>  static int btrfs_decompress_bio(struct compressed_bio *cb);
>  
> -static inline int compressed_bio_size(struct btrfs_fs_info *fs_info,
> -				      unsigned long disk_size)
> -{
> -	return sizeof(struct compressed_bio) +
> -		(DIV_ROUND_UP(disk_size, fs_info->sectorsize)) * fs_info->csum_size;
> -}
> -
> -static int check_compressed_csum(struct btrfs_inode *inode, struct bio *bio,
> -				 u64 disk_start)
> -{
> -	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -	const u32 csum_size = fs_info->csum_size;
> -	const u32 sectorsize = fs_info->sectorsize;
> -	struct page *page;
> -	unsigned int i;
> -	u8 csum[BTRFS_CSUM_SIZE];
> -	struct compressed_bio *cb = bio->bi_private;
> -	u8 *cb_sum = cb->sums;
> -
> -	if ((inode->flags & BTRFS_INODE_NODATASUM) ||
> -	    test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state))
> -		return 0;
> -
> -	for (i = 0; i < cb->nr_pages; i++) {
> -		u32 pg_offset;
> -		u32 bytes_left = PAGE_SIZE;
> -		page = cb->compressed_pages[i];
> -
> -		/* Determine the remaining bytes inside the page first */
> -		if (i == cb->nr_pages - 1)
> -			bytes_left = cb->compressed_len - i * PAGE_SIZE;
> -
> -		/* Hash through the page sector by sector */
> -		for (pg_offset = 0; pg_offset < bytes_left;
> -		     pg_offset += sectorsize) {
> -			int ret;
> -
> -			ret = btrfs_check_sector_csum(fs_info, page, pg_offset,
> -						      csum, cb_sum);
> -			if (ret) {
> -				btrfs_print_data_csum_error(inode, disk_start,
> -						csum, cb_sum, cb->mirror_num);
> -				if (btrfs_bio(bio)->device)
> -					btrfs_dev_stat_inc_and_print(
> -						btrfs_bio(bio)->device,
> -						BTRFS_DEV_STAT_CORRUPTION_ERRS);
> -				return -EIO;
> -			}
> -			cb_sum += csum_size;
> -			disk_start += sectorsize;
> -		}
> -	}
> -	return 0;
> -}
> -
>  static void finish_compressed_bio_read(struct compressed_bio *cb)
>  {
>  	unsigned int index;
>  	struct page *page;
>  
> +	if (cb->status == BLK_STS_OK)
> +		cb->status = errno_to_blk_status(btrfs_decompress_bio(cb));
> +
>  	/* Release the compressed pages */
>  	for (index = 0; index < cb->nr_pages; index++) {
>  		page = cb->compressed_pages[index];
> @@ -233,59 +181,54 @@ static void finish_compressed_bio_read(struct compressed_bio *cb)
>  	kfree(cb);
>  }
>  
> -/* when we finish reading compressed pages from the disk, we
> - * decompress them and then run the bio end_io routines on the
> - * decompressed pages (in the inode address space).
> - *
> - * This allows the checksumming and other IO error handling routines
> - * to work normally
> - *
> - * The compressed pages are freed here, and it must be run
> - * in process context
> +/*
> + * Verify the checksums and kick off repair if needed on the uncompressed data
> + * before decompressing it into the original bio and freeing the uncompressed
> + * pages.
>   */
>  static void end_compressed_bio_read(struct bio *bio)
>  {
>  	struct compressed_bio *cb = bio->bi_private;
> -	struct inode *inode;
> -	unsigned int mirror = btrfs_bio(bio)->mirror_num;
> -	int ret = 0;
> -
> -	if (bio->bi_status)
> -		cb->status = bio->bi_status;
> -
> -	if (!refcount_dec_and_test(&cb->pending_ios))
> -		goto out;
> -
> -	/*
> -	 * Record the correct mirror_num in cb->orig_bio so that
> -	 * read-repair can work properly.
> -	 */
> -	btrfs_bio(cb->orig_bio)->mirror_num = mirror;
> -	cb->mirror_num = mirror;
> -
> -	/*
> -	 * Some IO in this cb have failed, just skip checksum as there
> -	 * is no way it could be correct.
> -	 */
> -	if (cb->status != BLK_STS_OK)
> -		goto csum_failed;
> +	struct inode *inode = cb->inode;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	struct btrfs_inode *bi = BTRFS_I(inode);
> +	bool csum = !(bi->flags & BTRFS_INODE_NODATASUM) &&
> +		    !test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state);
> +	blk_status_t status = bio->bi_status;
> +	struct btrfs_bio *bbio = btrfs_bio(bio);
> +	struct bvec_iter iter;
> +	struct bio_vec bv;
> +	u32 offset;
> +
> +	btrfs_bio_for_each_sector(fs_info, bv, bbio, iter, offset) {
> +		u64 start = bbio->file_offset + offset;
> +
> +		if (!status &&
> +		    (!csum || !check_data_csum(inode, bbio, offset, bv.bv_page,
> +					       bv.bv_offset))) {
> +			clean_io_failure(fs_info, &bi->io_failure_tree,
> +					 &bi->io_tree, start, bv.bv_page,
> +					 btrfs_ino(bi), bv.bv_offset);
> +		} else {
> +			int ret;
>  
> -	inode = cb->inode;
> -	ret = check_compressed_csum(BTRFS_I(inode), bio,
> -				    bio->bi_iter.bi_sector << 9);
> -	if (ret)
> -		goto csum_failed;
> +			refcount_inc(&cb->pending_ios);
> +			ret = btrfs_repair_one_sector(inode, bbio, offset,
> +					bv.bv_page, bv.bv_offset,
> +					btrfs_submit_data_read_bio);
> +			if (ret) {
> +				refcount_dec(&cb->pending_ios);
> +				status = errno_to_blk_status(ret);
> +			}
> +		}
> +	}
>  
> -	/* ok, we're the last bio for this extent, lets start
> -	 * the decompression.
> -	 */
> -	ret = btrfs_decompress_bio(cb);
> +	if (status)
> +		cb->status = status;
>  
> -csum_failed:
> -	if (ret)
> -		cb->status = errno_to_blk_status(ret);
> -	finish_compressed_bio_read(cb);
> -out:
> +	if (refcount_dec_and_test(&cb->pending_ios))
> +		finish_compressed_bio_read(cb);
> +	btrfs_bio_free_csum(bbio);
>  	bio_put(bio);
>  }
>  
> @@ -478,7 +421,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>  
>  	ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
>  	       IS_ALIGNED(len, fs_info->sectorsize));
> -	cb = kmalloc(compressed_bio_size(fs_info, compressed_len), GFP_NOFS);
> +	cb = kmalloc(sizeof(struct compressed_bio), GFP_NOFS);
>  	if (!cb)
>  		return BLK_STS_RESOURCE;
>  	refcount_set(&cb->pending_ios, 1);
> @@ -486,7 +429,6 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>  	cb->inode = &inode->vfs_inode;
>  	cb->start = start;
>  	cb->len = len;
> -	cb->mirror_num = 0;
>  	cb->compressed_pages = compressed_pages;
>  	cb->compressed_len = compressed_len;
>  	cb->writeback = writeback;
> @@ -755,7 +697,6 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  	blk_status_t ret;
>  	int ret2;
>  	int i;
> -	u8 *sums;
>  
>  	em_tree = &BTRFS_I(inode)->extent_tree;
>  
> @@ -773,7 +714,7 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  
>  	ASSERT(em->compress_type != BTRFS_COMPRESS_NONE);
>  	compressed_len = em->block_len;
> -	cb = kmalloc(compressed_bio_size(fs_info, compressed_len), GFP_NOFS);
> +	cb = kmalloc(sizeof(struct compressed_bio), GFP_NOFS);
>  	if (!cb) {
>  		ret = BLK_STS_RESOURCE;
>  		goto out;
> @@ -782,8 +723,6 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  	refcount_set(&cb->pending_ios, 1);
>  	cb->status = BLK_STS_OK;
>  	cb->inode = inode;
> -	cb->mirror_num = mirror_num;
> -	sums = cb->sums;
>  
>  	cb->start = em->orig_start;
>  	em_len = em->len;
> @@ -867,19 +806,25 @@ void btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
>  			submit = true;
>  
>  		if (submit) {
> -			unsigned int nr_sectors;
> +			/* Save the original iter for read repair */
> +			if (bio_op(comp_bio) == REQ_OP_READ)
> +				btrfs_bio(comp_bio)->iter = comp_bio->bi_iter;
> +
> +			/*
> +			 * Just stash the initial offset of this chunk, as there
> +			 * is no direct correlation between compressed pages and
> +			 * the original file offset.  The field is only used for
> +			 * priting error messages anyway.
> +			 */
> +			btrfs_bio(comp_bio)->file_offset = file_offset;
>  
> -			ret = btrfs_lookup_bio_sums(inode, comp_bio, sums);
> +			ret = btrfs_lookup_bio_sums(inode, comp_bio, NULL);
>  			if (ret) {
>  				comp_bio->bi_status = ret;
>  				bio_endio(comp_bio);
>  				break;
>  			}
>  
> -			nr_sectors = DIV_ROUND_UP(comp_bio->bi_iter.bi_size,
> -						  fs_info->sectorsize);
> -			sums += fs_info->csum_size * nr_sectors;
> -
>  			ASSERT(comp_bio->bi_iter.bi_size);
>  			btrfs_submit_bio(fs_info, comp_bio, mirror_num);
>  			comp_bio = NULL;
> diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
> index 0e4cbf04fd866..e9ef24034cad0 100644
> --- a/fs/btrfs/compression.h
> +++ b/fs/btrfs/compression.h
> @@ -59,19 +59,12 @@ struct compressed_bio {
>  
>  	/* IO errors */
>  	blk_status_t status;
> -	int mirror_num;
>  
>  	union {
>  		/* For reads, this is the bio we are copying the data into */
>  		struct bio *orig_bio;
>  		struct work_struct write_end_work;
>  	};
> -
> -	/*
> -	 * the start of a variable length array of checksums only
> -	 * used by reads
> -	 */
> -	u8 sums[];
>  };
>  
>  static inline unsigned int btrfs_compress_type(unsigned int type_level)
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 164f54e6aa447..12f59e35755fa 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3290,6 +3290,8 @@ void btrfs_submit_data_read_bio(struct inode *inode, struct bio *bio,
>  			int mirror_num, enum btrfs_compression_type compress_type);
>  int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, struct page *page,
>  			    u32 pgoff, u8 *csum, const u8 * const csum_expected);

As far as I can tell, this is redundant with the last patch.

> +int check_data_csum(struct inode *inode, struct btrfs_bio *bbio, u32 bio_offset,
> +		    struct page *page, u32 pgoff);
>  unsigned int btrfs_verify_data_csum(struct btrfs_bio *bbio,
>  				    u32 bio_offset, struct page *page,
>  				    u64 start, u64 end);
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index ec7bdb3fa0921..587d2ba20b53b 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2543,13 +2543,10 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	u64 start = bbio->file_offset + bio_offset;
>  	struct io_failure_record *failrec;
> -	struct extent_map *em;
>  	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
>  	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> -	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
>  	const u32 sectorsize = fs_info->sectorsize;
>  	int ret;
> -	u64 logical;
>  
>  	failrec = get_state_failrec(failure_tree, start);
>  	if (!IS_ERR(failrec)) {
> @@ -2573,41 +2570,14 @@ static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode
>  	failrec->start = start;
>  	failrec->len = sectorsize;
>  	failrec->failed_mirror = failrec->this_mirror = bbio->mirror_num;
> -	failrec->compress_type = BTRFS_COMPRESS_NONE;
> -
> -	read_lock(&em_tree->lock);
> -	em = lookup_extent_mapping(em_tree, start, failrec->len);
> -	if (!em) {
> -		read_unlock(&em_tree->lock);
> -		kfree(failrec);
> -		return ERR_PTR(-EIO);
> -	}
> -
> -	if (em->start > start || em->start + em->len <= start) {
> -		free_extent_map(em);
> -		em = NULL;
> -	}
> -	read_unlock(&em_tree->lock);
> -	if (!em) {
> -		kfree(failrec);
> -		return ERR_PTR(-EIO);
> -	}
> -
> -	logical = start - em->start;
> -	logical = em->block_start + logical;
> -	if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> -		logical = em->block_start;
> -		failrec->compress_type = em->compress_type;
> -	}
> +	failrec->logical = (bbio->iter.bi_sector << SECTOR_SHIFT) + bio_offset;
>  
>  	btrfs_debug(fs_info,
> -		    "Get IO Failure Record: (new) logical=%llu, start=%llu, len=%llu",
> -		    logical, start, failrec->len);
> -
> -	failrec->logical = logical;
> -	free_extent_map(em);
> +		    "Get IO Failure Record: (new) logical=%llu, start=%llu",
> +		    failrec->logical, start);
>  
> -	failrec->num_copies = btrfs_num_copies(fs_info, logical, sectorsize);
> +	failrec->num_copies = btrfs_num_copies(fs_info, failrec->logical,
> +					       sectorsize);
>  	if (failrec->num_copies == 1) {
>  		/*
>  		 * we only have a single copy of the data, so don't bother with
> @@ -2709,7 +2679,7 @@ int btrfs_repair_one_sector(struct inode *inode, struct btrfs_bio *failed_bbio,
>  	 * will be handled by the endio on the repair_bio, so we can't return an
>  	 * error here.
>  	 */
> -	submit_bio_hook(inode, repair_bio, failrec->this_mirror, failrec->compress_type);
> +	submit_bio_hook(inode, repair_bio, failrec->this_mirror, 0);
>  	return BLK_STS_OK;
>  }
>  
> @@ -3115,6 +3085,10 @@ static void end_bio_extent_readpage(struct bio *bio)
>  			 * Only try to repair bios that actually made it to a
>  			 * device.  If the bio failed to be submitted mirror
>  			 * is 0 and we need to fail it without retrying.
> +			 *
> +			 * This also includes the high level bios for compressed
> +			 * extents - these never make it to a device and repair
> +			 * is already handled on the lower compressed bio.
>  			 */
>  			if (mirror > 0)
>  				repair = true;
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index a78051c7627c4..9dec34c009e91 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -261,7 +261,6 @@ struct io_failure_record {
>  	u64 start;
>  	u64 len;
>  	u64 logical;
> -	enum btrfs_compression_type compress_type;
>  	int this_mirror;
>  	int failed_mirror;
>  	int num_copies;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 429428fde4a88..eea351216db33 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2707,6 +2707,9 @@ void btrfs_submit_data_read_bio(struct inode *inode, struct bio *bio,
>  		return;
>  	}
>  
> +	/* Save the original iter for read repair */
> +	btrfs_bio(bio)->iter = bio->bi_iter;
> +
>  	/*
>  	 * Lookup bio sums does extra checks around whether we need to csum or
>  	 * not, which is why we ignore skip_sum here.
> @@ -8000,6 +8003,10 @@ static void btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	struct btrfs_dio_private *dip = bio->bi_private;
>  	blk_status_t ret;
> +		
> +	/* Save the original iter for read repair */
> +	if (btrfs_op(bio) == BTRFS_MAP_READ)
> +		btrfs_bio(bio)->iter = bio->bi_iter;
>  
>  	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
>  		goto map;
> -- 
> 2.30.2
> 

  reply	other threads:[~2022-06-30  0:18 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-23  5:53 fix read repair on compressed extents Christoph Hellwig
2022-06-23  5:53 ` [PATCH 1/4] btrfs: simplify the pending I/O counting in struct compressed_bio Christoph Hellwig
2022-06-29 23:42   ` Boris Burkov
2022-06-30  4:22     ` Christoph Hellwig
2022-06-23  5:53 ` [PATCH 2/4] btrfs: pass a btrfs_bio to btrfs_repair_one_sector Christoph Hellwig
2022-06-29 23:44   ` Boris Burkov
2022-06-30  4:23     ` Christoph Hellwig
2022-06-23  5:53 ` [PATCH 3/4] btrfs: remove the start argument to check_data_csum Christoph Hellwig
2022-06-29 23:48   ` Boris Burkov
2022-06-23  5:53 ` [PATCH 4/4] btrfs: fix repair of compressed extents Christoph Hellwig
2022-06-30  0:18   ` Boris Burkov [this message]
2022-06-30  4:24     ` Christoph Hellwig
2022-06-23  8:14 ` fix read repair on " Qu Wenruo
2022-06-23 12:58   ` Christoph Hellwig
2022-06-29  8:42 ` Christoph Hellwig
2022-06-29 19:04   ` Boris Burkov
2022-06-29 19:08     ` Christoph Hellwig
2022-06-29 19:38       ` Boris Burkov
  -- strict thread matches above, loose matches on Subject: below --
2022-06-30 16:01 fix read repair on compressed extents v2 Christoph Hellwig
2022-06-30 16:01 ` [PATCH 4/4] btrfs: fix repair of compressed extents Christoph Hellwig
2022-07-07 12:50   ` Nikolay Borisov
2022-07-07 13:30     ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YrzrvwOx8/Jgf2Co@zen \
    --to=boris@bur.io \
    --cc=clm@fb.com \
    --cc=dsterba@suse.com \
    --cc=hch@lst.de \
    --cc=josef@toxicpanda.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox