Re: [PATCH] Btrfs: fix read corruption of compressed and shared extents

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Liu Bo <bo.li.liu@oracle.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org, Filipe Manana <fdmanana@suse.com>,
	stable@vger.kernel.org
Subject: Re: [PATCH] Btrfs: fix read corruption of compressed and shared extents
Date: Tue, 15 Sep 2015 16:09:15 +0800	[thread overview]
Message-ID: <20150915080914.GA4634@localhost.localdomain> (raw)
In-Reply-To: <1442219346-14641-1-git-send-email-fdmanana@kernel.org>

On Mon, Sep 14, 2015 at 09:29:06AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> If a file has a range pointing to a compressed extent, followed by
> another range that points to the same compressed extent and a read
> operation attempts to read both ranges (either completely or part of
> them), the pages that correspond to the second range are incorrectly
> filled with zeroes.
> 
> Consider the following example:
> 
>   File layout
>   [0 - 8K]                      [8K - 24K]
>       |                             |
>       |                             |
>    points to extent X,         points to extent X,
>    offset 4K, length of 8K     offset 0, length 16K
> 
>   [extent X, compressed length = 4K uncompressed length = 16K]
> 
> If a readpages() call spans the 2 ranges, a single bio to read the extent
> is submitted - extent_io.c:submit_extent_page() would only create a new
> bio to cover the second range pointing to the extent if the extent it
> points to had a different logical address than the extent associated with
> the first range. This has a consequence of the compressed read end io
> handler (compression.c:end_compressed_bio_read()) finish once the extent
> is decompressed into the pages covering the first range, leaving the
> remaining pages (belonging to the second range) filled with zeroes (done
> by compression.c:btrfs_clear_biovec_end()).
> 
> So fix this by submitting the current bio whenever we find a range
> pointing to a compressed extent that was preceded by a range with a
> different extent map. This is the simplest solution for this corner
> case. Making the end io callback populate both ranges (or more, if we
> have multiple pointing to the same extent) is a much more complex
> solution since each bio is tightly coupled with a single extent map and
> the extent maps associated to the ranges pointing to the shared extent
> can have different offsets and lengths.
> 
> The following test case for fstests triggers the issue:
> 
>   seq=`basename $0`
>   seqres=$RESULT_DIR/$seq
>   echo "QA output created by $seq"
>   tmp=/tmp/$$
>   status=1	# failure is the default!
>   trap "_cleanup; exit \$status" 0 1 2 3 15
> 
>   _cleanup()
>   {
>       rm -f $tmp.*
>   }
> 
>   # get standard environment, filters and checks
>   . ./common/rc
>   . ./common/filter
> 
>   # real QA test starts here
>   _need_to_be_root
>   _supported_fs btrfs
>   _supported_os Linux
>   _require_scratch
>   _require_cloner
> 
>   rm -f $seqres.full
> 
>   test_clone_and_read_compressed_extent()
>   {
>       local mount_opts=$1
> 
>       _scratch_mkfs >>$seqres.full 2>&1
>       _scratch_mount $mount_opts
> 
>       # Create a test file with a single extent that is compressed (the
>       # data we write into it is highly compressible no matter which
>       # compression algorithm is used, zlib or lzo).
>       $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K"        \
>                       -c "pwrite -S 0xbb 4K 8K"        \
>                       -c "pwrite -S 0xcc 12K 4K"       \
>                       $SCRATCH_MNT/foo | _filter_xfs_io
> 
>       # Now clone our extent into an adjacent offset.
>       $CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
>           $SCRATCH_MNT/foo $SCRATCH_MNT/foo
> 
>       # Same as before but for this file we clone the extent into a lower
>       # file offset.
>       $XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K"         \
>                       -c "pwrite -S 0xbb 12K 8K"        \
>                       -c "pwrite -S 0xcc 20K 4K"        \
>                       $SCRATCH_MNT/bar | _filter_xfs_io
> 
>       $CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
>           $SCRATCH_MNT/bar $SCRATCH_MNT/bar
> 
>       echo "File digests before unmounting filesystem:"
>       md5sum $SCRATCH_MNT/foo | _filter_scratch
>       md5sum $SCRATCH_MNT/bar | _filter_scratch
> 
>       # Evicting the inode or clearing the page cache before reading
>       # again the file would also trigger the bug - reads were returning
>       # all bytes in the range corresponding to the second reference to
>       # the extent with a value of 0, but the correct data was persisted
>       # (it was a bug exclusively in the read path). The issue happened
>       # only if the same readpages() call targeted pages belonging to the
>       # first and second ranges that point to the same compressed extent.
>       _scratch_remount
> 
>       echo "File digests after mounting filesystem again:"
>       # Must match the same digests we got before.
>       md5sum $SCRATCH_MNT/foo | _filter_scratch
>       md5sum $SCRATCH_MNT/bar | _filter_scratch
>   }
> 
>   echo -e "\nTesting with zlib compression..."
>   test_clone_and_read_compressed_extent "-o compress=zlib"
> 
>   _scratch_unmount
> 
>   echo -e "\nTesting with lzo compression..."
>   test_clone_and_read_compressed_extent "-o compress=lzo"
> 
>   status=0
>   exit

Looks good to me.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thanks,

-liubo
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/extent_io.c | 65 +++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 57 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index fa19f2f..11aa8f7 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2805,7 +2805,8 @@ static int submit_extent_page(int rw, struct extent_io_tree *tree,
>  			      bio_end_io_t end_io_func,
>  			      int mirror_num,
>  			      unsigned long prev_bio_flags,
> -			      unsigned long bio_flags)
> +			      unsigned long bio_flags,
> +			      bool force_bio_submit)
>  {
>  	int ret = 0;
>  	struct bio *bio;
> @@ -2823,6 +2824,7 @@ static int submit_extent_page(int rw, struct extent_io_tree *tree,
>  			contig = bio_end_sector(bio) == sector;
>  
>  		if (prev_bio_flags != bio_flags || !contig ||
> +		    force_bio_submit ||
>  		    merge_bio(rw, tree, page, offset, page_size, bio, bio_flags) ||
>  		    bio_add_page(bio, page, page_size, offset) < page_size) {
>  			ret = submit_one_bio(rw, bio, mirror_num,
> @@ -2922,7 +2924,8 @@ static int __do_readpage(struct extent_io_tree *tree,
>  			 get_extent_t *get_extent,
>  			 struct extent_map **em_cached,
>  			 struct bio **bio, int mirror_num,
> -			 unsigned long *bio_flags, int rw)
> +			 unsigned long *bio_flags, int rw,
> +			 u64 *prev_em_start)
>  {
>  	struct inode *inode = page->mapping->host;
>  	u64 start = page_offset(page);
> @@ -2970,6 +2973,7 @@ static int __do_readpage(struct extent_io_tree *tree,
>  	}
>  	while (cur <= end) {
>  		unsigned long pnr = (last_byte >> PAGE_CACHE_SHIFT) + 1;
> +		bool force_bio_submit = false;
>  
>  		if (cur >= last_byte) {
>  			char *userpage;
> @@ -3020,6 +3024,49 @@ static int __do_readpage(struct extent_io_tree *tree,
>  		block_start = em->block_start;
>  		if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
>  			block_start = EXTENT_MAP_HOLE;
> +
> +		/*
> +		 * If we have a file range that points to a compressed extent
> +		 * and it's followed by a consecutive file range that points to
> +		 * to the same compressed extent (possibly with a different
> +		 * offset and/or length, so it either points to the whole extent
> +		 * or only part of it), we must make sure we do not submit a
> +		 * single bio to populate the pages for the 2 ranges because
> +		 * this makes the compressed extent read zero out the pages
> +		 * belonging to the 2nd range. Imagine the following scenario:
> +		 *
> +		 *  File layout
> +		 *  [0 - 8K]                     [8K - 24K]
> +		 *    |                               |
> +		 *    |                               |
> +		 * points to extent X,         points to extent X,
> +		 * offset 4K, length of 8K     offset 0, length 16K
> +		 *
> +		 * [extent X, compressed length = 4K uncompressed length = 16K]
> +		 *
> +		 * If the bio to read the compressed extent covers both ranges,
> +		 * it will decompress extent X into the pages belonging to the
> +		 * first range and then it will stop, zeroing out the remaining
> +		 * pages that belong to the other range that points to extent X.
> +		 * So here we make sure we submit 2 bios, one for the first
> +		 * range and another one for the third range. Both will target
> +		 * the same physical extent from disk, but we can't currently
> +		 * make the compressed bio endio callback populate the pages
> +		 * for both ranges because each compressed bio is tightly
> +		 * coupled with a single extent map, and each range can have
> +		 * an extent map with a different offset value relative to the
> +		 * uncompressed data of our extent and different lengths. This
> +		 * is a corner case so we prioritize correctness over
> +		 * non-optimal behavior (submitting 2 bios for the same extent).
> +		 */
> +		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags) &&
> +		    prev_em_start && *prev_em_start != (u64)-1 &&
> +		    *prev_em_start != em->orig_start)
> +			force_bio_submit = true;
> +
> +		if (prev_em_start)
> +			*prev_em_start = em->orig_start;
> +
>  		free_extent_map(em);
>  		em = NULL;
>  
> @@ -3069,7 +3116,8 @@ static int __do_readpage(struct extent_io_tree *tree,
>  					 bdev, bio, pnr,
>  					 end_bio_extent_readpage, mirror_num,
>  					 *bio_flags,
> -					 this_bio_flag);
> +					 this_bio_flag,
> +					 force_bio_submit);
>  		if (!ret) {
>  			nr++;
>  			*bio_flags = this_bio_flag;
> @@ -3101,6 +3149,7 @@ static inline void __do_contiguous_readpages(struct extent_io_tree *tree,
>  	struct inode *inode;
>  	struct btrfs_ordered_extent *ordered;
>  	int index;
> +	u64 prev_em_start = (u64)-1;
>  
>  	inode = pages[0]->mapping->host;
>  	while (1) {
> @@ -3116,7 +3165,7 @@ static inline void __do_contiguous_readpages(struct extent_io_tree *tree,
>  
>  	for (index = 0; index < nr_pages; index++) {
>  		__do_readpage(tree, pages[index], get_extent, em_cached, bio,
> -			      mirror_num, bio_flags, rw);
> +			      mirror_num, bio_flags, rw, &prev_em_start);
>  		page_cache_release(pages[index]);
>  	}
>  }
> @@ -3184,7 +3233,7 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
>  	}
>  
>  	ret = __do_readpage(tree, page, get_extent, NULL, bio, mirror_num,
> -			    bio_flags, rw);
> +			    bio_flags, rw, NULL);
>  	return ret;
>  }
>  
> @@ -3210,7 +3259,7 @@ int extent_read_full_page_nolock(struct extent_io_tree *tree, struct page *page,
>  	int ret;
>  
>  	ret = __do_readpage(tree, page, get_extent, NULL, &bio, mirror_num,
> -				      &bio_flags, READ);
> +			    &bio_flags, READ, NULL);
>  	if (bio)
>  		ret = submit_one_bio(READ, bio, mirror_num, bio_flags);
>  	return ret;
> @@ -3463,7 +3512,7 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
>  						 sector, iosize, pg_offset,
>  						 bdev, &epd->bio, max_nr,
>  						 end_bio_extent_writepage,
> -						 0, 0, 0);
> +						 0, 0, 0, false);
>  			if (ret)
>  				SetPageError(page);
>  		}
> @@ -3765,7 +3814,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>  		ret = submit_extent_page(rw, tree, wbc, p, offset >> 9,
>  					 PAGE_CACHE_SIZE, 0, bdev, &epd->bio,
>  					 -1, end_bio_extent_buffer_writepage,
> -					 0, epd->bio_flags, bio_flags);
> +					 0, epd->bio_flags, bio_flags, false);
>  		epd->bio_flags = bio_flags;
>  		if (ret) {
>  			set_btree_ioerr(p);
> -- 
> 2.1.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

     prev parent reply	other threads:[~2015-09-15  8:09 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-09-14  8:29 [PATCH] Btrfs: fix read corruption of compressed and shared extents fdmanana
2015-09-14  9:08 ` Qu Wenruo
2015-09-14  9:22   ` Filipe Manana
2015-09-14  9:34     ` Qu Wenruo
2015-09-14 12:50       ` Chris Mason
2015-09-15  8:09 ` Liu Bo [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150915080914.GA4634@localhost.localdomain \
    --to=bo.li.liu@oracle.com \
    --cc=fdmanana@kernel.org \
    --cc=fdmanana@suse.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).