From: Liu Bo <bo.li.liu@oracle.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org, Filipe Manana <fdmanana@suse.com>,
stable@vger.kernel.org
Subject: Re: [PATCH] Btrfs: fix read corruption of compressed and shared extents
Date: Tue, 15 Sep 2015 16:09:15 +0800 [thread overview]
Message-ID: <20150915080914.GA4634@localhost.localdomain> (raw)
In-Reply-To: <1442219346-14641-1-git-send-email-fdmanana@kernel.org>
On Mon, Sep 14, 2015 at 09:29:06AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> If a file has a range pointing to a compressed extent, followed by
> another range that points to the same compressed extent and a read
> operation attempts to read both ranges (either completely or part of
> them), the pages that correspond to the second range are incorrectly
> filled with zeroes.
>
> Consider the following example:
>
> File layout
> [0 - 8K] [8K - 24K]
> | |
> | |
> points to extent X, points to extent X,
> offset 4K, length of 8K offset 0, length 16K
>
> [extent X, compressed length = 4K uncompressed length = 16K]
>
> If a readpages() call spans the 2 ranges, a single bio to read the extent
> is submitted - extent_io.c:submit_extent_page() would only create a new
> bio to cover the second range pointing to the extent if the extent it
> points to had a different logical address than the extent associated with
> the first range. This has a consequence of the compressed read end io
> handler (compression.c:end_compressed_bio_read()) finish once the extent
> is decompressed into the pages covering the first range, leaving the
> remaining pages (belonging to the second range) filled with zeroes (done
> by compression.c:btrfs_clear_biovec_end()).
>
> So fix this by submitting the current bio whenever we find a range
> pointing to a compressed extent that was preceded by a range with a
> different extent map. This is the simplest solution for this corner
> case. Making the end io callback populate both ranges (or more, if we
> have multiple pointing to the same extent) is a much more complex
> solution since each bio is tightly coupled with a single extent map and
> the extent maps associated to the ranges pointing to the shared extent
> can have different offsets and lengths.
>
> The following test case for fstests triggers the issue:
>
> seq=`basename $0`
> seqres=$RESULT_DIR/$seq
> echo "QA output created by $seq"
> tmp=/tmp/$$
> status=1 # failure is the default!
> trap "_cleanup; exit \$status" 0 1 2 3 15
>
> _cleanup()
> {
> rm -f $tmp.*
> }
>
> # get standard environment, filters and checks
> . ./common/rc
> . ./common/filter
>
> # real QA test starts here
> _need_to_be_root
> _supported_fs btrfs
> _supported_os Linux
> _require_scratch
> _require_cloner
>
> rm -f $seqres.full
>
> test_clone_and_read_compressed_extent()
> {
> local mount_opts=$1
>
> _scratch_mkfs >>$seqres.full 2>&1
> _scratch_mount $mount_opts
>
> # Create a test file with a single extent that is compressed (the
> # data we write into it is highly compressible no matter which
> # compression algorithm is used, zlib or lzo).
> $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
> -c "pwrite -S 0xbb 4K 8K" \
> -c "pwrite -S 0xcc 12K 4K" \
> $SCRATCH_MNT/foo | _filter_xfs_io
>
> # Now clone our extent into an adjacent offset.
> $CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
> $SCRATCH_MNT/foo $SCRATCH_MNT/foo
>
> # Same as before but for this file we clone the extent into a lower
> # file offset.
> $XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
> -c "pwrite -S 0xbb 12K 8K" \
> -c "pwrite -S 0xcc 20K 4K" \
> $SCRATCH_MNT/bar | _filter_xfs_io
>
> $CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
> $SCRATCH_MNT/bar $SCRATCH_MNT/bar
>
> echo "File digests before unmounting filesystem:"
> md5sum $SCRATCH_MNT/foo | _filter_scratch
> md5sum $SCRATCH_MNT/bar | _filter_scratch
>
> # Evicting the inode or clearing the page cache before reading
> # again the file would also trigger the bug - reads were returning
> # all bytes in the range corresponding to the second reference to
> # the extent with a value of 0, but the correct data was persisted
> # (it was a bug exclusively in the read path). The issue happened
> # only if the same readpages() call targeted pages belonging to the
> # first and second ranges that point to the same compressed extent.
> _scratch_remount
>
> echo "File digests after mounting filesystem again:"
> # Must match the same digests we got before.
> md5sum $SCRATCH_MNT/foo | _filter_scratch
> md5sum $SCRATCH_MNT/bar | _filter_scratch
> }
>
> echo -e "\nTesting with zlib compression..."
> test_clone_and_read_compressed_extent "-o compress=zlib"
>
> _scratch_unmount
>
> echo -e "\nTesting with lzo compression..."
> test_clone_and_read_compressed_extent "-o compress=lzo"
>
> status=0
> exit
Looks good to me.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Thanks,
-liubo
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> fs/btrfs/extent_io.c | 65 +++++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 57 insertions(+), 8 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index fa19f2f..11aa8f7 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2805,7 +2805,8 @@ static int submit_extent_page(int rw, struct extent_io_tree *tree,
> bio_end_io_t end_io_func,
> int mirror_num,
> unsigned long prev_bio_flags,
> - unsigned long bio_flags)
> + unsigned long bio_flags,
> + bool force_bio_submit)
> {
> int ret = 0;
> struct bio *bio;
> @@ -2823,6 +2824,7 @@ static int submit_extent_page(int rw, struct extent_io_tree *tree,
> contig = bio_end_sector(bio) == sector;
>
> if (prev_bio_flags != bio_flags || !contig ||
> + force_bio_submit ||
> merge_bio(rw, tree, page, offset, page_size, bio, bio_flags) ||
> bio_add_page(bio, page, page_size, offset) < page_size) {
> ret = submit_one_bio(rw, bio, mirror_num,
> @@ -2922,7 +2924,8 @@ static int __do_readpage(struct extent_io_tree *tree,
> get_extent_t *get_extent,
> struct extent_map **em_cached,
> struct bio **bio, int mirror_num,
> - unsigned long *bio_flags, int rw)
> + unsigned long *bio_flags, int rw,
> + u64 *prev_em_start)
> {
> struct inode *inode = page->mapping->host;
> u64 start = page_offset(page);
> @@ -2970,6 +2973,7 @@ static int __do_readpage(struct extent_io_tree *tree,
> }
> while (cur <= end) {
> unsigned long pnr = (last_byte >> PAGE_CACHE_SHIFT) + 1;
> + bool force_bio_submit = false;
>
> if (cur >= last_byte) {
> char *userpage;
> @@ -3020,6 +3024,49 @@ static int __do_readpage(struct extent_io_tree *tree,
> block_start = em->block_start;
> if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> block_start = EXTENT_MAP_HOLE;
> +
> + /*
> + * If we have a file range that points to a compressed extent
> + * and it's followed by a consecutive file range that points to
> + * to the same compressed extent (possibly with a different
> + * offset and/or length, so it either points to the whole extent
> + * or only part of it), we must make sure we do not submit a
> + * single bio to populate the pages for the 2 ranges because
> + * this makes the compressed extent read zero out the pages
> + * belonging to the 2nd range. Imagine the following scenario:
> + *
> + * File layout
> + * [0 - 8K] [8K - 24K]
> + * | |
> + * | |
> + * points to extent X, points to extent X,
> + * offset 4K, length of 8K offset 0, length 16K
> + *
> + * [extent X, compressed length = 4K uncompressed length = 16K]
> + *
> + * If the bio to read the compressed extent covers both ranges,
> + * it will decompress extent X into the pages belonging to the
> + * first range and then it will stop, zeroing out the remaining
> + * pages that belong to the other range that points to extent X.
> + * So here we make sure we submit 2 bios, one for the first
> + * range and another one for the third range. Both will target
> + * the same physical extent from disk, but we can't currently
> + * make the compressed bio endio callback populate the pages
> + * for both ranges because each compressed bio is tightly
> + * coupled with a single extent map, and each range can have
> + * an extent map with a different offset value relative to the
> + * uncompressed data of our extent and different lengths. This
> + * is a corner case so we prioritize correctness over
> + * non-optimal behavior (submitting 2 bios for the same extent).
> + */
> + if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags) &&
> + prev_em_start && *prev_em_start != (u64)-1 &&
> + *prev_em_start != em->orig_start)
> + force_bio_submit = true;
> +
> + if (prev_em_start)
> + *prev_em_start = em->orig_start;
> +
> free_extent_map(em);
> em = NULL;
>
> @@ -3069,7 +3116,8 @@ static int __do_readpage(struct extent_io_tree *tree,
> bdev, bio, pnr,
> end_bio_extent_readpage, mirror_num,
> *bio_flags,
> - this_bio_flag);
> + this_bio_flag,
> + force_bio_submit);
> if (!ret) {
> nr++;
> *bio_flags = this_bio_flag;
> @@ -3101,6 +3149,7 @@ static inline void __do_contiguous_readpages(struct extent_io_tree *tree,
> struct inode *inode;
> struct btrfs_ordered_extent *ordered;
> int index;
> + u64 prev_em_start = (u64)-1;
>
> inode = pages[0]->mapping->host;
> while (1) {
> @@ -3116,7 +3165,7 @@ static inline void __do_contiguous_readpages(struct extent_io_tree *tree,
>
> for (index = 0; index < nr_pages; index++) {
> __do_readpage(tree, pages[index], get_extent, em_cached, bio,
> - mirror_num, bio_flags, rw);
> + mirror_num, bio_flags, rw, &prev_em_start);
> page_cache_release(pages[index]);
> }
> }
> @@ -3184,7 +3233,7 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
> }
>
> ret = __do_readpage(tree, page, get_extent, NULL, bio, mirror_num,
> - bio_flags, rw);
> + bio_flags, rw, NULL);
> return ret;
> }
>
> @@ -3210,7 +3259,7 @@ int extent_read_full_page_nolock(struct extent_io_tree *tree, struct page *page,
> int ret;
>
> ret = __do_readpage(tree, page, get_extent, NULL, &bio, mirror_num,
> - &bio_flags, READ);
> + &bio_flags, READ, NULL);
> if (bio)
> ret = submit_one_bio(READ, bio, mirror_num, bio_flags);
> return ret;
> @@ -3463,7 +3512,7 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
> sector, iosize, pg_offset,
> bdev, &epd->bio, max_nr,
> end_bio_extent_writepage,
> - 0, 0, 0);
> + 0, 0, 0, false);
> if (ret)
> SetPageError(page);
> }
> @@ -3765,7 +3814,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
> ret = submit_extent_page(rw, tree, wbc, p, offset >> 9,
> PAGE_CACHE_SIZE, 0, bdev, &epd->bio,
> -1, end_bio_extent_buffer_writepage,
> - 0, epd->bio_flags, bio_flags);
> + 0, epd->bio_flags, bio_flags, false);
> epd->bio_flags = bio_flags;
> if (ret) {
> set_btree_ioerr(p);
> --
> 2.1.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
prev parent reply other threads:[~2015-09-15 8:09 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-09-14 8:29 [PATCH] Btrfs: fix read corruption of compressed and shared extents fdmanana
2015-09-14 9:08 ` Qu Wenruo
2015-09-14 9:22 ` Filipe Manana
2015-09-14 9:34 ` Qu Wenruo
2015-09-14 12:50 ` Chris Mason
2015-09-15 8:09 ` Liu Bo [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150915080914.GA4634@localhost.localdomain \
--to=bo.li.liu@oracle.com \
--cc=fdmanana@kernel.org \
--cc=fdmanana@suse.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).