From: Mike Fleetwood <mike.fleetwood@googlemail.com>
To: fdmanana@kernel.org
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>,
Filipe Manana <fdmanana@suse.com>
Subject: Re: [PATCH] Btrfs: fix file read corruption after extent cloning and fsync
Date: Wed, 19 Aug 2015 13:26:24 +0100 [thread overview]
Message-ID: <CAMU1PDjctOgrd3P2h2XDmKLQpM5XmSyZGqeDz8sqaZaOBg2JGw@mail.gmail.com> (raw)
In-Reply-To: <1439979115-14122-1-git-send-email-fdmanana@kernel.org>
On 19 August 2015 at 11:11, <fdmanana@kernel.org> wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> If we partially clone one extent of a file into a lower offset of the
> file, fsync the file, power fail and then mount the fs to trigger log
> replay, we can get multiple checksum items in the csum tree that overlap
> each other and result in checksum lookup failures later. Those failures
> can make file data read requests assume a checksum value of 0, but they
> will not return an error (-EIO for example) to userspace exactly because
> the expected checksum value 0 is a special value that makes the read bio
> endio callback return success and set all the bytes of the corresponding
> page with the value 0x01 (at fs/btrfs/inode.c:__readpage_endio_check()).
> From a userspace perspective this is equivalent to file corruption
> because we are not returning what was written to the file.
>
> Details about how this can happen, and why, are included inline in the
> following reproducer test case for fstests and the comment added to
> tree-log.c.
>
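For anyone following along, the overlap failure mode can be modeled without btrfs at all. Below is a toy Python sketch (names and structure are mine, a deliberate simplification: lengths are the covered disk-byte ranges rather than csum-item byte sizes, and the search mimics a btree lookup that stops at the item with the greatest start key <= the target bytenr):

```python
import bisect

def lookup_csum(items, bytenr):
    """items: list of (start, length) ranges, sorted by start.

    Find the item with the greatest start <= bytenr; succeed only if
    bytenr actually falls inside that item's range -- roughly how a
    btree key search behaves.
    """
    starts = [s for s, _ in items]
    i = bisect.bisect_right(starts, bytenr) - 1
    if i < 0:
        return None
    start, length = items[i]
    if start <= bytenr < start + length:
        return items[i]
    return None

# After the buggy replay: one item covering the whole 100K extent at
# disk byte 12845056, plus an overlapping item for its [20K, 40K[
# sub-range at 12865536.
items = [(12845056, 102400), (12865536, 20480)]

# A block at extent offset 40K: the search stops at the smaller item,
# whose range ends at 40K, so the lookup misses entirely.
assert lookup_csum(items, 12845056 + 40960) is None

# Without the overlapping item, the same lookup finds the full item.
assert lookup_csum([(12845056, 102400)], 12845056 + 40960) == (12845056, 102400)
```

The miss is what makes reads fall back to the special expected-csum value of 0 described above.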
> seq=`basename $0`
> seqres=$RESULT_DIR/$seq
> echo "QA output created by $seq"
> tmp=/tmp/$$
> status=1 # failure is the default!
> trap "_cleanup; exit \$status" 0 1 2 3 15
>
> _cleanup()
> {
> _cleanup_flakey
> rm -f $tmp.*
> }
>
> # get standard environment, filters and checks
> . ./common/rc
> . ./common/filter
> . ./common/dmflakey
>
> # real QA test starts here
> _need_to_be_root
> _supported_fs btrfs
> _supported_os Linux
> _require_scratch
> _require_dm_flakey
> _require_cloner
> _require_metadata_journaling $SCRATCH_DEV
>
> rm -f $seqres.full
>
> _scratch_mkfs >>$seqres.full 2>&1
> _init_flakey
> _mount_flakey
>
> # Create our test file with a single 100K extent starting at file
> # offset 800K. We fsync the file here so that the fsync log tree gets
> # a single csum item that covers the whole 100K extent, which causes
> # the second fsync, done after the cloning operation below, to not
> # leave in the log tree two csum items covering two sub-ranges
> # ([0, 20K[ and [20K, 100K[) of our extent.
> $XFS_IO_PROG -f -c "pwrite -S 0xaa 800K 100K" \
> -c "fsync" \
> $SCRATCH_MNT/foo | _filter_xfs_io
>
> # Now clone part of our extent into file offset 400K. This adds a file
> # extent item to our inode's metadata that points to the 100K extent
> # we created before, using a data offset of 20K and a data length of
> # 20K, so that it refers to the sub-range [20K, 40K[ of our original
> # extent.
> $CLONER_PROG -s $((800 * 1024 + 20 * 1024)) -d $((400 * 1024)) \
> -l $((20 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo
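The clone arithmetic in the comment checks out; a quick sanity check (plain arithmetic, nothing btrfs-specific, variable names are mine):

```python
K = 1024
extent_file_offset = 800 * K     # where the 100K extent was written
clone_src = 800 * K + 20 * K     # the -s argument to the cloner
clone_len = 20 * K               # the -l argument

# The source range starts 20K into the extent backing offset 800K, so
# the new file extent item at 400K gets a data offset of 20K and a
# data length of 20K: the [20K, 40K[ sub-range of the original extent.
data_offset = clone_src - extent_file_offset
assert data_offset == 20 * K
assert (data_offset, data_offset + clone_len) == (20 * K, 40 * K)
```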
>
> # Now fsync our file to make sure the extent cloning is durably
> # persisted. This fsync will not add a second csum item to the log
> # tree containing the checksums for the blocks in the sub-range
> # [20K, 40K[ of our extent, because there was already a csum item in
> # the log tree covering the whole extent, added by the first fsync
> # we did before.
> $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
>
> echo "File digest before power failure:"
> md5sum $SCRATCH_MNT/foo | _filter_scratch
>
> # Silently drop all writes and unmount to simulate a crash/power
> # failure.
> _load_flakey_table $FLAKEY_DROP_WRITES
> _unmount_flakey
>
> # Allow writes again, mount to trigger log replay and validate file
> # contents.
> # The fsync log replay first processes the file extent item
> # corresponding to the file offset 400K (the one which refers to the
> # [20K, 40K[ sub-range of our 100K extent) and then processes the file
> # extent item for file offset 800K. It used to happen that when
> # processing the latter, it erroneously left in the csum tree 2 csum
> # items that overlapped each other, 1 for the sub-range [20K, 40K[ and
> # 1 for the whole range of our extent. This introduced a problem where
> # subsequent lookups for the checksums of blocks within the range
> # [40K, 100K[ of our extent would not find anything because lookups in
> # the csum tree ended up looking only at the smaller csum item, the
> # one covering the sub-range [20K, 40K[. This made read requests assume
> # an expected checksum with a value of 0 for those blocks, which caused
> # checksum verification failure when the read operations finished.
> # However those checksum failures did not result in read requests
> # returning an error to user space (-EIO, for example) because the
> # expected checksum value had the special value 0, and in that case
> # btrfs sets all bytes of the corresponding pages to the value 0x01
> # and produces the following warning in dmesg/syslog:
> #
> # "BTRFS warning (device dm-0): csum failed ino 257 off 917504 csum\
> # 1322675045 expected csum 0"
> #
> _load_flakey_table $FLAKEY_ALLOW_WRITES
> _mount_flakey
>
> echo "File digest after log replay:"
> # Must match the digest we had after cloning the extent and
> # before the power failure happened.
> md5sum $SCRATCH_MNT/foo | _filter_scratch
>
> _unmount_flakey
>
> status=0
> exit
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> fs/btrfs/tree-log.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 54 insertions(+)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 9314ade..beb2c0a 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -731,12 +731,66 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
> &ordered_sums, 0);
> if (ret)
> goto out;
> + /*
> + * Now delete all existing csums in the csum root that
> + * cover our range. We do this because because we can
Minor grammatical correction. Remove second "because".
> + * have an extent that is completely referenced by one
> + * file extent item and partially referenced by another
> + * file extent item (like after using the clone or
> + * extent_same ioctls). In this case if we end up doing
> + * the replay of the one that partially references the
> + * extent first, and we do not do the csum deletion
> + * below, we can get 2 csum items in the csum tree that
> + * overlap each other. For example, imagine our log has
> + * the two following file extent items:
> + *
> + * key (257 EXTENT_DATA 409600)
> + * extent data disk byte 12845056 nr 102400
> + * extent data offset 20480 nr 20480 ram 102400
> + *
> + * key (257 EXTENT_DATA 819200)
> + * extent data disk byte 12845056 nr 102400
> + * extent data offset 0 nr 102400 ram 102400
> + *
> + * Where the second one fully references the 100K extent
> + * that starts at disk byte 12845056, and the log tree
> + * has a single csum item that covers the entire range
> + * of the extent:
> + *
> + * key (EXTENT_CSUM EXTENT_CSUM 12845056) itemsize 100
> + *
> + * After the first file extent item is replayed, the
> + * csum tree gets the following csum item:
> + *
> + * key (EXTENT_CSUM EXTENT_CSUM 12865536) itemsize 20
> + *
> + * Which covers the 20K sub-range starting at offset 20K
> + * of our extent. Now when we replay the second file
> + * extent item, if we do not delete existing csum items
> + * that cover any of its blocks, we end up getting two
> + * csum items in our csum tree that overlap each other:
> + *
> + * key (EXTENT_CSUM EXTENT_CSUM 12845056) itemsize 100
> + * key (EXTENT_CSUM EXTENT_CSUM 12865536) itemsize 20
> + *
> + * Which is a problem, because after this anyone trying
> + * to look up the checksum of any block of our
> + * extent starting at an offset of 40K or higher, will
> + * end up looking at the second csum item only, which
> + * does not contain the checksum for any block starting
> + * at offset 40K or higher of our extent.
> + */
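The numbers in the comment are internally consistent, by the way. They work out if one assumes 4K blocks and 4-byte crc32c checksums (the defaults at the time; the constants below are my assumption, not stated in the patch):

```python
BLOCK = 4096      # assumed block size
CSUM_SIZE = 4     # assumed crc32c checksum size in bytes

extent_start = 12845056

# The sub-range csum item's key is the extent start plus 20K.
assert extent_start + 20480 == 12865536

# itemsize = one 4-byte csum per 4K block of the covered range:
assert (102400 // BLOCK) * CSUM_SIZE == 100   # full 100K extent -> itemsize 100
assert (20480 // BLOCK) * CSUM_SIZE == 20     # 20K sub-range    -> itemsize 20
```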
> while (!list_empty(&ordered_sums)) {
> struct btrfs_ordered_sum *sums;
> sums = list_entry(ordered_sums.next,
> struct btrfs_ordered_sum,
> list);
> if (!ret)
> + ret = btrfs_del_csums(trans,
> + root->fs_info->csum_root,
> + sums->bytenr,
> + sums->len);
> + if (!ret)
> ret = btrfs_csum_file_blocks(trans,
> root->fs_info->csum_root,
> sums);
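The delete-before-insert ordering is the heart of the fix. A toy illustration (my own simplification: real btrfs_del_csums() also splits items that only partially overlap the range, which this sketch ignores):

```python
def del_csums(items, start, length):
    """Drop every (start, length) item overlapping [start, start + length)."""
    return [(s, l) for (s, l) in items
            if s + l <= start or s >= start + length]

def insert_csum(items, start, length):
    return sorted(items + [(start, length)])

# Replaying the 800K file extent item: the log's csum item covers the
# whole 100K extent.  The 400K replay already left the sub-range item.
tree = [(12865536, 20480)]
tree = del_csums(tree, 12845056, 102400)     # the fix: clear the range first
tree = insert_csum(tree, 12845056, 102400)
assert tree == [(12845056, 102400)]          # one item, no overlap
```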
> --
> 2.1.3
Thanks,
Mike
Thread overview: 3+ messages
2015-08-19 10:11 [PATCH] Btrfs: fix file read corruption after extent cloning and fsync fdmanana
2015-08-19 11:36 ` Liu Bo
2015-08-19 12:26 ` Mike Fleetwood [this message]