From: Mike Fleetwood <mike.fleetwood@googlemail.com>
To: fdmanana@kernel.org
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>,
Filipe Manana <fdmanana@suse.com>
Subject: Re: [PATCH] Btrfs: fix file read corruption after extent cloning and fsync
Date: Wed, 19 Aug 2015 13:26:24 +0100 [thread overview]
Message-ID: <CAMU1PDjctOgrd3P2h2XDmKLQpM5XmSyZGqeDz8sqaZaOBg2JGw@mail.gmail.com> (raw)
In-Reply-To: <1439979115-14122-1-git-send-email-fdmanana@kernel.org>
On 19 August 2015 at 11:11, <fdmanana@kernel.org> wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> If we partially clone one extent of a file into a lower offset of the
> file, fsync the file, power fail and then mount the fs to trigger log
> replay, we can get multiple checksum items in the csum tree that overlap
> each other and result in checksum lookup failures later. Those failures
> can make file data read requests assume a checksum value of 0, but they
> will not return an error (-EIO for example) to userspace exactly because
> the expected checksum value 0 is a special value that makes the read bio
> endio callback return success and set all the bytes of the corresponding
> page with the value 0x01 (at fs/btrfs/inode.c:__readpage_endio_check()).
> From a userspace perspective this is equivalent to file corruption
> because we are not returning what was written to the file.
>
> Details about how this can happen, and why, are included inline in the
> following reproducer test case for fstests and the comment added to
> tree-log.c.
>
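For anyone following along, the overlap failure mode can be modeled without btrfs at all. Below is a toy Python sketch (names and structure are mine, a deliberate simplification: lengths are the covered disk-byte ranges rather than csum-item byte sizes, and the search mimics a btree lookup that stops at the item with the greatest start key <= the target bytenr):

```python
import bisect

def lookup_csum(items, bytenr):
    """items: list of (start, length) ranges, sorted by start.

    Find the item with the greatest start <= bytenr; succeed only if
    bytenr actually falls inside that item's range -- roughly how a
    btree key search behaves.
    """
    starts = [s for s, _ in items]
    i = bisect.bisect_right(starts, bytenr) - 1
    if i < 0:
        return None
    start, length = items[i]
    if start <= bytenr < start + length:
        return items[i]
    return None

# After the buggy replay: one item covering the whole 100K extent at
# disk byte 12845056, plus an overlapping item for its [20K, 40K[
# sub-range at 12865536.
items = [(12845056, 102400), (12865536, 20480)]

# A block at extent offset 40K: the search stops at the smaller item,
# whose range ends at 40K, so the lookup misses entirely.
assert lookup_csum(items, 12845056 + 40960) is None

# Without the overlapping item, the same lookup finds the full item.
assert lookup_csum([(12845056, 102400)], 12845056 + 40960) == (12845056, 102400)
```

The miss is what makes reads fall back to the special expected-csum value of 0 described above.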
> seq=`basename $0`
> seqres=$RESULT_DIR/$seq
> echo "QA output created by $seq"
> tmp=/tmp/$$
> status=1 # failure is the default!
> trap "_cleanup; exit \$status" 0 1 2 3 15
>
> _cleanup()
> {
> _cleanup_flakey
> rm -f $tmp.*
> }
>
> # get standard environment, filters and checks
> . ./common/rc
> . ./common/filter
> . ./common/dmflakey
>
> # real QA test starts here
> _need_to_be_root
> _supported_fs btrfs
> _supported_os Linux
> _require_scratch
> _require_dm_flakey
> _require_cloner
> _require_metadata_journaling $SCRATCH_DEV
>
> rm -f $seqres.full
>
> _scratch_mkfs >>$seqres.full 2>&1
> _init_flakey
> _mount_flakey
>
> # Create our test file with a single 100K extent starting at file
> # offset 800K. We fsync the file here so that the fsync log tree gets
> # a single csum item that covers the whole 100K extent, which causes
> # the second fsync, done after the cloning operation below, to not
> # leave in the log tree two csum items covering two sub-ranges
> # ([0, 20K[ and [20K, 100K[) of our extent.
> $XFS_IO_PROG -f -c "pwrite -S 0xaa 800K 100K" \
> -c "fsync" \
> $SCRATCH_MNT/foo | _filter_xfs_io
>
> # Now clone part of our extent into file offset 400K. This adds a file
> # extent item to our inode's metadata that points to the 100K extent
> # we created before, using a data offset of 20K and a data length of
> # 20K, so that it refers to the sub-range [20K, 40K[ of our original
> # extent.
> $CLONER_PROG -s $((800 * 1024 + 20 * 1024)) -d $((400 * 1024)) \
> -l $((20 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo
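The clone arithmetic in the comment checks out; a quick sanity check (plain arithmetic, nothing btrfs-specific, variable names are mine):

```python
K = 1024
extent_file_offset = 800 * K     # where the 100K extent was written
clone_src = 800 * K + 20 * K     # the -s argument to the cloner
clone_len = 20 * K               # the -l argument

# The source range starts 20K into the extent backing offset 800K, so
# the new file extent item at 400K gets a data offset of 20K and a
# data length of 20K: the [20K, 40K[ sub-range of the original extent.
data_offset = clone_src - extent_file_offset
assert data_offset == 20 * K
assert (data_offset, data_offset + clone_len) == (20 * K, 40 * K)
```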
>
> # Now fsync our file to make sure the extent cloning is durably
> # persisted. This fsync will not add a second csum item to the log
> # tree containing the checksums for the blocks in the sub-range
> # [20K, 40K[ of our extent, because there was already a csum item in
> # the log tree covering the whole extent, added by the first fsync
> # we did before.
> $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
>
> echo "File digest before power failure:"
> md5sum $SCRATCH_MNT/foo | _filter_scratch
>
> # Silently drop all writes and unmount to simulate a crash/power
> # failure.
> _load_flakey_table $FLAKEY_DROP_WRITES
> _unmount_flakey
>
> # Allow writes again, mount to trigger log replay and validate file
> # contents.
> # The fsync log replay first processes the file extent item
> # corresponding to the file offset 400K (the one which refers to the
> # [20K, 40K[ sub-range of our 100K extent) and then processes the file
> # extent item for file offset 800K. It used to happen that when
> # processing the latter, it erroneously left in the csum tree 2 csum
> # items that overlapped each other, 1 for the sub-range [20K, 40K[ and
> # 1 for the whole range of our extent. This introduced a problem where
> # subsequent lookups for the checksums of blocks within the range
> # [40K, 100K[ of our extent would not find anything because lookups in
> # the csum tree ended up looking only at the smaller csum item, the
> # one covering the sub-range [20K, 40K[. This made read requests assume
> # an expected checksum with a value of 0 for those blocks, which caused
> # checksum verification failure when the read operations finished.
> # However those checksum failures did not result in read requests
> # returning an error to user space (-EIO, for example) because the
> # expected checksum value had the special value 0, and in that case
> # btrfs sets all bytes of the corresponding pages to the value 0x01
> # and produces the following warning in dmesg/syslog:
> #
> # "BTRFS warning (device dm-0): csum failed ino 257 off 917504 csum\
> # 1322675045 expected csum 0"
> #
> _load_flakey_table $FLAKEY_ALLOW_WRITES
> _mount_flakey
>
> echo "File digest after log replay:"
> # Must match the digest we had after cloning the extent and
> # before the power failure happened.
> md5sum $SCRATCH_MNT/foo | _filter_scratch
>
> _unmount_flakey
>
> status=0
> exit
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> fs/btrfs/tree-log.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 54 insertions(+)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 9314ade..beb2c0a 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -731,12 +731,66 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
> &ordered_sums, 0);
> if (ret)
> goto out;
> + /*
> + * Now delete all existing csums in the csum root that
> + * cover our range. We do this because because we can
Minor grammatical correction. Remove second "because".
> + * have an extent that is completely referenced by one
> + * file extent item and partially referenced by another
> + * file extent item (like after using the clone or
> + * extent_same ioctls). In this case if we end up doing
> + * the replay of the one that partially references the
> + * extent first, and we do not do the csum deletion
> + * below, we can get 2 csum items in the csum tree that
> + * overlap each other. For example, imagine our log has
> + * the two following file extent items:
> + *
> + * key (257 EXTENT_DATA 409600)
> + * extent data disk byte 12845056 nr 102400
> + * extent data offset 20480 nr 20480 ram 102400
> + *
> + * key (257 EXTENT_DATA 819200)
> + * extent data disk byte 12845056 nr 102400
> + * extent data offset 0 nr 102400 ram 102400
> + *
> + * Where the second one fully references the 100K extent
> + * that starts at disk byte 12845056, and the log tree
> + * has a single csum item that covers the entire range
> + * of the extent:
> + *
> + * key (EXTENT_CSUM EXTENT_CSUM 12845056) itemsize 100
> + *
> + * After the first file extent item is replayed, the
> + * csum tree gets the following csum item:
> + *
> + * key (EXTENT_CSUM EXTENT_CSUM 12865536) itemsize 20
> + *
> + * Which covers the 20K sub-range starting at offset 20K
> + * of our extent. Now when we replay the second file
> + * extent item, if we do not delete existing csum items
> + * that cover any of its blocks, we end up getting two
> + * csum items in our csum tree that overlap each other:
> + *
> + * key (EXTENT_CSUM EXTENT_CSUM 12845056) itemsize 100
> + * key (EXTENT_CSUM EXTENT_CSUM 12865536) itemsize 20
> + *
> + * Which is a problem, because after this anyone trying
> + * to look up the checksum of any block of our
> + * extent starting at an offset of 40K or higher, will
> + * end up looking at the second csum item only, which
> + * does not contain the checksum for any block starting
> + * at offset 40K or higher of our extent.
> + */
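The numbers in the comment are internally consistent, by the way. They work out if one assumes 4K blocks and 4-byte crc32c checksums (the defaults at the time; the constants below are my assumption, not stated in the patch):

```python
BLOCK = 4096      # assumed block size
CSUM_SIZE = 4     # assumed crc32c checksum size in bytes

extent_start = 12845056

# The sub-range csum item's key is the extent start plus 20K.
assert extent_start + 20480 == 12865536

# itemsize = one 4-byte csum per 4K block of the covered range:
assert (102400 // BLOCK) * CSUM_SIZE == 100   # full 100K extent -> itemsize 100
assert (20480 // BLOCK) * CSUM_SIZE == 20     # 20K sub-range    -> itemsize 20
```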
> while (!list_empty(&ordered_sums)) {
> struct btrfs_ordered_sum *sums;
> sums = list_entry(ordered_sums.next,
> struct btrfs_ordered_sum,
> list);
> if (!ret)
> + ret = btrfs_del_csums(trans,
> + root->fs_info->csum_root,
> + sums->bytenr,
> + sums->len);
> + if (!ret)
> ret = btrfs_csum_file_blocks(trans,
> root->fs_info->csum_root,
> sums);
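The delete-before-insert ordering is the heart of the fix. A toy illustration (my own simplification: real btrfs_del_csums() also splits items that only partially overlap the range, which this sketch ignores):

```python
def del_csums(items, start, length):
    """Drop every (start, length) item overlapping [start, start + length)."""
    return [(s, l) for (s, l) in items
            if s + l <= start or s >= start + length]

def insert_csum(items, start, length):
    return sorted(items + [(start, length)])

# Replaying the 800K file extent item: the log's csum item covers the
# whole 100K extent.  The 400K replay already left the sub-range item.
tree = [(12865536, 20480)]
tree = del_csums(tree, 12845056, 102400)     # the fix: clear the range first
tree = insert_csum(tree, 12845056, 102400)
assert tree == [(12845056, 102400)]          # one item, no overlap
```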
> --
> 2.1.3
Thanks,
Mike
Thread overview: 3+ messages
2015-08-19 10:11 [PATCH] Btrfs: fix file read corruption after extent cloning and fsync fdmanana
2015-08-19 11:36 ` Liu Bo
2015-08-19 12:26 ` Mike Fleetwood [this message]