From: Josef Bacik <josef@toxicpanda.com>
To: fdmanana@kernel.org, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] btrfs: fix stale data exposure after cloning a hole with NO_HOLES enabled
Date: Tue, 16 Feb 2021 10:48:49 -0500 [thread overview]
Message-ID: <40803a05-1f71-b648-4b59-dade906e48cf@toxicpanda.com> (raw)
In-Reply-To: <07067d184eb90be19874190df45cc83f06186307.1613473473.git.fdmanana@suse.com>
On 2/16/21 6:09 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> When using the NO_HOLES feature, if we clone a file range that spans only
> a hole into a range that is at or beyond the current i_size of the
> destination file, we end up not setting the full sync runtime flag on the
> inode. As a result, if we then fsync the destination file and have a power
> failure, after log replay we can end up exposing stale data instead of
> having a hole for that range.
>
> The conditions for this to happen are the following:
>
> 1) We have a file with a size of, for example, 1280K;
>
> 2) There is a written (non-prealloc) extent for the file range from 1024K
> to 1280K with a length of 256K;
>
> 3) This particular file extent layout is durably persisted, so that the
> existing superblock persisted on disk points to a subvolume root where
> the file has that exact file extent layout and state;
>
> 4) The file is truncated to a smaller size, to an offset lower than the
> start offset of its last extent, for example to 800K. The truncate sets
> the full sync runtime flag on the inode;
>
> 6) Fsync the file to log it and clear the full sync runtime flag;
>
> 7) Clone a region that covers only a hole (implicit hole due to NO_HOLES)
> into the file with a destination offset that starts at or beyond the
> 256K file extent item we had - for example to offset 1024K;
>
> 8) Since the clone operation does not find extents in the source range,
> we end up in the if branch at the bottom of btrfs_clone() where we
> punch a hole for the file range starting at offset 1024K by calling
> btrfs_replace_file_extents(). There we end up not setting the full
> sync flag on the inode, because we don't know we are being called in
> a clone context (and not fallocate's punch hole operation), and
> neither do we create an extent map to represent a hole because the
> requested range is beyond eof;
>
> 9) A further fsync to the file will be a fast fsync, since the clone
> operation did not set the full sync flag, and therefore it relies on
> modified extent maps to correctly log the file layout. But since
> it does not find any extent map marking the range from 1024K (the
> previous eof) to the new eof, it does not log a file extent item
> for that range representing the hole;
>
> 10) After a power failure no hole for the range starting at 1024K is
> punched and we end up exposing stale data from the old 256K extent.
>
> Turning this into exact steps:
>
> $ mkfs.btrfs -f -O no-holes /dev/sdi
> $ mount /dev/sdi /mnt
>
> # Create our test file with 3 extents of 256K and a 256K hole at offset
> # 256K. The file has a size of 1280K.
> $ xfs_io -f -s \
> -c "pwrite -S 0xab -b 256K 0 256K" \
> -c "pwrite -S 0xcd -b 256K 512K 256K" \
> -c "pwrite -S 0xef -b 256K 768K 256K" \
> -c "pwrite -S 0x73 -b 256K 1024K 256K" \
> /mnt/sdi/foobar
>
> # Make sure it's durably persisted. We want the last committed super
> # block to point to this particular file extent layout.
> sync
>
> # Now truncate our file to a smaller size, falling within a position of
> # the second extent. This sets the full sync runtime flag on the inode.
> # Then fsync the file to log it and clear the full sync flag from the
> # inode. The third extent is no longer part of the file and therefore
> # it is not logged.
> $ xfs_io -c "truncate 800K" -c "fsync" /mnt/foobar
>
> # Now do a clone operation that only clones the hole and sets back the
> # file size to match the size it had before the truncate operation
> # (1280K).
> $ xfs_io \
> -c "reflink /mnt/foobar 256K 1024K 256K" \
> -c "fsync" \
> /mnt/foobar
>
> # File data before power failure:
> $ od -A d -t x1 /mnt/foobar
> 0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
> *
> 0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> *
> 0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
> *
> 0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
> *
> 0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> *
> 1310720
>
> <power fail>
>
> # Mount the fs again to replay the log tree.
> $ mount /dev/sdi /mnt
>
> # File data after power failure:
> $ od -A d -t x1 /mnt/foobar
> 0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
> *
> 0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> *
> 0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
> *
> 0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
> *
> 0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> *
> 1048576 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73
> *
> 1310720
>
> The range from 1024K to 1280K should correspond to a hole but instead it
> points to stale data, to the 256K extent that should not exist after the
> truncate operation.
>
> The issue does not exists when not using NO_HOLES, because for that case
> we use file extent items to represent holes, these are found and copied
> during the loop that iterates over extents at btrfs_clone(), and that
> causes btrfs_replace_file_extents() to be called with a non-NULL
> extent_info argument and therefore set the full sync runtime flag on the
> inode.
>
> So fix this by making the code that deals with a trailing hole during
> cloning, at btrfs_clone(), to set the full sync flag on the inode, if the
> range starts at or beyond the current i_size.
>
> A test case for fstests will follow soon.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Thanks,
Josef
next prev parent reply other threads:[~2021-02-16 15:49 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-02-16 11:09 [PATCH] btrfs: fix stale data exposure after cloning a hole with NO_HOLES enabled fdmanana
2021-02-16 15:48 ` Josef Bacik [this message]
2021-02-17 17:24 ` David Sterba
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=40803a05-1f71-b648-4b59-dade906e48cf@toxicpanda.com \
--to=josef@toxicpanda.com \
--cc=fdmanana@kernel.org \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox