From: gregkh@linuxfoundation.org
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
stable@vger.kernel.org, Josef Bacik <josef@toxicpanda.com>,
Filipe Manana <fdmanana@suse.com>,
David Sterba <dsterba@suse.com>
Subject: [PATCH 5.11 12/44] btrfs: fix stale data exposure after cloning a hole with NO_HOLES enabled
Date: Mon, 8 Mar 2021 13:34:50 +0100 [thread overview]
Message-ID: <20210308122719.185126994@linuxfoundation.org> (raw)
In-Reply-To: <20210308122718.586629218@linuxfoundation.org>
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
From: Filipe Manana <fdmanana@suse.com>
commit 3660d0bcdb82807d434da9d2e57d88b37331182d upstream.
When using the NO_HOLES feature, if we clone a file range that spans only
a hole into a range that is at or beyond the current i_size of the
destination file, we end up not setting the full sync runtime flag on the
inode. As a result, if we then fsync the destination file and have a power
failure, after log replay we can end up exposing stale data instead of
having a hole for that range.
The conditions for this to happen are the following:
1) We have a file with a size of, for example, 1280K;
2) There is a written (non-prealloc) extent for the file range from 1024K
to 1280K with a length of 256K;
3) This particular file extent layout is durably persisted, so that the
existing superblock persisted on disk points to a subvolume root where
the file has that exact file extent layout and state;
4) The file is truncated to a smaller size, to an offset lower than the
start offset of its last extent, for example to 800K. The truncate sets
the full sync runtime flag on the inode;
6) Fsync the file to log it and clear the full sync runtime flag;
7) Clone a region that covers only a hole (implicit hole due to NO_HOLES)
into the file with a destination offset that starts at or beyond the
256K file extent item we had - for example to offset 1024K;
8) Since the clone operation does not find extents in the source range,
we end up in the if branch at the bottom of btrfs_clone() where we
punch a hole for the file range starting at offset 1024K by calling
btrfs_replace_file_extents(). There we end up not setting the full
sync flag on the inode, because we don't know we are being called in
a clone context (and not fallocate's punch hole operation), and
neither do we create an extent map to represent a hole because the
requested range is beyond eof;
9) A further fsync to the file will be a fast fsync, since the clone
operation did not set the full sync flag, and therefore it relies on
modified extent maps to correctly log the file layout. But since
it does not find any extent map marking the range from 1024K (the
previous eof) to the new eof, it does not log a file extent item
for that range representing the hole;
10) After a power failure no hole for the range starting at 1024K is
punched and we end up exposing stale data from the old 256K extent.
Turning this into exact steps:
$ mkfs.btrfs -f -O no-holes /dev/sdi
$ mount /dev/sdi /mnt
# Create our test file with 3 extents of 256K and a 256K hole at offset
# 256K. The file has a size of 1280K.
$ xfs_io -f -s \
-c "pwrite -S 0xab -b 256K 0 256K" \
-c "pwrite -S 0xcd -b 256K 512K 256K" \
-c "pwrite -S 0xef -b 256K 768K 256K" \
-c "pwrite -S 0x73 -b 256K 1024K 256K" \
/mnt/sdi/foobar
# Make sure it's durably persisted. We want the last committed super
# block to point to this particular file extent layout.
sync
# Now truncate our file to a smaller size, falling within a position of
# the second extent. This sets the full sync runtime flag on the inode.
# Then fsync the file to log it and clear the full sync flag from the
# inode. The third extent is no longer part of the file and therefore
# it is not logged.
$ xfs_io -c "truncate 800K" -c "fsync" /mnt/foobar
# Now do a clone operation that only clones the hole and sets back the
# file size to match the size it had before the truncate operation
# (1280K).
$ xfs_io \
-c "reflink /mnt/foobar 256K 1024K 256K" \
-c "fsync" \
/mnt/foobar
# File data before power failure:
$ od -A d -t x1 /mnt/foobar
0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
*
0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
*
0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
*
0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
1310720
<power fail>
# Mount the fs again to replay the log tree.
$ mount /dev/sdi /mnt
# File data after power failure:
$ od -A d -t x1 /mnt/foobar
0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
*
0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
*
0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
*
0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
1048576 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73
*
1310720
The range from 1024K to 1280K should correspond to a hole but instead it
points to stale data, to the 256K extent that should not exist after the
truncate operation.
The issue does not exists when not using NO_HOLES, because for that case
we use file extent items to represent holes, these are found and copied
during the loop that iterates over extents at btrfs_clone(), and that
causes btrfs_replace_file_extents() to be called with a non-NULL
extent_info argument and therefore set the full sync runtime flag on the
inode.
So fix this by making the code that deals with a trailing hole during
cloning, at btrfs_clone(), to set the full sync flag on the inode, if the
range starts at or beyond the current i_size.
A test case for fstests will follow soon.
Backporting notes: for kernel 5.4 the change goes to ioctl.c into
btrfs_clone before the last call to btrfs_punch_hole_range.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
fs/btrfs/reflink.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -550,6 +550,24 @@ process_slot:
*/
btrfs_release_path(path);
+ /*
+ * When using NO_HOLES and we are cloning a range that covers
+ * only a hole (no extents) into a range beyond the current
+ * i_size, punching a hole in the target range will not create
+ * an extent map defining a hole, because the range starts at or
+ * beyond current i_size. If the file previously had an i_size
+ * greater than the new i_size set by this clone operation, we
+ * need to make sure the next fsync is a full fsync, so that it
+ * detects and logs a hole covering a range from the current
+ * i_size to the new i_size. If the clone range covers extents,
+ * besides a hole, then we know the full sync flag was already
+ * set by previous calls to btrfs_replace_file_extents() that
+ * replaced file extent items.
+ */
+ if (last_dest_end >= i_size_read(inode))
+ set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
+ &BTRFS_I(inode)->runtime_flags);
+
ret = btrfs_replace_file_extents(inode, path, last_dest_end,
destoff + len - 1, NULL, &trans);
if (ret)
next prev parent reply other threads:[~2021-03-08 12:36 UTC|newest]
Thread overview: 49+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-03-08 12:34 [PATCH 5.11 00/44] 5.11.5-rc1 review gregkh
2021-03-08 12:34 ` [PATCH 5.11 01/44] ALSA: hda/realtek: Enable headset mic of Acer SWIFT with ALC256 gregkh
2021-03-08 12:34 ` [PATCH 5.11 02/44] ALSA: usb-audio: use Corsair Virtuoso mapping for Corsair Virtuoso SE gregkh
2021-03-08 12:34 ` [PATCH 5.11 03/44] ALSA: usb-audio: Dont abort even if the clock rate differs gregkh
2021-03-08 12:34 ` [PATCH 5.11 04/44] ALSA: usb-audio: Drop bogus dB range in too low level gregkh
2021-03-08 12:34 ` [PATCH 5.11 05/44] ALSA: usb-audio: Allow modifying parameters with succeeding hw_params calls gregkh
2021-03-08 12:34 ` [PATCH 5.11 06/44] tpm, tpm_tis: Decorate tpm_tis_gen_interrupt() with request_locality() gregkh
2021-03-08 12:34 ` [PATCH 5.11 07/44] tpm, tpm_tis: Decorate tpm_get_timeouts() " gregkh
2021-03-08 12:34 ` [PATCH 5.11 08/44] btrfs: avoid double put of block group when emptying cluster gregkh
2021-03-08 12:34 ` [PATCH 5.11 09/44] btrfs: fix raid6 qstripe kmap gregkh
2021-03-08 12:34 ` [PATCH 5.11 10/44] btrfs: fix race between writes to swap files and scrub gregkh
2021-03-08 12:34 ` [PATCH 5.11 11/44] btrfs: fix race between swap file activation and snapshot creation gregkh
2021-03-08 12:34 ` gregkh [this message]
2021-03-08 12:34 ` [PATCH 5.11 13/44] btrfs: tree-checker: do not error out if extent ref hash doesnt match gregkh
2021-03-08 12:34 ` [PATCH 5.11 14/44] btrfs: fix race between extent freeing/allocation when using bitmaps gregkh
2021-03-08 12:34 ` [PATCH 5.11 15/44] btrfs: validate qgroup inherit for SNAP_CREATE_V2 ioctl gregkh
2021-03-08 12:34 ` [PATCH 5.11 16/44] btrfs: free correct amount of space in btrfs_delayed_inode_reserve_metadata gregkh
2021-03-08 12:34 ` [PATCH 5.11 17/44] btrfs: fix spurious free_space_tree remount warning gregkh
2021-03-08 12:34 ` [PATCH 5.11 18/44] btrfs: unlock extents in btrfs_zero_range in case of quota reservation errors gregkh
2021-03-08 12:34 ` [PATCH 5.11 19/44] btrfs: fix warning when creating a directory with smack enabled gregkh
2021-03-08 12:34 ` [PATCH 5.11 20/44] PM: runtime: Update device status before letting suppliers suspend gregkh
2021-03-08 12:34 ` [PATCH 5.11 21/44] ring-buffer: Force before_stamp and write_stamp to be different on discard gregkh
2021-03-08 12:35 ` [PATCH 5.11 22/44] io_uring: ignore double poll add on the same waitqueue head gregkh
2021-03-08 12:35 ` [PATCH 5.11 23/44] dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size gregkh
2021-03-08 12:35 ` [PATCH 5.11 24/44] dm verity: fix FEC for RS roots unaligned to block size gregkh
2021-03-08 12:35 ` [PATCH 5.11 25/44] drm/amd/pm: correct Arcturus mmTHM_BACO_CNTL register address gregkh
2021-03-08 12:35 ` [PATCH 5.11 26/44] drm/amdgpu:disable VCN for Navi12 SKU gregkh
2021-03-08 12:35 ` [PATCH 5.11 27/44] drm/amdgpu: Only check for S0ix if AMD_PMC is configured gregkh
2021-03-08 12:35 ` [PATCH 5.11 28/44] drm/amdgpu: fix parameter error of RREG32_PCIE() in amdgpu_regs_pcie gregkh
2021-03-08 12:35 ` [PATCH 5.11 29/44] crypto - shash: reduce minimum alignment of shash_desc structure gregkh
2021-03-08 12:35 ` [PATCH 5.11 30/44] ALSA: ctxfi: cthw20k2: fix mask on conf to allow 4 bits gregkh
2021-03-08 12:35 ` [PATCH 5.11 31/44] ALSA: usb-audio: Fix Pioneer DJM devices URB_CONTROL request direction to set samplerate gregkh
2021-03-08 12:35 ` [PATCH 5.11 32/44] RDMA/cm: Fix IRQ restore in ib_send_cm_sidr_rep gregkh
2021-03-08 12:35 ` [PATCH 5.11 33/44] RDMA/rxe: Fix missing kconfig dependency on CRYPTO gregkh
2021-03-08 12:35 ` [PATCH 5.11 34/44] IB/mlx5: Add missing error code gregkh
2021-03-08 12:35 ` [PATCH 5.11 35/44] ALSA: hda: intel-nhlt: verify config type gregkh
2021-03-08 12:35 ` [PATCH 5.11 36/44] ftrace: Have recordmcount use w8 to read relp->r_info in arm64_is_fake_mcount gregkh
2021-03-08 12:35 ` [PATCH 5.11 37/44] ia64: dont call handle_signal() unless theres actually a signal queued gregkh
2021-03-08 12:35 ` [PATCH 5.11 38/44] rsxx: Return -EFAULT if copy_to_user() fails gregkh
2021-03-08 12:35 ` [PATCH 5.11 39/44] iommu/tegra-smmu: Fix mc errors on tegra124-nyan gregkh
2021-03-08 12:35 ` [PATCH 5.11 40/44] iommu: Dont use lazy flush for untrusted device gregkh
2021-03-08 12:35 ` [PATCH 5.11 41/44] iommu/vt-d: Fix status code for Allocate/Free PASID command gregkh
2021-03-08 12:35 ` [PATCH 5.11 42/44] btrfs: zoned: use sector_t for zone sectors gregkh
2021-03-08 12:35 ` [PATCH 5.11 43/44] tomoyo: recognize kernel threads correctly gregkh
2021-03-08 12:35 ` [PATCH 5.11 44/44] r8169: fix resuming from suspend on RTL8105e if machine runs on battery gregkh
2021-03-08 22:29 ` [PATCH 5.11 00/44] 5.11.5-rc1 review Guenter Roeck
2021-03-09 10:26 ` Greg KH
2021-03-09 4:22 ` Naresh Kamboju
2021-03-09 10:26 ` Greg Kroah-Hartman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210308122719.185126994@linuxfoundation.org \
--to=gregkh@linuxfoundation.org \
--cc=dsterba@suse.com \
--cc=fdmanana@suse.com \
--cc=josef@toxicpanda.com \
--cc=linux-kernel@vger.kernel.org \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox