stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Filipe Manana <fdmanana@suse.com>,
	David Sterba <dsterba@suse.com>,
	Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Subject: [PATCH 4.14 13/21] Btrfs: fix file data corruption after cloning a range and fsync
Date: Tue,  7 Aug 2018 20:51:27 +0200	[thread overview]
Message-ID: <20180807172331.570202527@linuxfoundation.org> (raw)
In-Reply-To: <20180807172330.953888701@linuxfoundation.org>

4.14-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Filipe Manana <fdmanana@suse.com>

commit bd3599a0e142cd73edd3b6801068ac3f48ac771a upstream.

When we clone a range into a file we can end up dropping existing
extent maps (or trimming them) and replacing them with new ones if the
range to be cloned overlaps with a range in the destination inode.
When that happens we add the new extent maps to the list of modified
extents in the inode's extent map tree, so that a "fast" fsync (the flag
BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
and log corresponding extent items. However, at the end of range cloning
operation we do truncate all the pages in the affected range (in order to
ensure future reads will not get stale data). Sometimes this truncation
will release the corresponding extent maps besides the pages from the page
cache. If this happens, then a "fast" fsync operation will miss logging
some extent items, because it relies exclusively on the extent maps being
present in the inode's extent tree, leading to data loss/corruption if
the fsync ends up using the same transaction used by the clone operation
(that transaction was not committed in the meanwhile). An extent map is
released through the callback btrfs_invalidatepage(), which gets called by
truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
later ends up calling try_release_extent_mapping() which will release the
extent map if some conditions are met, like the file size being greater
than 16Mb, gfp flags allow blocking and the range not being locked (which
is the case during the clone operation) nor being the extent map flagged
as pinned (also the case for cloning).

The following example, turned into a test for fstests, reproduces the
issue:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
  $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar

  $ xfs_io -c "fsync" /mnt/bar
  # reflink destination offset corresponds to the size of file bar,
  # 2728Kb minus 4Kb.
  $ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar
  $ xfs_io -c "fsync" /mnt/bar

  $ md5sum /mnt/bar
  95a95813a8c2abc9aa75a6c2914a077e  /mnt/bar

  <power fail>

  $ mount /dev/sdb /mnt
  $ md5sum /mnt/bar
  207fd8d0b161be8a84b945f0df8d5f8d  /mnt/bar
  # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
  # power failure

In the above example, the destination offset of the clone operation
corresponds to the size of the "bar" file minus 4Kb. So during the clone
operation, the extent map covering the range from 2572Kb to 2728Kb gets
trimmed so that it ends at offset 2724Kb, and a new extent map covering
the range from 2724Kb to 11724Kb is created. So at the end of the clone
operation when we ask to truncate the pages in the range from 2724Kb to
2724Kb + 15908Kb, the page invalidation callback ends up removing the new
extent map (through try_release_extent_mapping()) when the page at offset
2724Kb is passed to that callback.

Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
map is removed at try_release_extent_mapping(), forcing the next fsync to
search for modified extents in the fs/subvolume tree instead of relying on
the presence of extent maps in memory. This way we can continue doing a
"fast" fsync if the destination range of a clone operation does not
overlap with an existing range or if any of the criteria necessary to
remove an extent map at try_release_extent_mapping() is not met (file
size not bigger then 16Mb or gfp flags do not allow blocking).

CC: stable@vger.kernel.org # 3.16+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/btrfs/extent_io.c |    3 +++
 1 file changed, 3 insertions(+)

--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4280,6 +4280,7 @@ int try_release_extent_mapping(struct ex
 	struct extent_map *em;
 	u64 start = page_offset(page);
 	u64 end = start + PAGE_SIZE - 1;
+	struct btrfs_inode *btrfs_inode = BTRFS_I(page->mapping->host);
 
 	if (gfpflags_allow_blocking(mask) &&
 	    page->mapping->host->i_size > SZ_16M) {
@@ -4302,6 +4303,8 @@ int try_release_extent_mapping(struct ex
 					    extent_map_end(em) - 1,
 					    EXTENT_LOCKED | EXTENT_WRITEBACK,
 					    0, NULL)) {
+				set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
+					&btrfs_inode->runtime_flags);
 				remove_extent_mapping(map, em);
 				/* once for the rb tree */
 				free_extent_map(em);

  parent reply	other threads:[~2018-08-07 21:10 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-08-07 18:51 [PATCH 4.14 00/21] 4.14.62-stable review Greg Kroah-Hartman
2018-08-07 18:51 ` [PATCH 4.14 01/21] scsi: qla2xxx: Fix unintialized List head crash Greg Kroah-Hartman
2018-08-07 18:51 ` [PATCH 4.14 02/21] scsi: qla2xxx: Fix NPIV deletion by calling wait_for_sess_deletion Greg Kroah-Hartman
2018-08-07 18:51 ` [PATCH 4.14 03/21] scsi: qla2xxx: Fix ISP recovery on unload Greg Kroah-Hartman
2018-08-07 18:51 ` [PATCH 4.14 04/21] scsi: qla2xxx: Return error when TMF returns Greg Kroah-Hartman
2018-08-07 18:51 ` [PATCH 4.14 05/21] genirq: Make force irq threading setup more robust Greg Kroah-Hartman
2018-08-07 18:51 ` [PATCH 4.14 06/21] nohz: Fix local_timer_softirq_pending() Greg Kroah-Hartman
2018-08-07 18:51 ` [PATCH 4.14 07/21] nohz: Fix missing tick reprogram when interrupting an inline softirq Greg Kroah-Hartman
2018-08-07 18:51 ` [PATCH 4.14 08/21] netlink: Dont shift on 64 for ngroups Greg Kroah-Hartman
2018-08-07 18:51 ` [PATCH 4.14 09/21] ext4: fix false negatives *and* false positives in ext4_check_descriptors() Greg Kroah-Hartman
2018-08-07 18:51 ` [PATCH 4.14 10/21] ACPI / PCI: Bail early in acpi_pci_add_bus() if there is no ACPI handle Greg Kroah-Hartman
2018-08-07 18:51 ` [PATCH 4.14 11/21] ring_buffer: tracing: Inherit the tracing setting to next ring buffer Greg Kroah-Hartman
2018-08-07 18:51 ` Greg Kroah-Hartman [this message]
2018-08-07 18:51 ` [PATCH 4.14 14/21] nvme-pci: allocate device queues storage space at probe Greg Kroah-Hartman
2018-08-07 18:51 ` [PATCH 4.14 15/21] nvme-pci: Fix queue double allocations Greg Kroah-Hartman
2018-08-07 18:51 ` [PATCH 4.14 16/21] nvmet-fc: fix target sgl list on large transfers Greg Kroah-Hartman
2018-08-07 18:51 ` [PATCH 4.14 17/21] intel_idle: Graceful probe failure when MWAIT is disabled Greg Kroah-Hartman
2018-08-07 18:51 ` [PATCH 4.14 18/21] xfs: catch inode allocation state mismatch corruption Greg Kroah-Hartman
2018-08-07 18:51 ` [PATCH 4.14 19/21] xfs: validate cached inodes are free when allocated Greg Kroah-Hartman
2018-08-07 18:51 ` [PATCH 4.14 20/21] xfs: dont call xfs_da_shrink_inode with NULL bp Greg Kroah-Hartman
2018-08-07 18:51 ` [PATCH 4.14 21/21] jfs: Fix inconsistency between memory allocation and ea_buf->max_size Greg Kroah-Hartman
2018-08-08  2:55 ` [PATCH 4.14 00/21] 4.14.62-stable review Shuah Khan
2018-08-08  5:22 ` Naresh Kamboju
2018-08-08 15:48 ` Guenter Roeck

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180807172331.570202527@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=dsterba@suse.com \
    --cc=fdmanana@suse.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=sudipm.mukherjee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).