From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Filipe Manana <fdmanana@suse.com>, Qu Wenruo <wqu@suse.com>,
David Sterba <dsterba@suse.com>, Sasha Levin <sashal@kernel.org>,
clm@fb.com, josef@toxicpanda.com, linux-btrfs@vger.kernel.org
Subject: [PATCH AUTOSEL 6.9 10/44] btrfs: ensure fast fsync waits for ordered extents after a write failure
Date: Mon, 17 Jun 2024 09:19:23 -0400
Message-ID: <20240617132046.2587008-10-sashal@kernel.org>
In-Reply-To: <20240617132046.2587008-1-sashal@kernel.org>
From: Filipe Manana <fdmanana@suse.com>
[ Upstream commit f13e01b89daf42330a4a722f451e48c3e2edfc8d ]
If a write path in COW mode fails, either before submitting a bio for the
new extents or because an actual IO error happens, we can end up allowing
a fast fsync to log file extent items that point to unwritten extents.
This is because dropping the extent maps happens when completing ordered
extents, at btrfs_finish_one_ordered(), and the completion of an ordered
extent is executed in a work queue.
This can result in a fast fsync starting to log file extent items based
on existing extent maps before the ordered extents complete. The log then
has file extent items that point to unwritten extents, and the file ends
up corrupt if a crash happens afterwards and the log tree is replayed the
next time the fs is mounted.
This can happen for both direct IO writes and buffered writes.
For example consider a direct IO write, in COW mode, that fails at
btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an
error:
1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter
set to false, meaning an error happened;
2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR
flag;
3) btrfs_finish_ordered_extent() queues the completion of the ordered
extent - so that btrfs_finish_one_ordered() will be executed later in
a work queue. That function will drop extent maps in the range when
it's executed, since the extent maps point to unwritten locations
(signaled by the BTRFS_ORDERED_IOERR flag);
4) After calling btrfs_finish_ordered_extent() we keep going down the
write path and unlock the inode;
5) After that a fast fsync starts and locks the inode;
6) Before the work queue executes btrfs_finish_one_ordered(), the fsync
task sees the extent maps that point to the unwritten locations and
logs file extent items based on them - it does not know they are
unwritten, and the fast fsync path does not wait for ordered extents
to complete, which is an intentional behaviour in order to reduce
latency.
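The ordering problem in the steps above can be illustrated with a small
user-space model. This is only a sketch: every model_* name below is
hypothetical and none of these types are the kernel's structures. It shows
that an fsync which collects extent maps before the deferred completion
runs still sees the extent map of the failed write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an extent map; not the kernel structure. */
struct model_extent_map {
	bool present;	/* still in the inode's extent map tree */
	bool written;	/* the data actually reached disk */
};

/* Hypothetical stand-in for an ordered extent. */
struct model_ordered_extent {
	bool ioerr;			/* models BTRFS_ORDERED_IOERR */
	bool completion_pending;	/* queued, but not yet run */
	struct model_extent_map *em;
};

/* Steps 1-3: record the error and queue (but do not run) the completion. */
static void model_finish_ordered_extent(struct model_ordered_extent *oe,
					bool uptodate)
{
	if (!uptodate)
		oe->ioerr = true;
	oe->completion_pending = true;
}

/* What the work queue does later: drop extent maps of a failed write. */
static void model_finish_one_ordered(struct model_ordered_extent *oe)
{
	assert(oe->em != NULL);
	if (oe->ioerr)
		oe->em->present = false;
	oe->completion_pending = false;
}

/*
 * Step 6: a fast fsync logs every extent map it finds. Returns how many
 * of the logged items would point to unwritten data.
 */
static int model_fast_fsync_bad_items(const struct model_extent_map *ems,
				      size_t n)
{
	int bad = 0;

	for (size_t i = 0; i < n; i++)
		if (ems[i].present && !ems[i].written)
			bad++;
	return bad;
}
```

In this model, calling model_fast_fsync_bad_items() between
model_finish_ordered_extent() and model_finish_one_ordered() reports the
stale extent map, while calling it after the completion has run does not.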
For the buffered write case, here's one example:
1) A fast fsync begins, and it starts by flushing delalloc and waiting for
the writeback to complete by calling filemap_fdatawait_range();
2) Flushing the delalloc created a new extent map X;
3) During the writeback some IO error happened, and at the end io callback
(end_bbio_data_write()) we call btrfs_finish_ordered_extent(), which
sets the BTRFS_ORDERED_IOERR flag in the ordered extent and queues its
completion;
4) After queuing the ordered extent completion, the end io callback clears
the writeback flag from all pages (or folios), and from that moment the
fast fsync can proceed;
5) The fast fsync proceeds, sees extent map X and logs a file extent item
based on it, resulting in a log that points to an unwritten data
extent, because the ordered extent completion hasn't run yet: it
happens only after the logging.
To fix this make btrfs_finish_ordered_extent() set the inode flag
BTRFS_INODE_COW_WRITE_ERROR in case an error happened for a COW write,
so that a fast fsync will wait for ordered extent completion.
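The flag handshake can be sketched in user space as follows. This is a
simplified model under stated assumptions, not the kernel code: the
model_* names are hypothetical, a C11 atomic_bool stands in for the
BTRFS_INODE_COW_WRITE_ERROR bit, and atomic_exchange() plays the role of
test_and_clear_bit().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the in-memory inode; not struct btrfs_inode. */
struct model_inode {
	atomic_bool cow_write_error;	/* models BTRFS_INODE_COW_WRITE_ERROR */
	bool stale_extent_map;		/* extent map pointing to unwritten data */
};

/* Write side: a failed COW write leaves a stale extent map and sets the flag. */
static void model_cow_write_failed(struct model_inode *inode)
{
	inode->stale_extent_map = true;
	atomic_store(&inode->cow_write_error, true);
}

/* Models waiting for ordered extents: completion drops the stale map. */
static void model_wait_ordered_range(struct model_inode *inode)
{
	inode->stale_extent_map = false;
}

/*
 * Fsync side: test and clear the flag, and wait for ordered extents only
 * when it was set. Returns true if a stale extent map would be logged.
 */
static bool model_fast_fsync(struct model_inode *inode)
{
	if (atomic_exchange(&inode->cow_write_error, false))
		model_wait_ordered_range(inode);
	return inode->stale_extent_map;
}
```

The point of the design is that the common case, with no write error, keeps
the low latency fast path, and only an fsync that follows a failed COW
write pays for the wait.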
Note that this issue of using extent maps that point to unwritten
locations cannot happen for reads, because in read paths we start by
locking the extent range and waiting for any ordered extents in the range
to complete before looking for extent maps.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
fs/btrfs/btrfs_inode.h | 10 ++++++++++
fs/btrfs/file.c | 16 ++++++++++++++++
fs/btrfs/ordered-data.c | 31 +++++++++++++++++++++++++++++++
3 files changed, 57 insertions(+)
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 100020ca4658e..787ca2892d7a6 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -89,6 +89,16 @@ enum {
 	BTRFS_INODE_FREE_SPACE_INODE,
 	/* Set when there are no capabilities in XATTRs for the inode. */
 	BTRFS_INODE_NO_CAP_XATTR,
+	/*
+	 * Set if an error happened when doing a COW write before submitting a
+	 * bio or during writeback. Used for both buffered writes and direct IO
+	 * writes. This is to signal a fast fsync that it has to wait for
+	 * ordered extents to complete and therefore not log extent maps that
+	 * point to unwritten extents (when an ordered extent completes and it
+	 * has the BTRFS_ORDERED_IOERR flag set, it drops extent maps in its
+	 * range).
+	 */
+	BTRFS_INODE_COW_WRITE_ERROR,
 };
 
 /* in memory btrfs inode */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f9d76072398da..97f6133b6eee8 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1875,6 +1875,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 */
 	if (full_sync || btrfs_is_zoned(fs_info)) {
 		ret = btrfs_wait_ordered_range(inode, start, len);
+		clear_bit(BTRFS_INODE_COW_WRITE_ERROR, &BTRFS_I(inode)->runtime_flags);
 	} else {
 		/*
 		 * Get our ordered extents as soon as possible to avoid doing
@@ -1884,6 +1885,21 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		btrfs_get_ordered_extents_for_logging(BTRFS_I(inode),
 						      &ctx.ordered_extents);
 		ret = filemap_fdatawait_range(inode->i_mapping, start, end);
+		if (ret)
+			goto out_release_extents;
+
+		/*
+		 * Check and clear the BTRFS_INODE_COW_WRITE_ERROR now after
+		 * starting and waiting for writeback, because for buffered IO
+		 * it may have been set during the end IO callback
+		 * (end_bbio_data_write() -> btrfs_finish_ordered_extent()) in
+		 * case an error happened and we need to wait for ordered
+		 * extents to complete so that any extent maps that point to
+		 * unwritten locations are dropped and we don't log them.
+		 */
+		if (test_and_clear_bit(BTRFS_INODE_COW_WRITE_ERROR,
+				       &BTRFS_I(inode)->runtime_flags))
+			ret = btrfs_wait_ordered_range(inode, start, len);
 	}
 
 	if (ret)
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index c2a42bcde98e0..7dbf4162c75a5 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -382,6 +382,37 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
 	ret = can_finish_ordered_extent(ordered, page, file_offset, len, uptodate);
 	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
 
+	/*
+	 * If this is a COW write it means we created new extent maps for the
+	 * range and they point to unwritten locations if we got an error either
+	 * before submitting a bio or during IO.
+	 *
+	 * We have marked the ordered extent with BTRFS_ORDERED_IOERR, and we
+	 * are queuing its completion below. During completion, at
+	 * btrfs_finish_one_ordered(), we will drop the extent maps for the
+	 * unwritten extents.
+	 *
+	 * However because completion runs in a work queue we can end up having
+	 * a fast fsync running before that. In the case of direct IO, once we
+	 * unlock the inode the fsync might start, and we queue the completion
+	 * before unlocking the inode. In the case of buffered IO when writeback
+	 * finishes (end_bbio_data_write()) we queue the completion, so if the
+	 * writeback was triggered by a fast fsync, the fsync might start
+	 * logging before ordered extent completion runs in the work queue.
+	 *
+	 * The fast fsync will log file extent items based on the extent maps it
+	 * finds, so if by the time it collects extent maps the ordered extent
+	 * completion didn't happen yet, it will log file extent items that
+	 * point to unwritten extents, resulting in a corruption if a crash
+	 * happens and the log tree is replayed. Note that a fast fsync does not
+	 * wait for completion of ordered extents in order to reduce latency.
+	 *
+	 * Set a flag in the inode so that the next fast fsync will wait for
+	 * ordered extents to complete before starting to log.
+	 */
+	if (!uptodate && !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags))
+		set_bit(BTRFS_INODE_COW_WRITE_ERROR, &inode->runtime_flags);
+
 	if (ret)
 		btrfs_queue_ordered_fn(ordered);
 	return ret;
--
2.43.0