From: Filipe Manana <fdmanana@suse.com>
To: linux-btrfs@vger.kernel.org
Cc: Filipe Manana <fdmanana@suse.com>
Subject: [PATCH] Btrfs: fix data loss in the fast fsync path
Date: Thu, 26 Feb 2015 19:52:57 +0000 [thread overview]
Message-ID: <1424980377-8060-1-git-send-email-fdmanana@suse.com> (raw)
When using the fast file fsync code path we can miss the fact that
there new writes happened since the last file fsync and therefore
return without waiting for the IO to finish and write the new extents
to the fsync log.
The comment this change adds to the fsync handler explains how this
can happen and why, and therefore it's included here:
"If the last transaction that changed this file was before the current
transaction and we have the full sync flag set in our inode, we can
bail out now without any syncing.
Note that we can't bail out if the full sync flag isn't set. This is
because when the full sync flag is set we start all ordered extents
and wait for them to fully complete - when they complete they update
the inode's last_trans field through:
btrfs_finish_ordered_io() ->
btrfs_update_inode_fallback() ->
btrfs_update_inode() ->
btrfs_set_inode_last_trans()
So we are sure that last_trans is up to date and can do this check to
bail out safely. For the fast path, when the full sync flag is not
set in our inode, we can not do it because we start only our ordered
extents and don't wait for them to complete (that is when
btrfs_finish_ordered_io runs) - if we rely on the speculative value
for inode->last_trans set by btrfs_file_write_iter we lose data in
the following scenario:
1. fs_info->last_trans_committed == N - 1 and current transaction is
transaction N (fs_info->generation == N);
2. do a buffered write;
3. fsync our inode, this clears our inode's full sync flag, starts
an ordered extent and waits for it to complete - when it completes
at btrfs_finish_ordered_io(), the inode's last_trans is set to
the value N;
4. transaction N is committed, so fs_info->last_trans_committed is
now set to the value N and fs_info->generation remains with the
value N;
5. do another buffered write, when this happens btrfs_file_write_iter
sets our inode's last_trans to the value N + 1;
6. transaction N + 1 is started and fs_info->generation now has the
value N + 1;
7. transaction N + 1 is committed, so fs_info->last_trans_committed
is set to the value N + 1;
8. fsync our inode - because it doesn't have the full sync flag set,
we only start the ordered extent, we don't wait for it to complete
(only in a later phase) therefore its last_trans field has the
value N + 1 set previously by btrfs_file_write_iter(), and so we
have:
inode->last_trans <= fs_info->last_trans_committed
(N + 1) (N + 1)
Which would make us not log the last buffered write, resulting in
data loss after a crash."
This is actually deterministic and not hard to trigger. The following excerpt
from a testcase I made for xfstests triggers the issue. It moves a dummy file
across directories and then fsyncs the old parent directory - this is just to
trigger a transaction commit, so moving files around isn't directly related
to the issue but it was chosen because running 'sync' for example does more
than just committing the current transaction, as it flushes/waits for all
file data to be persisted.
The body of the test is:
_scratch_mkfs >> $seqres.full 2>&1
_init_flakey
_mount_flakey
# Create our main test file 'foo', the one we check for data loss.
# By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
# bit from its flags (btrfs inode specific flags).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
-c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io
# Now create one other file and 2 directories. We will move this second file
# from one directory to the other later because it forces btrfs to commit its
# currently open transaction if we fsync the old parent directory. This is
# necessary to trigger the data loss bug that affected btrfs.
mkdir $SCRATCH_MNT/testdir_1
touch $SCRATCH_MNT/testdir_1/bar
mkdir $SCRATCH_MNT/testdir_2
# Make sure everything is durably persisted.
sync
# Write more 8Kb of data to our file.
$XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io
# Move our 'bar' file into a new directory.
mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar
# Fsync our first directory. Because it had a file moved into some other
# directory, this made btrfs commit the currently open transaction. This is
# a condition necessary to trigger the data loss bug.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1
# Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
# data we wrote previously to be persisted and available if a crash happens.
# This did not happen with btrfs, because of the transaction commit that
# happened when we fsynced the parent directory.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# Now check that all data we wrote before are available.
echo "File content after log replay:"
od -t x1 $SCRATCH_MNT/foo
status=0
exit
The expected golden output for the test, which is what we get with this
fix applied (or when running against ext3/4 and xfs), is:
wrote 8192/8192 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 8192/8192 bytes at offset 8192
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
*
0040000
Without this fix applied, the output shows the test file is empty:
wrote 8192/8192 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 8192/8192 bytes at offset 8192
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0020000
A test case for xfstests will be sent soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/file.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 58 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 2bd72cd..91122b6 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1971,14 +1971,67 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
}
/*
- * if the last transaction that changed this file was before
- * the current transaction, we can bail out now without any
- * syncing
+ * If the last transaction that changed this file was before the current
+ * transaction and we have the full sync flag set in our inode, we can
+ * bail out now without any syncing.
+ *
+ * Note that we can't bail out if the full sync flag isn't set. This is
+ * because when the full sync flag is set we start all ordered extents
+ * and wait for them to fully complete - when they complete they update
+ * the inode's last_trans field through:
+ *
+ * btrfs_finish_ordered_io() ->
+ * btrfs_update_inode_fallback() ->
+ * btrfs_update_inode() ->
+ * btrfs_set_inode_last_trans()
+ *
+ * So we are sure that last_trans is up to date and can do this check to
+ * bail out safely. For the fast path, when the full sync flag is not
+ * set in our inode, we can not do it because we start only our ordered
+ * extents and don't wait for them to complete (that is when
+ * btrfs_finish_ordered_io runs) - if we rely on the speculative value
+ * for inode->last_trans set by btrfs_file_write_iter we lose data in
+ * the following scenario:
+ *
+ * 1. fs_info->last_trans_committed == N - 1 and current transaction is
+ * transaction N (fs_info->generation == N);
+ *
+ * 2. do a buffered write;
+ *
+ * 3. fsync our inode, this clears our inode's full sync flag, starts
+ * an ordered extent and waits for it to complete - when it completes
+ * at btrfs_finish_ordered_io(), the inode's last_trans is set to
+ * the value N;
+ *
+ * 4. transaction N is committed, so fs_info->last_trans_committed is
+ * now set to the value N and fs_info->generation remains with the
+ * value N;
+ *
+ * 5. do another buffered write, when this happens btrfs_file_write_iter
+ * sets our inode's last_trans to the value N + 1;
+ *
+ * 6. transaction N + 1 is started and fs_info->generation now has the
+ * value N + 1;
+ *
+ * 7. transaction N + 1 is committed, so fs_info->last_trans_committed
+ * is set to the value N + 1;
+ *
+ * 8. fsync our inode - because it doesn't have the full sync flag set,
+ * we only start the ordered extent, we don't wait for it to complete
+ * (only in a later phase) therefore its last_trans field has the
+ * value N + 1 set previously by btrfs_file_write_iter(), and so we
+ * have:
+ *
+ * inode->last_trans <= fs_info->last_trans_committed
+ * (N + 1) (N + 1)
+ *
+ * Which would make us not log the last buffered write, resulting in
+ * data loss after a crash.
*/
smp_mb();
if (btrfs_inode_in_log(inode, root->fs_info->generation) ||
- BTRFS_I(inode)->last_trans <=
- root->fs_info->last_trans_committed) {
+ (full_sync && BTRFS_I(inode)->last_trans <=
+ root->fs_info->last_trans_committed)) {
BTRFS_I(inode)->last_trans = 0;
/*
--
2.1.3
next reply other threads:[~2015-02-26 19:53 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-02-26 19:52 Filipe Manana [this message]
2015-02-28 15:04 ` [PATCH] Btrfs: fix data loss in the fast fsync path Liu Bo
2015-03-01 9:08 ` [PATCH v2] " Filipe Manana
2015-03-03 0:41 ` Liu Bo
2015-03-03 11:15 ` Filipe David Manana
2015-03-03 13:31 ` Liu Bo
2015-03-01 20:36 ` [PATCH v3] " Filipe Manana
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1424980377-8060-1-git-send-email-fdmanana@suse.com \
--to=fdmanana@suse.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).