linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Filipe Manana <fdmanana@suse.com>
To: linux-btrfs@vger.kernel.org
Cc: Filipe Manana <fdmanana@suse.com>
Subject: [PATCH] Btrfs: fix fsync race leading to invalid data after log replay
Date: Sat, 30 Aug 2014 18:01:29 +0100	[thread overview]
Message-ID: <1409418089-31790-1-git-send-email-fdmanana@suse.com> (raw)

When the fsync callback (btrfs_sync_file) starts, it first waits for
the writeback of any dirty pages to start and finish without holding
the inode's mutex (to reduce contention). After this it acquires the
inode's mutex and repeats that process via btrfs_wait_ordered_range
only if we're doing a full sync (BTRFS_INODE_NEEDS_FULL_SYNC flag
is set on the inode).

This is not safe for a non full sync - we need to start and wait for
writeback to finish of any pages that might have been made dirty
before acquiring the inode's mutex and after that first step mentioned
before. Why this is needed is explained by the following comment added
to btrfs_sync_file:

    "Right before acquiring the inode's mutex, we might have new writes
     dirtying pages, which won't immediately start the respective ordered
     operations - that is done through the fill_delalloc callbacks invoked
     from the writepage and writepages address space operations. So make
     sure we start all ordered operations before starting to log our
     inode. Not doing this means that while logging the inode, writeback
     could start and invoke writepage/writepages, which would call the
     fill_delalloc callbacks (cow_file_range, submit_compressed_extents).
     These callbacks add first an extent map to the modified list of
     extents and then create the respective ordered operation, which means
     in tree-log.c:btrfs_log_inode() we might capture all existing ordered
     operations (with btrfs_get_logged_extents()) before the fill_delalloc
     callback adds its ordered operation, and by the time we visit the
     modified list of extent maps (with btrfs_log_changed_extents()), we
     see and process the extent map they created. We then use their extent
     map to construct a file extent item for logging without waiting for
     the respective ordered operation to finish - these file extent items
     point to a disk location that might not have yet been written to,
     containing random data - so after a crash a log replay will make our
     inode have file extent items that point to disk locations containing
     invalid data."

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c | 33 +++++++++++++++++++++++++++------
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e5534c1..5e9d108 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1912,12 +1912,33 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	atomic_inc(&root->log_batch);
 	full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
 			     &BTRFS_I(inode)->runtime_flags);
-	if (full_sync) {
-		ret = btrfs_wait_ordered_range(inode, start, end - start + 1);
-		if (ret) {
-			mutex_unlock(&inode->i_mutex);
-			goto out;
-		}
+	/*
+	 * Right before acquiring the inode's mutex, we might have new writes
+	 * dirtying pages, which won't immediately start the respective ordered
+	 * operations - that is done through the fill_delalloc callbacks invoked
+	 * from the writepage and writepages address space operations. So make
+	 * sure we start all ordered operations before starting to log our
+	 * inode. Not doing this means that while logging the inode, writeback
+	 * could start and invoke writepage/writepages, which would call the
+	 * fill_delalloc callbacks (cow_file_range, submit_compressed_extents).
+	 * These callbacks add first an extent map to the modified list of
+	 * extents and then create the respective ordered operation, which means
+	 * in tree-log.c:btrfs_log_inode() we might capture all existing ordered
+	 * operations (with btrfs_get_logged_extents()) before the fill_delalloc
+	 * callback adds its ordered operation, and by the time we visit the
+	 * modified list of extent maps (with btrfs_log_changed_extents()), we
+	 * see and process the extent map they created. We then use their extent
+	 * map to construct a file extent item for logging without waiting for
+	 * the respective ordered operation to finish - these file extent items
+	 * point to a disk location that might not have yet been written to,
+	 * containing random data - so after a crash a log replay will make our
+	 * inode have file extent items that point to disk locations containing
+	 * invalid data.
+	 */
+	ret = btrfs_wait_ordered_range(inode, start, end - start + 1);
+	if (ret) {
+		mutex_unlock(&inode->i_mutex);
+		goto out;
 	}
 	atomic_inc(&root->log_batch);
 
-- 
1.9.1


             reply	other threads:[~2014-08-30 16:01 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-08-30 17:01 Filipe Manana [this message]
2014-09-02 10:09 ` [PATCH v2] Btrfs: fix fsync race leading to invalid data after log replay Filipe Manana

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1409418089-31790-1-git-send-email-fdmanana@suse.com \
    --to=fdmanana@suse.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).