linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Filipe Manana <fdmanana@suse.com>
To: linux-btrfs@vger.kernel.org
Cc: Filipe Manana <fdmanana@suse.com>
Subject: [PATCH v2] Btrfs: fix fsync race leading to invalid data after log replay
Date: Tue,  2 Sep 2014 11:09:58 +0100	[thread overview]
Message-ID: <1409652598-27398-1-git-send-email-fdmanana@suse.com> (raw)
In-Reply-To: <1409418089-31790-1-git-send-email-fdmanana@suse.com>

When the fsync callback (btrfs_sync_file) starts, it first waits for
the writeback of any dirty pages to start and finish without holding
the inode's mutex (to reduce contention). After this it acquires the
inode's mutex and repeats that process via btrfs_wait_ordered_range
only if we're doing a full sync (BTRFS_INODE_NEEDS_FULL_SYNC flag
is set on the inode).

This is not safe for a non full sync - we need to start and wait for
writeback to finish for any pages that might have been made dirty
before acquiring the inode's mutex and after that first step mentioned
before. Why this is needed is explained by the following comment added
to btrfs_sync_file:

  "Right before acquiring the inode's mutex, we might have new
   writes dirtying pages, which won't immediately start the
   respective ordered operations - that is done through the
   fill_delalloc callbacks invoked from the writepage and
   writepages address space operations. So make sure we start
   all ordered operations before starting to log our inode. Not
   doing this means that while logging the inode, writeback
   could start and invoke writepage/writepages, which would call
   the fill_delalloc callbacks (cow_file_range,
   submit_compressed_extents). These callbacks add first an
   extent map to the modified list of extents and then create
   the respective ordered operation, which means in
   tree-log.c:btrfs_log_inode() we might capture all existing
   ordered operations (with btrfs_get_logged_extents()) before
   the fill_delalloc callback adds its ordered operation, and by
   the time we visit the modified list of extent maps (with
   btrfs_log_changed_extents()), we see and process the extent
   map they created. We then use the extent map to construct a
   file extent item for logging without waiting for the
   respective ordered operation to finish - this file extent
   item points to a disk location that might not have yet been
   written to, containing random data - so after a crash a log
   replay will make our inode have file extent items that point
   to disk locations containing invalid data, as we returned
   success to userspace without waiting for the respective
   ordered operation to finish, because it wasn't captured by
   btrfs_get_logged_extents()."

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Better comments and for the non full sync case, start only the
    ordered operations, instead of starting them and waiting for them
    to complete. Waiting for their completion is already done later by
    btrfs_sync_log(), so like this we can reduce some latency as some
    ordered operations might complete (or get closer to) while writing
    to the log tree (btrfs_log_dentry_safe).

 fs/btrfs/file.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 63 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e5534c1..5427ba8 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1865,6 +1865,20 @@ int btrfs_release_file(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+static int start_ordered_ops(struct inode *inode, loff_t start, loff_t end)
+{
+	int ret;
+
+	atomic_inc(&BTRFS_I(inode)->sync_writers);
+	ret = filemap_fdatawrite_range(inode->i_mapping, start, end);
+	if (!ret && test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
+			     &BTRFS_I(inode)->runtime_flags))
+		ret = filemap_fdatawrite_range(inode->i_mapping, start, end);
+	atomic_dec(&BTRFS_I(inode)->sync_writers);
+
+	return ret;
+}
+
 /*
  * fsync call for both files and directories.  This logs the inode into
  * the tree log instead of forcing full commits whenever possible.
@@ -1894,30 +1908,64 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 * multi-task, and make the performance up.  See
 	 * btrfs_wait_ordered_range for an explanation of the ASYNC check.
 	 */
-	atomic_inc(&BTRFS_I(inode)->sync_writers);
-	ret = filemap_fdatawrite_range(inode->i_mapping, start, end);
-	if (!ret && test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
-			     &BTRFS_I(inode)->runtime_flags))
-		ret = filemap_fdatawrite_range(inode->i_mapping, start, end);
-	atomic_dec(&BTRFS_I(inode)->sync_writers);
+	ret = start_ordered_ops(inode, start, end);
 	if (ret)
 		return ret;
 
 	mutex_lock(&inode->i_mutex);
-
-	/*
-	 * We flush the dirty pages again to avoid some dirty pages in the
-	 * range being left.
-	 */
 	atomic_inc(&root->log_batch);
 	full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
 			     &BTRFS_I(inode)->runtime_flags);
+	/*
+	 * We might have have had more pages made dirty after calling
+	 * start_ordered_ops and before acquiring the inode's i_mutex.
+	 */
 	if (full_sync) {
+		/*
+		 * For a full sync, we need to make sure any ordered operations
+		 * start and finish before we start logging the inode, so that
+		 * all extents are persisted and the respective file extent
+		 * items are in the fs/subvol btree.
+		 */
 		ret = btrfs_wait_ordered_range(inode, start, end - start + 1);
-		if (ret) {
-			mutex_unlock(&inode->i_mutex);
-			goto out;
-		}
+	} else {
+		/*
+		 * Start any new ordered operations before starting to log the
+		 * inode. We will wait for them to finish in btrfs_sync_log().
+		 *
+		 * Right before acquiring the inode's mutex, we might have new
+		 * writes dirtying pages, which won't immediately start the
+		 * respective ordered operations - that is done through the
+		 * fill_delalloc callbacks invoked from the writepage and
+		 * writepages address space operations. So make sure we start
+		 * all ordered operations before starting to log our inode. Not
+		 * doing this means that while logging the inode, writeback
+		 * could start and invoke writepage/writepages, which would call
+		 * the fill_delalloc callbacks (cow_file_range,
+		 * submit_compressed_extents). These callbacks add first an
+		 * extent map to the modified list of extents and then create
+		 * the respective ordered operation, which means in
+		 * tree-log.c:btrfs_log_inode() we might capture all existing
+		 * ordered operations (with btrfs_get_logged_extents()) before
+		 * the fill_delalloc callback adds its ordered operation, and by
+		 * the time we visit the modified list of extent maps (with
+		 * btrfs_log_changed_extents()), we see and process the extent
+		 * map they created. We then use the extent map to construct a
+		 * file extent item for logging without waiting for the
+		 * respective ordered operation to finish - this file extent
+		 * item points to a disk location that might not have yet been
+		 * written to, containing random data - so after a crash a log
+		 * replay will make our inode have file extent items that point
+		 * to disk locations containing invalid data, as we returned
+		 * success to userspace without waiting for the respective
+		 * ordered operation to finish, because it wasn't captured by
+		 * btrfs_get_logged_extents().
+		 */
+		ret = start_ordered_ops(inode, start, end);
+	}
+	if (ret) {
+		mutex_unlock(&inode->i_mutex);
+		goto out;
 	}
 	atomic_inc(&root->log_batch);
 
-- 
1.9.1


      reply	other threads:[~2014-09-02  9:13 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-08-30 17:01 [PATCH] Btrfs: fix fsync race leading to invalid data after log replay Filipe Manana
2014-09-02 10:09 ` Filipe Manana [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1409652598-27398-1-git-send-email-fdmanana@suse.com \
    --to=fdmanana@suse.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).