[PATCH] Btrfs: fix data loss in the fast fsync path

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] Btrfs: fix data loss in the fast fsync path
@ 2015-02-26 19:52 Filipe Manana
  2015-02-28 15:04 ` Liu Bo
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Filipe Manana @ 2015-02-26 19:52 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

When using the fast file fsync code path we can miss the fact that
there new writes happened since the last file fsync and therefore
return without waiting for the IO to finish and write the new extents
to the fsync log.

The comment this change adds to the fsync handler explains how this
can happen and why, and therefore it's included here:

  "If the last transaction that changed this file was before the current
   transaction and we have the full sync flag set in our inode, we can
   bail out now without any syncing.

   Note that we can't bail out if the full sync flag isn't set. This is
   because when the full sync flag is set we start all ordered extents
   and wait for them to fully complete - when they complete they update
   the inode's last_trans field through:

       btrfs_finish_ordered_io() ->
           btrfs_update_inode_fallback() ->
               btrfs_update_inode() ->
                   btrfs_set_inode_last_trans()

   So we are sure that last_trans is up to date and can do this check to
   bail out safely. For the fast path, when the full sync flag is not
   set in our inode, we can not do it because we start only our ordered
   extents and don't wait for them to complete (that is when
   btrfs_finish_ordered_io runs) - if we rely on the speculative value
   for inode->last_trans set by btrfs_file_write_iter we lose data in
   the following scenario:

   1. fs_info->last_trans_committed == N - 1 and current transaction is
      transaction N (fs_info->generation == N);

   2. do a buffered write;

   3. fsync our inode, this clears our inode's full sync flag, starts
      an ordered extent and waits for it to complete - when it completes
      at btrfs_finish_ordered_io(), the inode's last_trans is set to
      the value N;

   4. transaction N is committed, so fs_info->last_trans_committed is
      now set to the value N and fs_info->generation remains with the
      value N;

   5. do another buffered write, when this happens btrfs_file_write_iter
      sets our inode's last_trans to the value N + 1;

   6. transaction N + 1 is started and fs_info->generation now has the
      value N + 1;

   7. transaction N + 1 is committed, so fs_info->last_trans_committed
      is set to the value N + 1;

   8. fsync our inode - because it doesn't have the full sync flag set,
      we only start the ordered extent, we don't wait for it to complete
      (only in a later phase) therefore its last_trans field has the
      value N + 1 set previously by btrfs_file_write_iter(), and so we
      have:

          inode->last_trans <= fs_info->last_trans_committed
              (N + 1)              (N + 1)

      Which would make us not log the last buffered write, resulting in
      data loss after a crash."

This is actually deterministic and not hard to trigger. The following excerpt
from a testcase I made for xfstests triggers the issue. It moves a dummy file
across directories and then fsyncs the old parent directory - this is just to
trigger a transaction commit, so moving files around isn't directly related
to the issue but it was chosen because running 'sync' for example does more
than just committing the current transaction, as it flushes/waits for all
file data to be persisted.
The body of the test is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our main test file 'foo', the one we check for data loss.
  # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
  # bit from its flags (btrfs inode specific flags).
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
                  -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io

  # Now create one other file and 2 directories. We will move this second file
  # from one directory to the other later because it forces btrfs to commit its
  # currently open transaction if we fsync the old parent directory. This is
  # necessary to trigger the data loss bug that affected btrfs.
  mkdir $SCRATCH_MNT/testdir_1
  touch $SCRATCH_MNT/testdir_1/bar
  mkdir $SCRATCH_MNT/testdir_2

  # Make sure everything is durably persisted.
  sync

  # Write more 8Kb of data to our file.
  $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io

  # Move our 'bar' file into a new directory.
  mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar

  # Fsync our first directory. Because it had a file moved into some other
  # directory, this made btrfs commit the currently open transaction. This is
  # a condition necessary to trigger the data loss bug.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1

  # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
  # data we wrote previously to be persisted and available if a crash happens.
  # This did not happen with btrfs, because of the transaction commit that
  # happened when we fsynced the parent directory.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Now check that all data we wrote before are available.
  echo "File content after log replay:"
  od -t x1 $SCRATCH_MNT/foo

  status=0
  exit

The expected golden output for the test, which is what we get with this
fix applied (or when running against ext3/4 and xfs), is:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0040000

Without this fix applied, the output shows the test file is empty:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000

A test case for xfstests will be sent soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 58 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 2bd72cd..91122b6 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1971,14 +1971,67 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	}

 	/*
-	 * if the last transaction that changed this file was before
-	 * the current transaction, we can bail out now without any
-	 * syncing
+	 * If the last transaction that changed this file was before the current
+	 * transaction and we have the full sync flag set in our inode, we can
+	 * bail out now without any syncing.
+	 *
+	 * Note that we can't bail out if the full sync flag isn't set. This is
+	 * because when the full sync flag is set we start all ordered extents
+	 * and wait for them to fully complete - when they complete they update
+	 * the inode's last_trans field through:
+	 *
+	 *     btrfs_finish_ordered_io() ->
+	 *         btrfs_update_inode_fallback() ->
+	 *             btrfs_update_inode() ->
+	 *                 btrfs_set_inode_last_trans()
+	 *
+	 * So we are sure that last_trans is up to date and can do this check to
+	 * bail out safely. For the fast path, when the full sync flag is not
+	 * set in our inode, we can not do it because we start only our ordered
+	 * extents and don't wait for them to complete (that is when
+	 * btrfs_finish_ordered_io runs) - if we rely on the speculative value
+	 * for inode->last_trans set by btrfs_file_write_iter we lose data in
+	 * the following scenario:
+	 *
+	 * 1. fs_info->last_trans_committed == N - 1 and current transaction is
+	 *    transaction N (fs_info->generation == N);
+	 *
+	 * 2. do a buffered write;
+	 *
+	 * 3. fsync our inode, this clears our inode's full sync flag, starts
+	 *    an ordered extent and waits for it to complete - when it completes
+	 *    at btrfs_finish_ordered_io(), the inode's last_trans is set to
+	 *    the value N;
+	 *
+	 * 4. transaction N is committed, so fs_info->last_trans_committed is
+	 *    now set to the value N and fs_info->generation remains with the
+	 *    value N;
+	 *
+	 * 5. do another buffered write, when this happens btrfs_file_write_iter
+	 *    sets our inode's last_trans to the value N + 1;
+	 *
+	 * 6. transaction N + 1 is started and fs_info->generation now has the
+	 *    value N + 1;
+	 *
+	 * 7. transaction N + 1 is committed, so fs_info->last_trans_committed
+	 *    is set to the value N + 1;
+	 *
+	 * 8. fsync our inode - because it doesn't have the full sync flag set,
+	 *    we only start the ordered extent, we don't wait for it to complete
+	 *    (only in a later phase) therefore its last_trans field has the
+	 *    value N + 1 set previously by btrfs_file_write_iter(), and so we
+	 *    have:
+	 *
+	 *        inode->last_trans <= fs_info->last_trans_committed
+	 *            (N + 1)              (N + 1)
+	 *
+	 *    Which would make us not log the last buffered write, resulting in
+	 *    data loss after a crash.
 	 */
 	smp_mb();
 	if (btrfs_inode_in_log(inode, root->fs_info->generation) ||
-	    BTRFS_I(inode)->last_trans <=
-	    root->fs_info->last_trans_committed) {
+	    (full_sync && BTRFS_I(inode)->last_trans <=
+	     root->fs_info->last_trans_committed)) {
 		BTRFS_I(inode)->last_trans = 0;

 		/*
-- 
2.1.3

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] Btrfs: fix data loss in the fast fsync path
  2015-02-26 19:52 [PATCH] Btrfs: fix data loss in the fast fsync path Filipe Manana
@ 2015-02-28 15:04 ` Liu Bo
  2015-03-01  9:08 ` [PATCH v2] " Filipe Manana
  2015-03-01 20:36 ` [PATCH v3] " Filipe Manana
  2 siblings, 0 replies; 7+ messages in thread
From: Liu Bo @ 2015-02-28 15:04 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Thu, Feb 26, 2015 at 07:52:57PM +0000, Filipe Manana wrote:
> When using the fast file fsync code path we can miss the fact that
> there new writes happened since the last file fsync and therefore
> return without waiting for the IO to finish and write the new extents
> to the fsync log.
> 
> The comment this change adds to the fsync handler explains how this
> can happen and why, and therefore it's included here:
> 
>   "If the last transaction that changed this file was before the current
>    transaction and we have the full sync flag set in our inode, we can
>    bail out now without any syncing.
> 
>    Note that we can't bail out if the full sync flag isn't set. This is
>    because when the full sync flag is set we start all ordered extents
>    and wait for them to fully complete - when they complete they update
>    the inode's last_trans field through:
> 
>        btrfs_finish_ordered_io() ->
>            btrfs_update_inode_fallback() ->
>                btrfs_update_inode() ->
>                    btrfs_set_inode_last_trans()
> 
>    So we are sure that last_trans is up to date and can do this check to
>    bail out safely. For the fast path, when the full sync flag is not
>    set in our inode, we can not do it because we start only our ordered
>    extents and don't wait for them to complete (that is when
>    btrfs_finish_ordered_io runs) - if we rely on the speculative value
>    for inode->last_trans set by btrfs_file_write_iter we lose data in
>    the following scenario:
> 
>    1. fs_info->last_trans_committed == N - 1 and current transaction is
>       transaction N (fs_info->generation == N);
> 
>    2. do a buffered write;
> 
>    3. fsync our inode, this clears our inode's full sync flag, starts
>       an ordered extent and waits for it to complete - when it completes
>       at btrfs_finish_ordered_io(), the inode's last_trans is set to
>       the value N;
> 
>    4. transaction N is committed, so fs_info->last_trans_committed is
>       now set to the value N and fs_info->generation remains with the
>       value N;
> 
>    5. do another buffered write, when this happens btrfs_file_write_iter
>       sets our inode's last_trans to the value N + 1;
> 
>    6. transaction N + 1 is started and fs_info->generation now has the
>       value N + 1;
> 
>    7. transaction N + 1 is committed, so fs_info->last_trans_committed
>       is set to the value N + 1;
> 
>    8. fsync our inode - because it doesn't have the full sync flag set,
>       we only start the ordered extent, we don't wait for it to complete
>       (only in a later phase) therefore its last_trans field has the
>       value N + 1 set previously by btrfs_file_write_iter(), and so we
>       have:
> 
>           inode->last_trans <= fs_info->last_trans_committed
>               (N + 1)              (N + 1)
> 
>       Which would make us not log the last buffered write, resulting in
>       data loss after a crash."
> 
> This is actually deterministic and not hard to trigger. The following excerpt
> from a testcase I made for xfstests triggers the issue. It moves a dummy file
> across directories and then fsyncs the old parent directory - this is just to
> trigger a transaction commit, so moving files around isn't directly related
> to the issue but it was chosen because running 'sync' for example does more
> than just committing the current transaction, as it flushes/waits for all
> file data to be persisted.
> The body of the test is:
> 
>   _scratch_mkfs >> $seqres.full 2>&1
>   _init_flakey
>   _mount_flakey
> 
>   # Create our main test file 'foo', the one we check for data loss.
>   # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
>   # bit from its flags (btrfs inode specific flags).
>   $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
>                   -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io
> 
>   # Now create one other file and 2 directories. We will move this second file
>   # from one directory to the other later because it forces btrfs to commit its
>   # currently open transaction if we fsync the old parent directory. This is
>   # necessary to trigger the data loss bug that affected btrfs.
>   mkdir $SCRATCH_MNT/testdir_1
>   touch $SCRATCH_MNT/testdir_1/bar
>   mkdir $SCRATCH_MNT/testdir_2
> 
>   # Make sure everything is durably persisted.
>   sync
> 
>   # Write more 8Kb of data to our file.
>   $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io
> 
>   # Move our 'bar' file into a new directory.
>   mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar
> 
>   # Fsync our first directory. Because it had a file moved into some other
>   # directory, this made btrfs commit the currently open transaction. This is
>   # a condition necessary to trigger the data loss bug.
>   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1
> 
>   # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
>   # data we wrote previously to be persisted and available if a crash happens.
>   # This did not happen with btrfs, because of the transaction commit that
>   # happened when we fsynced the parent directory.
>   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
> 
>   # Simulate a crash/power loss.
>   _load_flakey_table $FLAKEY_DROP_WRITES
>   _unmount_flakey
> 
>   _load_flakey_table $FLAKEY_ALLOW_WRITES
>   _mount_flakey
> 
>   # Now check that all data we wrote before are available.
>   echo "File content after log replay:"
>   od -t x1 $SCRATCH_MNT/foo
> 
>   status=0
>   exit
> 
> The expected golden output for the test, which is what we get with this
> fix applied (or when running against ext3/4 and xfs), is:
> 
>   wrote 8192/8192 bytes at offset 0
>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   wrote 8192/8192 bytes at offset 8192
>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   File content after log replay:
>   0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>   *
>   0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>   *
>   0040000
> 
> Without this fix applied, the output shows the test file is empty:
> 
>   wrote 8192/8192 bytes at offset 0
>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   wrote 8192/8192 bytes at offset 8192
>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   File content after log replay:
>   0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>   *
>   0020000
> 
> A test case for xfstests will be sent soon.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/file.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 58 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 2bd72cd..91122b6 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1971,14 +1971,67 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  	}
>  
>  	/*
> -	 * if the last transaction that changed this file was before
> -	 * the current transaction, we can bail out now without any
> -	 * syncing
> +	 * If the last transaction that changed this file was before the current
> +	 * transaction and we have the full sync flag set in our inode, we can
> +	 * bail out now without any syncing.
> +	 *
> +	 * Note that we can't bail out if the full sync flag isn't set. This is
> +	 * because when the full sync flag is set we start all ordered extents
> +	 * and wait for them to fully complete - when they complete they update
> +	 * the inode's last_trans field through:
> +	 *
> +	 *     btrfs_finish_ordered_io() ->
> +	 *         btrfs_update_inode_fallback() ->
> +	 *             btrfs_update_inode() ->
> +	 *                 btrfs_set_inode_last_trans()
> +	 *
> +	 * So we are sure that last_trans is up to date and can do this check to
> +	 * bail out safely. For the fast path, when the full sync flag is not
> +	 * set in our inode, we can not do it because we start only our ordered
> +	 * extents and don't wait for them to complete (that is when
> +	 * btrfs_finish_ordered_io runs) - if we rely on the speculative value
> +	 * for inode->last_trans set by btrfs_file_write_iter we lose data in
> +	 * the following scenario:
> +	 *
> +	 * 1. fs_info->last_trans_committed == N - 1 and current transaction is
> +	 *    transaction N (fs_info->generation == N);
> +	 *
> +	 * 2. do a buffered write;
> +	 *
> +	 * 3. fsync our inode, this clears our inode's full sync flag, starts
> +	 *    an ordered extent and waits for it to complete - when it completes
> +	 *    at btrfs_finish_ordered_io(), the inode's last_trans is set to
> +	 *    the value N;
> +	 *
> +	 * 4. transaction N is committed, so fs_info->last_trans_committed is
> +	 *    now set to the value N and fs_info->generation remains with the
> +	 *    value N;
> +	 *
> +	 * 5. do another buffered write, when this happens btrfs_file_write_iter
> +	 *    sets our inode's last_trans to the value N + 1;

This problem proves that in step 5 it doesn't work any more by setting last_trans to "root->fs_info->generation + 1", we can remove that setting.

Others look good.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thanks,

-liubo
> +	 *
> +	 * 6. transaction N + 1 is started and fs_info->generation now has the
> +	 *    value N + 1;
> +	 *
> +	 * 7. transaction N + 1 is committed, so fs_info->last_trans_committed
> +	 *    is set to the value N + 1;
> +	 *
> +	 * 8. fsync our inode - because it doesn't have the full sync flag set,
> +	 *    we only start the ordered extent, we don't wait for it to complete
> +	 *    (only in a later phase) therefore its last_trans field has the
> +	 *    value N + 1 set previously by btrfs_file_write_iter(), and so we
> +	 *    have:
> +	 *
> +	 *        inode->last_trans <= fs_info->last_trans_committed
> +	 *            (N + 1)              (N + 1)
> +	 *
> +	 *    Which would make us not log the last buffered write, resulting in
> +	 *    data loss after a crash.
>  	 */
>  	smp_mb();
>  	if (btrfs_inode_in_log(inode, root->fs_info->generation) ||
> -	    BTRFS_I(inode)->last_trans <=
> -	    root->fs_info->last_trans_committed) {
> +	    (full_sync && BTRFS_I(inode)->last_trans <=
> +	     root->fs_info->last_trans_committed)) {
>  		BTRFS_I(inode)->last_trans = 0;
>  
>  		/*
> -- 
> 2.1.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v2] Btrfs: fix data loss in the fast fsync path
  2015-02-26 19:52 [PATCH] Btrfs: fix data loss in the fast fsync path Filipe Manana
  2015-02-28 15:04 ` Liu Bo
@ 2015-03-01  9:08 ` Filipe Manana
  2015-03-03  0:41   ` Liu Bo
  2015-03-01 20:36 ` [PATCH v3] " Filipe Manana
  2 siblings, 1 reply; 7+ messages in thread
From: Filipe Manana @ 2015-03-01  9:08 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana, stable

When using the fast file fsync code path we can miss the fact that new
writes happened since the last file fsync and therefore return without
waiting for the IO to finish and write the new extents to the fsync log.

Here's an example scenario where the fsync will miss the fact that new
file data exists that wasn't yet durably persisted:

1. fs_info->last_trans_committed == N - 1 and current transaction is
   transaction N (fs_info->generation == N);

2. do a buffered write;

3. fsync our inode, this clears our inode's full sync flag, starts
   an ordered extent and waits for it to complete - when it completes
   at btrfs_finish_ordered_io(), the inode's last_trans is set to the
   value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
   btrfs_set_inode_last_trans);

4. transaction N is committed, so fs_info->last_trans_committed is now
   set to the value N and fs_info->generation remains with the value N;

5. do another buffered write, when this happens btrfs_file_write_iter
   sets our inode's last_trans to the value N + 1 (that is
   fs_info->generation + 1 == N + 1);

6. transaction N + 1 is started and fs_info->generation now has the
   value N + 1;

7. transaction N + 1 is committed, so fs_info->last_trans_committed
   is set to the value N + 1;

8. fsync our inode - because it doesn't have the full sync flag set,
   we only start the ordered extent, we don't wait for it to complete
   (only in a later phase) therefore its last_trans field has the
   value N + 1 set previously by btrfs_file_write_iter(), and so we
   have:

       inode->last_trans <= fs_info->last_trans_committed
           (N + 1)              (N + 1)

   Which made us not log the last buffered write and exit the fsync
   handler immediately, returning success (0) to user space and resulting
   in data loss after a crash.

This can actually be triggered deterministically and the following excerpt
from a testcase I made for xfstests triggers the issue. It moves a dummy
file across directories and then fsyncs the old parent directory - this
is just to trigger a transaction commit, so moving files around isn't
directly related to the issue but it was chosen because running 'sync' for
example does more than just committing the current transaction, as it
flushes/waits for all file data to be persisted. The issue can also happen
at random periods, since the transaction kthread periodicaly commits the
current transaction (about every 30 seconds by default).
The body of the test is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our main test file 'foo', the one we check for data loss.
  # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
  # bit from its flags (btrfs inode specific flags).
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
                  -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io

  # Now create one other file and 2 directories. We will move this second file
  # from one directory to the other later because it forces btrfs to commit its
  # currently open transaction if we fsync the old parent directory. This is
  # necessary to trigger the data loss bug that affected btrfs.
  mkdir $SCRATCH_MNT/testdir_1
  touch $SCRATCH_MNT/testdir_1/bar
  mkdir $SCRATCH_MNT/testdir_2

  # Make sure everything is durably persisted.
  sync

  # Write more 8Kb of data to our file.
  $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io

  # Move our 'bar' file into a new directory.
  mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar

  # Fsync our first directory. Because it had a file moved into some other
  # directory, this made btrfs commit the currently open transaction. This is
  # a condition necessary to trigger the data loss bug.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1

  # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
  # data we wrote previously to be persisted and available if a crash happens.
  # This did not happen with btrfs, because of the transaction commit that
  # happened when we fsynced the parent directory.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Now check that all data we wrote before are available.
  echo "File content after log replay:"
  od -t x1 $SCRATCH_MNT/foo

  status=0
  exit

The expected golden output for the test, which is what we get with this
fix applied (or when running against ext3/4 and xfs), is:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0040000

Without this fix applied, the output shows the test file does not have
the second 8Kb extent that we successfully fsynced:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000

So fix this by skipping the fsync only if we're doing a full sync and
if the inode's last_trans is <= fs_info->last_trans_committed, or if
the inode is already in the log. Also remove setting the inode's
last_trans in btrfs_file_write_iter since it's useless/unreliable.

A test case for xfstests will be sent soon.

CC: <stable@vger.kernel.org>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Removed dead assignment of inode->last_trans in btrfs_file_write_iter
    (and the respective comment) since it's useless now. Added stable to
    cc because it's a data loss fix.

 fs/btrfs/file.c | 45 ++++++++++++++++++++++++++++-----------------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 2bd72cd..b7334c9 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1811,22 +1811,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	mutex_unlock(&inode->i_mutex);

 	/*
-	 * we want to make sure fsync finds this change
-	 * but we haven't joined a transaction running right now.
-	 *
-	 * Later on, someone is sure to update the inode and get the
-	 * real transid recorded.
-	 *
-	 * We set last_trans now to the fs_info generation + 1,
-	 * this will either be one more than the running transaction
-	 * or the generation used for the next transaction if there isn't
-	 * one running right now.
-	 *
 	 * We also have to set last_sub_trans to the current log transid,
 	 * otherwise subsequent syncs to a file that's been synced in this
 	 * transaction will appear to have already occured.
 	 */
-	BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
 	BTRFS_I(inode)->last_sub_trans = root->log_transid;
 	if (num_written > 0) {
 		err = generic_write_sync(file, pos, num_written);
@@ -1971,14 +1959,37 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	}

 	/*
-	 * if the last transaction that changed this file was before
-	 * the current transaction, we can bail out now without any
-	 * syncing
+	 * If the last transaction that changed this file was before the current
+	 * transaction and we have the full sync flag set in our inode, we can
+	 * bail out now without any syncing.
+	 *
+	 * Note that we can't bail out if the full sync flag isn't set. This is
+	 * because when the full sync flag is set we start all ordered extents
+	 * and wait for them to fully complete - when they complete they update
+	 * the inode's last_trans field through:
+	 *
+	 *     btrfs_finish_ordered_io() ->
+	 *         btrfs_update_inode_fallback() ->
+	 *             btrfs_update_inode() ->
+	 *                 btrfs_set_inode_last_trans()
+	 *
+	 * So we are sure that last_trans is up to date and can do this check to
+	 * bail out safely. For the fast path, when the full sync flag is not
+	 * set in our inode, we can not do it because we start only our ordered
+	 * extents and don't wait for them to complete (that is when
+	 * btrfs_finish_ordered_io runs), so here at this point their last_trans
+	 * value might be less than or equals to fs_info->last_trans_committed,
+	 * and setting a speculative last_trans for an inode when a buffered
+	 * write is made (such as fs_info->generation + 1 for example) would not
+	 * be reliable since after setting the value and before fsync is called
+	 * any number of transactions can start and commit (transaction kthread
+	 * commits the current transaction periodically), and a transaction
+	 * commit does not start nor waits for ordered extents to complete.
 	 */
 	smp_mb();
 	if (btrfs_inode_in_log(inode, root->fs_info->generation) ||
-	    BTRFS_I(inode)->last_trans <=
-	    root->fs_info->last_trans_committed) {
+	    (full_sync && BTRFS_I(inode)->last_trans <=
+	     root->fs_info->last_trans_committed)) {
 		BTRFS_I(inode)->last_trans = 0;

 		/*
-- 
2.1.3

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] Btrfs: fix data loss in the fast fsync path
  2015-03-01  9:08 ` [PATCH v2] " Filipe Manana
@ 2015-03-03  0:41   ` Liu Bo
  2015-03-03 11:15     ` Filipe David Manana
  0 siblings, 1 reply; 7+ messages in thread
From: Liu Bo @ 2015-03-03  0:41 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, stable

On Sun, Mar 01, 2015 at 09:08:38AM +0000, Filipe Manana wrote:
> When using the fast file fsync code path we can miss the fact that new
> writes happened since the last file fsync and therefore return without
> waiting for the IO to finish and write the new extents to the fsync log.
> 
> Here's an example scenario where the fsync will miss the fact that new
> file data exists that wasn't yet durably persisted:
> 
> 1. fs_info->last_trans_committed == N - 1 and current transaction is
>    transaction N (fs_info->generation == N);
> 
> 2. do a buffered write;
> 
> 3. fsync our inode, this clears our inode's full sync flag, starts
>    an ordered extent and waits for it to complete - when it completes
>    at btrfs_finish_ordered_io(), the inode's last_trans is set to the
>    value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
>    btrfs_set_inode_last_trans);
> 
> 4. transaction N is committed, so fs_info->last_trans_committed is now
>    set to the value N and fs_info->generation remains with the value N;
> 
> 5. do another buffered write, when this happens btrfs_file_write_iter
>    sets our inode's last_trans to the value N + 1 (that is
>    fs_info->generation + 1 == N + 1);
> 
> 6. transaction N + 1 is started and fs_info->generation now has the
>    value N + 1;
> 
> 7. transaction N + 1 is committed, so fs_info->last_trans_committed
>    is set to the value N + 1;
> 
> 8. fsync our inode - because it doesn't have the full sync flag set,
>    we only start the ordered extent, we don't wait for it to complete
>    (only in a later phase) therefore its last_trans field has the
>    value N + 1 set previously by btrfs_file_write_iter(), and so we
>    have:
> 
>        inode->last_trans <= fs_info->last_trans_committed
>            (N + 1)              (N + 1)
> 
>    Which made us not log the last buffered write and exit the fsync
>    handler immediately, returning success (0) to user space and resulting
>    in data loss after a crash.
> 
> This can actually be triggered deterministically and the following excerpt
> from a testcase I made for xfstests triggers the issue. It moves a dummy
> file across directories and then fsyncs the old parent directory - this
> is just to trigger a transaction commit, so moving files around isn't
> directly related to the issue but it was chosen because running 'sync' for
> example does more than just committing the current transaction, as it
> flushes/waits for all file data to be persisted. The issue can also happen
> at random periods, since the transaction kthread periodicaly commits the
> current transaction (about every 30 seconds by default).
> The body of the test is:
> 
>   _scratch_mkfs >> $seqres.full 2>&1
>   _init_flakey
>   _mount_flakey
> 
>   # Create our main test file 'foo', the one we check for data loss.
>   # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
>   # bit from its flags (btrfs inode specific flags).
>   $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
>                   -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io
> 
>   # Now create one other file and 2 directories. We will move this second file
>   # from one directory to the other later because it forces btrfs to commit its
>   # currently open transaction if we fsync the old parent directory. This is
>   # necessary to trigger the data loss bug that affected btrfs.
>   mkdir $SCRATCH_MNT/testdir_1
>   touch $SCRATCH_MNT/testdir_1/bar
>   mkdir $SCRATCH_MNT/testdir_2
> 
>   # Make sure everything is durably persisted.
>   sync
> 
>   # Write more 8Kb of data to our file.
>   $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io
> 
>   # Move our 'bar' file into a new directory.
>   mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar
> 
>   # Fsync our first directory. Because it had a file moved into some other
>   # directory, this made btrfs commit the currently open transaction. This is
>   # a condition necessary to trigger the data loss bug.
>   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1
> 
>   # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
>   # data we wrote previously to be persisted and available if a crash happens.
>   # This did not happen with btrfs, because of the transaction commit that
>   # happened when we fsynced the parent directory.
>   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
> 
>   # Simulate a crash/power loss.
>   _load_flakey_table $FLAKEY_DROP_WRITES
>   _unmount_flakey
> 
>   _load_flakey_table $FLAKEY_ALLOW_WRITES
>   _mount_flakey
> 
>   # Now check that all data we wrote before are available.
>   echo "File content after log replay:"
>   od -t x1 $SCRATCH_MNT/foo
> 
>   status=0
>   exit
> 
> The expected golden output for the test, which is what we get with this
> fix applied (or when running against ext3/4 and xfs), is:
> 
>   wrote 8192/8192 bytes at offset 0
>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   wrote 8192/8192 bytes at offset 8192
>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   File content after log replay:
>   0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>   *
>   0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>   *
>   0040000
> 
> Without this fix applied, the output shows the test file does not have
> the second 8Kb extent that we successfully fsynced:
> 
>   wrote 8192/8192 bytes at offset 0
>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   wrote 8192/8192 bytes at offset 8192
>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   File content after log replay:
>   0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>   *
>   0020000
> 
> So fix this by skipping the fsync only if we're doing a full sync and
> if the inode's last_trans is <= fs_info->last_trans_committed, or if
> the inode is already in the log. Also remove setting the inode's
> last_trans in btrfs_file_write_iter since it's useless/unreliable.
> 
> A test case for xfstests will be sent soon.
> 
> CC: <stable@vger.kernel.org>
> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> 
> V2: Removed dead assignment of inode->last_trans in btrfs_file_write_iter
>     (and the respective comment) since it's useless now. Added stable to
>     cc because it's a data loss fix.
> 
>  fs/btrfs/file.c | 45 ++++++++++++++++++++++++++++-----------------
>  1 file changed, 28 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 2bd72cd..b7334c9 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1811,22 +1811,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>  	mutex_unlock(&inode->i_mutex);
>  
>  	/*
> -	 * we want to make sure fsync finds this change
> -	 * but we haven't joined a transaction running right now.
> -	 *
> -	 * Later on, someone is sure to update the inode and get the
> -	 * real transid recorded.
> -	 *
> -	 * We set last_trans now to the fs_info generation + 1,
> -	 * this will either be one more than the running transaction
> -	 * or the generation used for the next transaction if there isn't
> -	 * one running right now.
> -	 *
>  	 * We also have to set last_sub_trans to the current log transid,
>  	 * otherwise subsequent syncs to a file that's been synced in this
>  	 * transaction will appear to have already occured.
>  	 */
> -	BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;

By thinking twice about it, how about setting ->last_trans with (-1ULL)?

So the benefit is that if new writes have already finished its endio where
calling btrfs_set_inode_last_trans() to set ->last_trans with a transid
at that age, we may get a win for skipping log part if someone else has
updated ->last_trans_committed.

By limiting it to 'full_sync' case we lose the above opportunity.

Thanks,

-liubo
>  	BTRFS_I(inode)->last_sub_trans = root->log_transid;
>  	if (num_written > 0) {
>  		err = generic_write_sync(file, pos, num_written);
> @@ -1971,14 +1959,37 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  	}
>  
>  	/*
> -	 * if the last transaction that changed this file was before
> -	 * the current transaction, we can bail out now without any
> -	 * syncing
> +	 * If the last transaction that changed this file was before the current
> +	 * transaction and we have the full sync flag set in our inode, we can
> +	 * bail out now without any syncing.
> +	 *
> +	 * Note that we can't bail out if the full sync flag isn't set. This is
> +	 * because when the full sync flag is set we start all ordered extents
> +	 * and wait for them to fully complete - when they complete they update
> +	 * the inode's last_trans field through:
> +	 *
> +	 *     btrfs_finish_ordered_io() ->
> +	 *         btrfs_update_inode_fallback() ->
> +	 *             btrfs_update_inode() ->
> +	 *                 btrfs_set_inode_last_trans()
> +	 *
> +	 * So we are sure that last_trans is up to date and can do this check to
> +	 * bail out safely. For the fast path, when the full sync flag is not
> +	 * set in our inode, we can not do it because we start only our ordered
> +	 * extents and don't wait for them to complete (that is when
> +	 * btrfs_finish_ordered_io runs), so here at this point their last_trans
> +	 * value might be less than or equals to fs_info->last_trans_committed,
> +	 * and setting a speculative last_trans for an inode when a buffered
> +	 * write is made (such as fs_info->generation + 1 for example) would not
> +	 * be reliable since after setting the value and before fsync is called
> +	 * any number of transactions can start and commit (transaction kthread
> +	 * commits the current transaction periodically), and a transaction
> +	 * commit does not start nor waits for ordered extents to complete.
>  	 */
>  	smp_mb();
>  	if (btrfs_inode_in_log(inode, root->fs_info->generation) ||
> -	    BTRFS_I(inode)->last_trans <=
> -	    root->fs_info->last_trans_committed) {
> +	    (full_sync && BTRFS_I(inode)->last_trans <=
> +	     root->fs_info->last_trans_committed)) {
>  		BTRFS_I(inode)->last_trans = 0;
>  
>  		/*
> -- 
> 2.1.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] Btrfs: fix data loss in the fast fsync path
  2015-03-03  0:41   ` Liu Bo
@ 2015-03-03 11:15     ` Filipe David Manana
  2015-03-03 13:31       ` Liu Bo
  0 siblings, 1 reply; 7+ messages in thread
From: Filipe David Manana @ 2015-03-03 11:15 UTC (permalink / raw)
  To: Liu Bo; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org, stable

On Tue, Mar 3, 2015 at 12:41 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Sun, Mar 01, 2015 at 09:08:38AM +0000, Filipe Manana wrote:
>> When using the fast file fsync code path we can miss the fact that new
>> writes happened since the last file fsync and therefore return without
>> waiting for the IO to finish and write the new extents to the fsync log.
>>
>> Here's an example scenario where the fsync will miss the fact that new
>> file data exists that wasn't yet durably persisted:
>>
>> 1. fs_info->last_trans_committed == N - 1 and current transaction is
>>    transaction N (fs_info->generation == N);
>>
>> 2. do a buffered write;
>>
>> 3. fsync our inode, this clears our inode's full sync flag, starts
>>    an ordered extent and waits for it to complete - when it completes
>>    at btrfs_finish_ordered_io(), the inode's last_trans is set to the
>>    value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
>>    btrfs_set_inode_last_trans);
>>
>> 4. transaction N is committed, so fs_info->last_trans_committed is now
>>    set to the value N and fs_info->generation remains with the value N;
>>
>> 5. do another buffered write, when this happens btrfs_file_write_iter
>>    sets our inode's last_trans to the value N + 1 (that is
>>    fs_info->generation + 1 == N + 1);
>>
>> 6. transaction N + 1 is started and fs_info->generation now has the
>>    value N + 1;
>>
>> 7. transaction N + 1 is committed, so fs_info->last_trans_committed
>>    is set to the value N + 1;
>>
>> 8. fsync our inode - because it doesn't have the full sync flag set,
>>    we only start the ordered extent, we don't wait for it to complete
>>    (only in a later phase) therefore its last_trans field has the
>>    value N + 1 set previously by btrfs_file_write_iter(), and so we
>>    have:
>>
>>        inode->last_trans <= fs_info->last_trans_committed
>>            (N + 1)              (N + 1)
>>
>>    Which made us not log the last buffered write and exit the fsync
>>    handler immediately, returning success (0) to user space and resulting
>>    in data loss after a crash.
>>
>> This can actually be triggered deterministically and the following excerpt
>> from a testcase I made for xfstests triggers the issue. It moves a dummy
>> file across directories and then fsyncs the old parent directory - this
>> is just to trigger a transaction commit, so moving files around isn't
>> directly related to the issue but it was chosen because running 'sync' for
>> example does more than just committing the current transaction, as it
>> flushes/waits for all file data to be persisted. The issue can also happen
>> at random periods, since the transaction kthread periodicaly commits the
>> current transaction (about every 30 seconds by default).
>> The body of the test is:
>>
>>   _scratch_mkfs >> $seqres.full 2>&1
>>   _init_flakey
>>   _mount_flakey
>>
>>   # Create our main test file 'foo', the one we check for data loss.
>>   # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
>>   # bit from its flags (btrfs inode specific flags).
>>   $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
>>                   -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io
>>
>>   # Now create one other file and 2 directories. We will move this second file
>>   # from one directory to the other later because it forces btrfs to commit its
>>   # currently open transaction if we fsync the old parent directory. This is
>>   # necessary to trigger the data loss bug that affected btrfs.
>>   mkdir $SCRATCH_MNT/testdir_1
>>   touch $SCRATCH_MNT/testdir_1/bar
>>   mkdir $SCRATCH_MNT/testdir_2
>>
>>   # Make sure everything is durably persisted.
>>   sync
>>
>>   # Write more 8Kb of data to our file.
>>   $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io
>>
>>   # Move our 'bar' file into a new directory.
>>   mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar
>>
>>   # Fsync our first directory. Because it had a file moved into some other
>>   # directory, this made btrfs commit the currently open transaction. This is
>>   # a condition necessary to trigger the data loss bug.
>>   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1
>>
>>   # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
>>   # data we wrote previously to be persisted and available if a crash happens.
>>   # This did not happen with btrfs, because of the transaction commit that
>>   # happened when we fsynced the parent directory.
>>   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
>>
>>   # Simulate a crash/power loss.
>>   _load_flakey_table $FLAKEY_DROP_WRITES
>>   _unmount_flakey
>>
>>   _load_flakey_table $FLAKEY_ALLOW_WRITES
>>   _mount_flakey
>>
>>   # Now check that all data we wrote before are available.
>>   echo "File content after log replay:"
>>   od -t x1 $SCRATCH_MNT/foo
>>
>>   status=0
>>   exit
>>
>> The expected golden output for the test, which is what we get with this
>> fix applied (or when running against ext3/4 and xfs), is:
>>
>>   wrote 8192/8192 bytes at offset 0
>>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>   wrote 8192/8192 bytes at offset 8192
>>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>   File content after log replay:
>>   0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>   *
>>   0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>>   *
>>   0040000
>>
>> Without this fix applied, the output shows the test file does not have
>> the second 8Kb extent that we successfully fsynced:
>>
>>   wrote 8192/8192 bytes at offset 0
>>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>   wrote 8192/8192 bytes at offset 8192
>>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>   File content after log replay:
>>   0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>>   *
>>   0020000
>>
>> So fix this by skipping the fsync only if we're doing a full sync and
>> if the inode's last_trans is <= fs_info->last_trans_committed, or if
>> the inode is already in the log. Also remove setting the inode's
>> last_trans in btrfs_file_write_iter since it's useless/unreliable.
>>
>> A test case for xfstests will be sent soon.
>>
>> CC: <stable@vger.kernel.org>
>> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>
>> V2: Removed dead assignment of inode->last_trans in btrfs_file_write_iter
>>     (and the respective comment) since it's useless now. Added stable to
>>     cc because it's a data loss fix.
>>
>>  fs/btrfs/file.c | 45 ++++++++++++++++++++++++++++-----------------
>>  1 file changed, 28 insertions(+), 17 deletions(-)
>>
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index 2bd72cd..b7334c9 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -1811,22 +1811,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>>       mutex_unlock(&inode->i_mutex);
>>
>>       /*
>> -      * we want to make sure fsync finds this change
>> -      * but we haven't joined a transaction running right now.
>> -      *
>> -      * Later on, someone is sure to update the inode and get the
>> -      * real transid recorded.
>> -      *
>> -      * We set last_trans now to the fs_info generation + 1,
>> -      * this will either be one more than the running transaction
>> -      * or the generation used for the next transaction if there isn't
>> -      * one running right now.
>> -      *
>>        * We also have to set last_sub_trans to the current log transid,
>>        * otherwise subsequent syncs to a file that's been synced in this
>>        * transaction will appear to have already occured.
>>        */
>> -     BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
>
> By thinking twice about it, how about setting ->last_trans with (-1ULL)?
>
> So the benefit is that if new writes have already finished its endio where
> calling btrfs_set_inode_last_trans() to set ->last_trans with a transid
> at that age, we may get a win for skipping log part if someone else has
> updated ->last_trans_committed.
>
> By limiting it to 'full_sync' case we lose the above opportunity.

That still won't work.

Imagine the following the scenario:

1) do 2 buffered writes to 2 different ranges of the inode - the
inode's last_trans is set to (u64)-1;

2) writepages is called against the first range only (either the VM
called it due to memory pressure or a ranged fsync like msync for
example);

3) the ordered extent started by the previous writepages calls
completes and sets inode->last_trans to N (N == current transaction
id/generation);

4) transaction N commits;

5) fsync the file (either whole range or a range covering only the
second dirty range) - this will bail out since last_trans ==
last_trans_committed, not logging the second dirty range.

thanks


>
> Thanks,
>
> -liubo
>>       BTRFS_I(inode)->last_sub_trans = root->log_transid;
>>       if (num_written > 0) {
>>               err = generic_write_sync(file, pos, num_written);
>> @@ -1971,14 +1959,37 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>>       }
>>
>>       /*
>> -      * if the last transaction that changed this file was before
>> -      * the current transaction, we can bail out now without any
>> -      * syncing
>> +      * If the last transaction that changed this file was before the current
>> +      * transaction and we have the full sync flag set in our inode, we can
>> +      * bail out now without any syncing.
>> +      *
>> +      * Note that we can't bail out if the full sync flag isn't set. This is
>> +      * because when the full sync flag is set we start all ordered extents
>> +      * and wait for them to fully complete - when they complete they update
>> +      * the inode's last_trans field through:
>> +      *
>> +      *     btrfs_finish_ordered_io() ->
>> +      *         btrfs_update_inode_fallback() ->
>> +      *             btrfs_update_inode() ->
>> +      *                 btrfs_set_inode_last_trans()
>> +      *
>> +      * So we are sure that last_trans is up to date and can do this check to
>> +      * bail out safely. For the fast path, when the full sync flag is not
>> +      * set in our inode, we can not do it because we start only our ordered
>> +      * extents and don't wait for them to complete (that is when
>> +      * btrfs_finish_ordered_io runs), so here at this point their last_trans
>> +      * value might be less than or equals to fs_info->last_trans_committed,
>> +      * and setting a speculative last_trans for an inode when a buffered
>> +      * write is made (such as fs_info->generation + 1 for example) would not
>> +      * be reliable since after setting the value and before fsync is called
>> +      * any number of transactions can start and commit (transaction kthread
>> +      * commits the current transaction periodically), and a transaction
>> +      * commit does not start nor waits for ordered extents to complete.
>>        */
>>       smp_mb();
>>       if (btrfs_inode_in_log(inode, root->fs_info->generation) ||
>> -         BTRFS_I(inode)->last_trans <=
>> -         root->fs_info->last_trans_committed) {
>> +         (full_sync && BTRFS_I(inode)->last_trans <=
>> +          root->fs_info->last_trans_committed)) {
>>               BTRFS_I(inode)->last_trans = 0;
>>
>>               /*
>> --
>> 2.1.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] Btrfs: fix data loss in the fast fsync path
  2015-03-03 11:15     ` Filipe David Manana
@ 2015-03-03 13:31       ` Liu Bo
  0 siblings, 0 replies; 7+ messages in thread
From: Liu Bo @ 2015-03-03 13:31 UTC (permalink / raw)
  To: Filipe David Manana; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org, stable

On Tue, Mar 03, 2015 at 11:15:17AM +0000, Filipe David Manana wrote:
> On Tue, Mar 3, 2015 at 12:41 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> > On Sun, Mar 01, 2015 at 09:08:38AM +0000, Filipe Manana wrote:
> >> When using the fast file fsync code path we can miss the fact that new
> >> writes happened since the last file fsync and therefore return without
> >> waiting for the IO to finish and write the new extents to the fsync log.
> >>
> >> Here's an example scenario where the fsync will miss the fact that new
> >> file data exists that wasn't yet durably persisted:
> >>
> >> 1. fs_info->last_trans_committed == N - 1 and current transaction is
> >>    transaction N (fs_info->generation == N);
> >>
> >> 2. do a buffered write;
> >>
> >> 3. fsync our inode, this clears our inode's full sync flag, starts
> >>    an ordered extent and waits for it to complete - when it completes
> >>    at btrfs_finish_ordered_io(), the inode's last_trans is set to the
> >>    value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
> >>    btrfs_set_inode_last_trans);
> >>
> >> 4. transaction N is committed, so fs_info->last_trans_committed is now
> >>    set to the value N and fs_info->generation remains with the value N;
> >>
> >> 5. do another buffered write, when this happens btrfs_file_write_iter
> >>    sets our inode's last_trans to the value N + 1 (that is
> >>    fs_info->generation + 1 == N + 1);
> >>
> >> 6. transaction N + 1 is started and fs_info->generation now has the
> >>    value N + 1;
> >>
> >> 7. transaction N + 1 is committed, so fs_info->last_trans_committed
> >>    is set to the value N + 1;
> >>
> >> 8. fsync our inode - because it doesn't have the full sync flag set,
> >>    we only start the ordered extent, we don't wait for it to complete
> >>    (only in a later phase) therefore its last_trans field has the
> >>    value N + 1 set previously by btrfs_file_write_iter(), and so we
> >>    have:
> >>
> >>        inode->last_trans <= fs_info->last_trans_committed
> >>            (N + 1)              (N + 1)
> >>
> >>    Which made us not log the last buffered write and exit the fsync
> >>    handler immediately, returning success (0) to user space and resulting
> >>    in data loss after a crash.
> >>
> >> This can actually be triggered deterministically and the following excerpt
> >> from a testcase I made for xfstests triggers the issue. It moves a dummy
> >> file across directories and then fsyncs the old parent directory - this
> >> is just to trigger a transaction commit, so moving files around isn't
> >> directly related to the issue but it was chosen because running 'sync' for
> >> example does more than just committing the current transaction, as it
> >> flushes/waits for all file data to be persisted. The issue can also happen
> >> at random periods, since the transaction kthread periodicaly commits the
> >> current transaction (about every 30 seconds by default).
> >> The body of the test is:
> >>
> >>   _scratch_mkfs >> $seqres.full 2>&1
> >>   _init_flakey
> >>   _mount_flakey
> >>
> >>   # Create our main test file 'foo', the one we check for data loss.
> >>   # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
> >>   # bit from its flags (btrfs inode specific flags).
> >>   $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
> >>                   -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io
> >>
> >>   # Now create one other file and 2 directories. We will move this second file
> >>   # from one directory to the other later because it forces btrfs to commit its
> >>   # currently open transaction if we fsync the old parent directory. This is
> >>   # necessary to trigger the data loss bug that affected btrfs.
> >>   mkdir $SCRATCH_MNT/testdir_1
> >>   touch $SCRATCH_MNT/testdir_1/bar
> >>   mkdir $SCRATCH_MNT/testdir_2
> >>
> >>   # Make sure everything is durably persisted.
> >>   sync
> >>
> >>   # Write more 8Kb of data to our file.
> >>   $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io
> >>
> >>   # Move our 'bar' file into a new directory.
> >>   mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar
> >>
> >>   # Fsync our first directory. Because it had a file moved into some other
> >>   # directory, this made btrfs commit the currently open transaction. This is
> >>   # a condition necessary to trigger the data loss bug.
> >>   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1
> >>
> >>   # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
> >>   # data we wrote previously to be persisted and available if a crash happens.
> >>   # This did not happen with btrfs, because of the transaction commit that
> >>   # happened when we fsynced the parent directory.
> >>   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
> >>
> >>   # Simulate a crash/power loss.
> >>   _load_flakey_table $FLAKEY_DROP_WRITES
> >>   _unmount_flakey
> >>
> >>   _load_flakey_table $FLAKEY_ALLOW_WRITES
> >>   _mount_flakey
> >>
> >>   # Now check that all data we wrote before are available.
> >>   echo "File content after log replay:"
> >>   od -t x1 $SCRATCH_MNT/foo
> >>
> >>   status=0
> >>   exit
> >>
> >> The expected golden output for the test, which is what we get with this
> >> fix applied (or when running against ext3/4 and xfs), is:
> >>
> >>   wrote 8192/8192 bytes at offset 0
> >>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> >>   wrote 8192/8192 bytes at offset 8192
> >>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> >>   File content after log replay:
> >>   0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> >>   *
> >>   0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> >>   *
> >>   0040000
> >>
> >> Without this fix applied, the output shows the test file does not have
> >> the second 8Kb extent that we successfully fsynced:
> >>
> >>   wrote 8192/8192 bytes at offset 0
> >>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> >>   wrote 8192/8192 bytes at offset 8192
> >>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> >>   File content after log replay:
> >>   0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> >>   *
> >>   0020000
> >>
> >> So fix this by skipping the fsync only if we're doing a full sync and
> >> if the inode's last_trans is <= fs_info->last_trans_committed, or if
> >> the inode is already in the log. Also remove setting the inode's
> >> last_trans in btrfs_file_write_iter since it's useless/unreliable.
> >>
> >> A test case for xfstests will be sent soon.
> >>
> >> CC: <stable@vger.kernel.org>
> >> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
> >> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> >> ---
> >>
> >> V2: Removed dead assignment of inode->last_trans in btrfs_file_write_iter
> >>     (and the respective comment) since it's useless now. Added stable to
> >>     cc because it's a data loss fix.
> >>
> >>  fs/btrfs/file.c | 45 ++++++++++++++++++++++++++++-----------------
> >>  1 file changed, 28 insertions(+), 17 deletions(-)
> >>
> >> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> >> index 2bd72cd..b7334c9 100644
> >> --- a/fs/btrfs/file.c
> >> +++ b/fs/btrfs/file.c
> >> @@ -1811,22 +1811,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> >>       mutex_unlock(&inode->i_mutex);
> >>
> >>       /*
> >> -      * we want to make sure fsync finds this change
> >> -      * but we haven't joined a transaction running right now.
> >> -      *
> >> -      * Later on, someone is sure to update the inode and get the
> >> -      * real transid recorded.
> >> -      *
> >> -      * We set last_trans now to the fs_info generation + 1,
> >> -      * this will either be one more than the running transaction
> >> -      * or the generation used for the next transaction if there isn't
> >> -      * one running right now.
> >> -      *
> >>        * We also have to set last_sub_trans to the current log transid,
> >>        * otherwise subsequent syncs to a file that's been synced in this
> >>        * transaction will appear to have already occured.
> >>        */
> >> -     BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
> >
> > By thinking twice about it, how about setting ->last_trans with (-1ULL)?
> >
> > So the benefit is that if new writes have already finished its endio where
> > calling btrfs_set_inode_last_trans() to set ->last_trans with a transid
> > at that age, we may get a win for skipping log part if someone else has
> > updated ->last_trans_committed.
> >
> > By limiting it to 'full_sync' case we lose the above opportunity.
> 
> That still won't work.
> 
> Imagine the following the scenario:
> 
> 1) do 2 buffered writes to 2 different ranges of the inode - the
> inode's last_trans is set to (u64)-1;
> 
> 2) writepages is called against the first range only (either the VM
> called it due to memory pressure or a ranged fsync like msync for
> example);
> 
> 3) the ordered extent started by the previous writepages calls
> completes and sets inode->last_trans to N (N == current transaction
> id/generation);
> 
> 4) transaction N commits;
> 
> 5) fsync the file (either whole range or a range covering only the
> second dirty range) - this will bail out since last_trans ==
> last_trans_committed, not logging the second dirty range.

Good explanation, it's true.

Thanks,

-liubo

> 
> thanks
> 
> 
> >
> > Thanks,
> >
> > -liubo
> >>       BTRFS_I(inode)->last_sub_trans = root->log_transid;
> >>       if (num_written > 0) {
> >>               err = generic_write_sync(file, pos, num_written);
> >> @@ -1971,14 +1959,37 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> >>       }
> >>
> >>       /*
> >> -      * if the last transaction that changed this file was before
> >> -      * the current transaction, we can bail out now without any
> >> -      * syncing
> >> +      * If the last transaction that changed this file was before the current
> >> +      * transaction and we have the full sync flag set in our inode, we can
> >> +      * bail out now without any syncing.
> >> +      *
> >> +      * Note that we can't bail out if the full sync flag isn't set. This is
> >> +      * because when the full sync flag is set we start all ordered extents
> >> +      * and wait for them to fully complete - when they complete they update
> >> +      * the inode's last_trans field through:
> >> +      *
> >> +      *     btrfs_finish_ordered_io() ->
> >> +      *         btrfs_update_inode_fallback() ->
> >> +      *             btrfs_update_inode() ->
> >> +      *                 btrfs_set_inode_last_trans()
> >> +      *
> >> +      * So we are sure that last_trans is up to date and can do this check to
> >> +      * bail out safely. For the fast path, when the full sync flag is not
> >> +      * set in our inode, we can not do it because we start only our ordered
> >> +      * extents and don't wait for them to complete (that is when
> >> +      * btrfs_finish_ordered_io runs), so here at this point their last_trans
> >> +      * value might be less than or equals to fs_info->last_trans_committed,
> >> +      * and setting a speculative last_trans for an inode when a buffered
> >> +      * write is made (such as fs_info->generation + 1 for example) would not
> >> +      * be reliable since after setting the value and before fsync is called
> >> +      * any number of transactions can start and commit (transaction kthread
> >> +      * commits the current transaction periodically), and a transaction
> >> +      * commit does not start nor waits for ordered extents to complete.
> >>        */
> >>       smp_mb();
> >>       if (btrfs_inode_in_log(inode, root->fs_info->generation) ||
> >> -         BTRFS_I(inode)->last_trans <=
> >> -         root->fs_info->last_trans_committed) {
> >> +         (full_sync && BTRFS_I(inode)->last_trans <=
> >> +          root->fs_info->last_trans_committed)) {
> >>               BTRFS_I(inode)->last_trans = 0;
> >>
> >>               /*
> >> --
> >> 2.1.3
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> -- 
> Filipe David Manana,
> 
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v3] Btrfs: fix data loss in the fast fsync path
  2015-02-26 19:52 [PATCH] Btrfs: fix data loss in the fast fsync path Filipe Manana
  2015-02-28 15:04 ` Liu Bo
  2015-03-01  9:08 ` [PATCH v2] " Filipe Manana
@ 2015-03-01 20:36 ` Filipe Manana
  2 siblings, 0 replies; 7+ messages in thread
From: Filipe Manana @ 2015-03-01 20:36 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana, stable

When using the fast file fsync code path we can miss the fact that new
writes happened since the last file fsync and therefore return without
waiting for the IO to finish and write the new extents to the fsync log.

Here's an example scenario where the fsync will miss the fact that new
file data exists that wasn't yet durably persisted:

1. fs_info->last_trans_committed == N - 1 and current transaction is
   transaction N (fs_info->generation == N);

2. do a buffered write;

3. fsync our inode, this clears our inode's full sync flag, starts
   an ordered extent and waits for it to complete - when it completes
   at btrfs_finish_ordered_io(), the inode's last_trans is set to the
   value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
   btrfs_set_inode_last_trans);

4. transaction N is committed, so fs_info->last_trans_committed is now
   set to the value N and fs_info->generation remains with the value N;

5. do another buffered write, when this happens btrfs_file_write_iter
   sets our inode's last_trans to the value N + 1 (that is
   fs_info->generation + 1 == N + 1);

6. transaction N + 1 is started and fs_info->generation now has the
   value N + 1;

7. transaction N + 1 is committed, so fs_info->last_trans_committed
   is set to the value N + 1;

8. fsync our inode - because it doesn't have the full sync flag set,
   we only start the ordered extent, we don't wait for it to complete
   (only in a later phase) therefore its last_trans field has the
   value N + 1 set previously by btrfs_file_write_iter(), and so we
   have:

       inode->last_trans <= fs_info->last_trans_committed
           (N + 1)              (N + 1)

   Which made us not log the last buffered write and exit the fsync
   handler immediately, returning success (0) to user space and resulting
   in data loss after a crash.

This can actually be triggered deterministically and the following excerpt
from a testcase I made for xfstests triggers the issue. It moves a dummy
file across directories and then fsyncs the old parent directory - this
is just to trigger a transaction commit, so moving files around isn't
directly related to the issue but it was chosen because running 'sync' for
example does more than just committing the current transaction, as it
flushes/waits for all file data to be persisted. The issue can also happen
at random periods, since the transaction kthread periodicaly commits the
current transaction (about every 30 seconds by default).
The body of the test is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our main test file 'foo', the one we check for data loss.
  # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
  # bit from its flags (btrfs inode specific flags).
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
                  -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io

  # Now create one other file and 2 directories. We will move this second file
  # from one directory to the other later because it forces btrfs to commit its
  # currently open transaction if we fsync the old parent directory. This is
  # necessary to trigger the data loss bug that affected btrfs.
  mkdir $SCRATCH_MNT/testdir_1
  touch $SCRATCH_MNT/testdir_1/bar
  mkdir $SCRATCH_MNT/testdir_2

  # Make sure everything is durably persisted.
  sync

  # Write more 8Kb of data to our file.
  $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io

  # Move our 'bar' file into a new directory.
  mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar

  # Fsync our first directory. Because it had a file moved into some other
  # directory, this made btrfs commit the currently open transaction. This is
  # a condition necessary to trigger the data loss bug.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1

  # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
  # data we wrote previously to be persisted and available if a crash happens.
  # This did not happen with btrfs, because of the transaction commit that
  # happened when we fsynced the parent directory.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Now check that all data we wrote before are available.
  echo "File content after log replay:"
  od -t x1 $SCRATCH_MNT/foo

  status=0
  exit

The expected golden output for the test, which is what we get with this
fix applied (or when running against ext3/4 and xfs), is:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0040000

Without this fix applied, the output shows the test file does not have
the second 8Kb extent that we successfully fsynced:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000

So fix this by skipping the fsync only if we're doing a full sync and
if the inode's last_trans is <= fs_info->last_trans_committed, or if
the inode is already in the log. Also remove setting the inode's
last_trans in btrfs_file_write_iter since it's useless/unreliable.

Also because btrfs_file_write_iter no longer sets inode->last_trans to
fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
bail out if last_trans is 0, otherwise something as simple as the following
example wouldn't log the second write on the last fsync:

  1. write to file

  2. fsync file

  3. fsync file
       |--> btrfs_inode_in_log() returns true and it set last_trans to 0

  4. write to file
       |--> btrfs_file_write_iter() no longers sets last_trans, so it
            remained with a value of 0
  5. fsync
       |--> inode->last_trans == 0, so it bails out without logging the
            second write

A test case for xfstests will be sent soon.

CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Removed dead assignment of inode->last_trans in btrfs_file_write_iter
    (and the respective comment) since it's useless now. Added stable to
    cc because it's a data loss fix.

V3: After removing the statement that sets inode->last_trans to generation + 1
    in btrfs_file_write_iter() we would bail out on an fsync if before it
    there was a double fsync followed by a new write. Updated bail out logic
    to not set last_trans to 0 and updated commit message to explicitly talk
    about this case.

 fs/btrfs/file.c | 56 ++++++++++++++++++++++++++++----------------------------
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 2bd72cd..8f256b1 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1811,22 +1811,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	mutex_unlock(&inode->i_mutex);

 	/*
-	 * we want to make sure fsync finds this change
-	 * but we haven't joined a transaction running right now.
-	 *
-	 * Later on, someone is sure to update the inode and get the
-	 * real transid recorded.
-	 *
-	 * We set last_trans now to the fs_info generation + 1,
-	 * this will either be one more than the running transaction
-	 * or the generation used for the next transaction if there isn't
-	 * one running right now.
-	 *
 	 * We also have to set last_sub_trans to the current log transid,
 	 * otherwise subsequent syncs to a file that's been synced in this
 	 * transaction will appear to have already occured.
 	 */
-	BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
 	BTRFS_I(inode)->last_sub_trans = root->log_transid;
 	if (num_written > 0) {
 		err = generic_write_sync(file, pos, num_written);
@@ -1962,25 +1950,37 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	atomic_inc(&root->log_batch);

 	/*
-	 * check the transaction that last modified this inode
-	 * and see if its already been committed
-	 */
-	if (!BTRFS_I(inode)->last_trans) {
-		mutex_unlock(&inode->i_mutex);
-		goto out;
-	}
-
-	/*
-	 * if the last transaction that changed this file was before
-	 * the current transaction, we can bail out now without any
-	 * syncing
+	 * If the last transaction that changed this file was before the current
+	 * transaction and we have the full sync flag set in our inode, we can
+	 * bail out now without any syncing.
+	 *
+	 * Note that we can't bail out if the full sync flag isn't set. This is
+	 * because when the full sync flag is set we start all ordered extents
+	 * and wait for them to fully complete - when they complete they update
+	 * the inode's last_trans field through:
+	 *
+	 *     btrfs_finish_ordered_io() ->
+	 *         btrfs_update_inode_fallback() ->
+	 *             btrfs_update_inode() ->
+	 *                 btrfs_set_inode_last_trans()
+	 *
+	 * So we are sure that last_trans is up to date and can do this check to
+	 * bail out safely. For the fast path, when the full sync flag is not
+	 * set in our inode, we can not do it because we start only our ordered
+	 * extents and don't wait for them to complete (that is when
+	 * btrfs_finish_ordered_io runs), so here at this point their last_trans
+	 * value might be less than or equals to fs_info->last_trans_committed,
+	 * and setting a speculative last_trans for an inode when a buffered
+	 * write is made (such as fs_info->generation + 1 for example) would not
+	 * be reliable since after setting the value and before fsync is called
+	 * any number of transactions can start and commit (transaction kthread
+	 * commits the current transaction periodically), and a transaction
+	 * commit does not start nor waits for ordered extents to complete.
 	 */
 	smp_mb();
 	if (btrfs_inode_in_log(inode, root->fs_info->generation) ||
-	    BTRFS_I(inode)->last_trans <=
-	    root->fs_info->last_trans_committed) {
-		BTRFS_I(inode)->last_trans = 0;
-
+	    (full_sync && BTRFS_I(inode)->last_trans <=
+	     root->fs_info->last_trans_committed)) {
 		/*
 		 * We'v had everything committed since the last time we were
 		 * modified so clear this flag in case it was set for whatever
-- 
2.1.3

^ permalink raw reply related	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2015-03-03 13:48 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-02-26 19:52 [PATCH] Btrfs: fix data loss in the fast fsync path Filipe Manana
2015-02-28 15:04 ` Liu Bo
2015-03-01  9:08 ` [PATCH v2] " Filipe Manana
2015-03-03  0:41   ` Liu Bo
2015-03-03 11:15     ` Filipe David Manana
2015-03-03 13:31       ` Liu Bo
2015-03-01 20:36 ` [PATCH v3] " Filipe Manana

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).