All of lore.kernel.org
 help / color / mirror / Atom feed
From: Liu Bo <bo.li.liu@oracle.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org, Filipe Manana <fdmanana@suse.com>,
	stable@vger.kernel.org
Subject: Re: [PATCH] Btrfs: fix fsync data loss after append write
Date: Thu, 18 Jun 2015 18:20:34 +0800	[thread overview]
Message-ID: <20150618102032.GC8530@localhost.localdomain> (raw)
In-Reply-To: <1434541763-23753-1-git-send-email-fdmanana@kernel.org>

On Wed, Jun 17, 2015 at 12:49:23PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> If we do an append write to a file (which increases its inode's i_size)
> that does not have the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode,
> and the previous transaction added a new hard link to the file, which sets
> the flag BTRFS_INODE_COPY_EVERYTHING in the file's inode, and then fsync
> the file, the inode's new i_size isn't logged. This has the consequence
> that after the fsync log is replayed, the file size remains what it was
> before the append write operation, which means users/applications will
> not be able to read the data that was successsfully fsync'ed before.
> 
> This happens because neither the inode item nor the delayed inode get
> their i_size updated when the append write is made - doing so would
> require starting a transaction in the buffered write path, something that
> we do not do intentionally for performance reasons.
> 
> Fix this by making sure that when the flag BTRFS_INODE_COPY_EVERYTHING is
> set the inode is logged with its current i_size (log the in-memory inode
> into the log tree).

Looks good to me.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thanks,

-liubo
> 
> This issue is not a recent regression and is easy to reproduce with the
> following test case for fstests:
> 
>   seq=`basename $0`
>   seqres=$RESULT_DIR/$seq
>   echo "QA output created by $seq"
> 
>   here=`pwd`
>   tmp=/tmp/$$
>   status=1	# failure is the default!
> 
>   _cleanup()
>   {
>           _cleanup_flakey
>           rm -f $tmp.*
>   }
>   trap "_cleanup; exit \$status" 0 1 2 3 15
> 
>   # get standard environment, filters and checks
>   . ./common/rc
>   . ./common/filter
>   . ./common/dmflakey
> 
>   # real QA test starts here
>   _supported_fs generic
>   _supported_os Linux
>   _need_to_be_root
>   _require_scratch
>   _require_dm_flakey
>   _require_metadata_journaling $SCRATCH_DEV
> 
>   _crash_and_mount()
>   {
>           # Simulate a crash/power loss.
>           _load_flakey_table $FLAKEY_DROP_WRITES
>           _unmount_flakey
>           # Allow writes again and mount. This makes the fs replay its fsync log.
>           _load_flakey_table $FLAKEY_ALLOW_WRITES
>           _mount_flakey
>   }
> 
>   rm -f $seqres.full
> 
>   _scratch_mkfs >> $seqres.full 2>&1
>   _init_flakey
>   _mount_flakey
> 
>   # Create the test file with some initial data and then fsync it.
>   # The fsync here is only needed to trigger the issue in btrfs, as it causes the
>   # the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
>   $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
>                   -c "fsync" \
>                   $SCRATCH_MNT/foo | _filter_xfs_io
>   sync
> 
>   # Add a hard link to our file.
>   # On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
>   # which is a necessary condition to trigger the issue.
>   ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar
> 
>   # Sync the filesystem to force a commit of the current btrfs transaction, this
>   # is a necessary condition to trigger the bug on btrfs.
>   sync
> 
>   # Now append more data to our file, increasing its size, and fsync the file.
>   # In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
>   # write path did not update the inode item in the btree nor the delayed inode
>   # item (in memory struture) in the current transaction (created by the fsync
>   # handler), the fsync did not record the inode's new i_size in the fsync
>   # log/journal. This made the data unavailable after the fsync log/journal is
>   # replayed.
>   $XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \
>                -c "fsync" \
>                $SCRATCH_MNT/foo | _filter_xfs_io
> 
>   echo "File content after fsync and before crash:"
>   od -t x1 $SCRATCH_MNT/foo
> 
>   _crash_and_mount
> 
>   echo "File content after crash and log replay:"
>   od -t x1 $SCRATCH_MNT/foo
> 
>   status=0
>   exit
> 
> The expected file output before and after the crash/power failure expects the
> appended data to be available, which is:
> 
>   0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>   *
>   0100000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>   *
>   0200000
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/tree-log.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index d049683..4920fce 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -4161,6 +4161,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
>  	u64 ino = btrfs_ino(inode);
>  	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
>  	u64 logged_isize = 0;
> +	bool need_log_inode_item = true;
>  
>  	path = btrfs_alloc_path();
>  	if (!path)
> @@ -4269,11 +4270,6 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
>  		} else {
>  			if (inode_only == LOG_INODE_ALL)
>  				fast_search = true;
> -			ret = log_inode_item(trans, log, dst_path, inode);
> -			if (ret) {
> -				err = ret;
> -				goto out_unlock;
> -			}
>  			goto log_extents;
>  		}
>  
> @@ -4296,6 +4292,9 @@ again:
>  		if (min_key.type > max_key.type)
>  			break;
>  
> +		if (min_key.type == BTRFS_INODE_ITEM_KEY)
> +			need_log_inode_item = false;
> +
>  		src = path->nodes[0];
>  		if (ins_nr && ins_start_slot + ins_nr == path->slots[0]) {
>  			ins_nr++;
> @@ -4366,6 +4365,11 @@ next_slot:
>  log_extents:
>  	btrfs_release_path(path);
>  	btrfs_release_path(dst_path);
> +	if (need_log_inode_item) {
> +		err = log_inode_item(trans, log, dst_path, inode);
> +		if (err)
> +			goto out_unlock;
> +	}
>  	if (fast_search) {
>  		/*
>  		 * Some ordered extents started by fsync might have completed
> -- 
> 2.1.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

      reply	other threads:[~2015-06-18 10:20 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-06-17 11:49 [PATCH] Btrfs: fix fsync data loss after append write fdmanana
2015-06-18 10:20 ` Liu Bo [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150618102032.GC8530@localhost.localdomain \
    --to=bo.li.liu@oracle.com \
    --cc=fdmanana@kernel.org \
    --cc=fdmanana@suse.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.