Linux Btrfs filesystem development
 help / color / mirror / Atom feed
From: Wang Yugui <wangyugui@e16-tech.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 1/7] btrfs: fix lost prealloc extents beyond eof after full fsync
Date: Mon, 21 Feb 2022 18:11:42 +0800	[thread overview]
Message-ID: <20220221181141.DD77.409509F4@e16-tech.com> (raw)
In-Reply-To: <0b6139a324f2cd5b470d58c45617fc29ab893f12.1645098951.git.fdmanana@suse.com>

Hi,

xfstest btrfs/261 PASSED 40 times without this patch.

so there is some problem in xfstest btrfs/261?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/02/21

> From: Filipe Manana <fdmanana@suse.com>
> 
> When doing a full fsync, if we have prealloc extents beyond (or at) eof,
> and the leaves that contain them were not modified in the current
> transaction, we end up not logging them. This results in losing those
> extents when we replay the log after a power failure, since the inode is
> truncated to the current value of the logged i_size.
> 
> Just like for the fast fsync path, we need to always log all prealloc
> extents starting at or beyond i_size. The fast fsync case was fixed in
> commit 471d557afed155 ("Btrfs: fix loss of prealloc extents past i_size
> after fsync log replay") but it missed the full fsync path. The problem
> exists since the very early days, when the log tree was added by
> commit e02119d5a7b439 ("Btrfs: Add a write ahead tree log to optimize
> synchronous operations").
> 
> Example reproducer:
> 
>   $ mkfs.btrfs -f /dev/sdc
>   $ mount /dev/sdc /mnt
> 
>   # Create our test file with many file extent items, so that they span
>   # several leaves of metadata, even if the node/page size is 64K. Use
>   # direct IO and not fsync/O_SYNC because it's both faster and it avoids
>   # clearing the full sync flag from the inode - we want the fsync below
>   # to trigger the slow full sync code path.
>   $ xfs_io -f -d -c "pwrite -b 4K 0 16M" /mnt/foo
> 
>   # Now add two preallocated extents to our file without extending the
>   # file's size. One right at i_size, and another further beyond, leaving
>   # a gap between the two prealloc extents.
>   $ xfs_io -c "falloc -k 16M 1M" /mnt/foo
>   $ xfs_io -c "falloc -k 20M 1M" /mnt/foo
> 
>   # Make sure everything is durably persisted and the transaction is
>   # committed. This makes all created extents to have a generation lower
>   # than the generation of the transaction used by the next write and
>   # fsync.
>   sync
> 
>   # Now overwrite only the first extent, which will result in modifying
>   # only the first leaf of metadata for our inode. Then fsync it. This
>   # fsync will use the slow code path (inode full sync bit is set) because
>   # it's the first fsync since the inode was created/loaded.
>   $ xfs_io -c "pwrite 0 4K" -c "fsync" /mnt/foo
> 
>   # Extent list before power failure.
>   $ xfs_io -c "fiemap -v" /mnt/foo
>   /mnt/foo:
>    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>      0: [0..7]:          2178048..2178055     8   0x0
>      1: [8..16383]:      26632..43007     16376   0x0
>      2: [16384..32767]:  2156544..2172927 16384   0x0
>      3: [32768..34815]:  2172928..2174975  2048 0x800
>      4: [34816..40959]:  hole              6144
>      5: [40960..43007]:  2174976..2177023  2048 0x801
> 
>   <power fail>
> 
>   # Mount fs again, trigger log replay.
>   $ mount /dev/sdc /mnt
> 
>   # Extent list after power failure and log replay.
>   $ xfs_io -c "fiemap -v" /mnt/foo
>   /mnt/foo:
>    EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>      0: [0..7]:          2178048..2178055     8   0x0
>      1: [8..16383]:      26632..43007     16376   0x0
>      2: [16384..32767]:  2156544..2172927 16384   0x1
> 
>   # The prealloc extents at file offsets 16M and 20M are missing.
> 
> So fix this by calling btrfs_log_prealloc_extents() when we are doing a
> full fsync, so that we always log all prealloc extents beyond eof.
> 
> A test case for fstests will follow soon.
> 
> CC: stable@vger.kernel.org # 4.19+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/tree-log.c | 43 +++++++++++++++++++++++++++++++------------
>  1 file changed, 31 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index a483337e8f41..71a5a961fef7 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -4732,7 +4732,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
>  
>  /*
>   * Log all prealloc extents beyond the inode's i_size to make sure we do not
> - * lose them after doing a fast fsync and replaying the log. We scan the
> + * lose them after doing a full/fast fsync and replaying the log. We scan the
>   * subvolume's root instead of iterating the inode's extent map tree because
>   * otherwise we can log incorrect extent items based on extent map conversion.
>   * That can happen due to the fact that extent maps are merged when they
> @@ -5510,6 +5510,7 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
>  				   struct btrfs_log_ctx *ctx,
>  				   bool *need_log_inode_item)
>  {
> +	const u64 i_size = i_size_read(&inode->vfs_inode);
>  	struct btrfs_root *root = inode->root;
>  	int ins_start_slot = 0;
>  	int ins_nr = 0;
> @@ -5530,13 +5531,21 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
>  		if (min_key->type > max_key->type)
>  			break;
>  
> -		if (min_key->type == BTRFS_INODE_ITEM_KEY)
> +		if (min_key->type == BTRFS_INODE_ITEM_KEY) {
>  			*need_log_inode_item = false;
> -
> -		if ((min_key->type == BTRFS_INODE_REF_KEY ||
> -		     min_key->type == BTRFS_INODE_EXTREF_KEY) &&
> -		    inode->generation == trans->transid &&
> -		    !recursive_logging) {
> +		} else if (min_key->type == BTRFS_EXTENT_DATA_KEY &&
> +			   min_key->offset >= i_size) {
> +			/*
> +			 * Extents at and beyond eof are logged with
> +			 * btrfs_log_prealloc_extents().
> +			 * Only regular files have BTRFS_EXTENT_DATA_KEY keys,
> +			 * and no keys greater than that, so bail out.
> +			 */
> +			break;
> +		} else if ((min_key->type == BTRFS_INODE_REF_KEY ||
> +			    min_key->type == BTRFS_INODE_EXTREF_KEY) &&
> +			   inode->generation == trans->transid &&
> +			   !recursive_logging) {
>  			u64 other_ino = 0;
>  			u64 other_parent = 0;
>  
> @@ -5567,10 +5576,8 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
>  				btrfs_release_path(path);
>  				goto next_key;
>  			}
> -		}
> -
> -		/* Skip xattrs, we log them later with btrfs_log_all_xattrs() */
> -		if (min_key->type == BTRFS_XATTR_ITEM_KEY) {
> +		} else if (min_key->type == BTRFS_XATTR_ITEM_KEY) {
> +			/* Skip xattrs, logged later with btrfs_log_all_xattrs() */
>  			if (ins_nr == 0)
>  				goto next_slot;
>  			ret = copy_items(trans, inode, dst_path, path,
> @@ -5623,9 +5630,21 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
>  			break;
>  		}
>  	}
> -	if (ins_nr)
> +	if (ins_nr) {
>  		ret = copy_items(trans, inode, dst_path, path, ins_start_slot,
>  				 ins_nr, inode_only, logged_isize);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (inode_only == LOG_INODE_ALL && S_ISREG(inode->vfs_inode.i_mode)) {
> +		/*
> +		 * Release the path because otherwise we might attempt to double
> +		 * lock the same leaf with btrfs_log_prealloc_extents() below.
> +		 */
> +		btrfs_release_path(path);
> +		ret = btrfs_log_prealloc_extents(trans, inode, dst_path);
> +	}
>  
>  	return ret;
>  }
> -- 
> 2.33.0



  reply	other threads:[~2022-02-21 10:49 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-17 12:12 [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path fdmanana
2022-02-17 12:12 ` [PATCH 1/7] btrfs: fix lost prealloc extents beyond eof after full fsync fdmanana
2022-02-21 10:11   ` Wang Yugui [this message]
2022-02-21 10:41     ` Filipe Manana
2022-02-21 23:51       ` Wang Yugui
2022-02-22 10:29         ` Filipe Manana
2022-02-17 12:12 ` [PATCH 2/7] btrfs: stop copying old file extents when doing a " fdmanana
2022-02-17 12:12 ` [PATCH 3/7] btrfs: hold on to less memory when logging checksums during " fdmanana
2022-02-17 12:12 ` [PATCH 4/7] btrfs: voluntarily relinquish cpu when doing a " fdmanana
2022-02-17 12:12 ` [PATCH 5/7] btrfs: reset last_reflink_trans after fsyncing inode fdmanana
2022-02-17 12:12 ` [PATCH 6/7] btrfs: fix unexpected error path when reflinking an inline extent fdmanana
2022-02-17 12:12 ` [PATCH 7/7] btrfs: deal with unexpected extent type during reflinking fdmanana
2022-02-21 16:15 ` [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220221181141.DD77.409509F4@e16-tech.com \
    --to=wangyugui@e16-tech.com \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox