From: Wang Yugui <wangyugui@e16-tech.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 1/7] btrfs: fix lost prealloc extents beyond eof after full fsync
Date: Mon, 21 Feb 2022 18:11:42 +0800 [thread overview]
Message-ID: <20220221181141.DD77.409509F4@e16-tech.com> (raw)
In-Reply-To: <0b6139a324f2cd5b470d58c45617fc29ab893f12.1645098951.git.fdmanana@suse.com>
Hi,
xfstest btrfs/261 PASSED 40 times without this patch.
so there is some problem in xfstest btrfs/261?
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/02/21
> From: Filipe Manana <fdmanana@suse.com>
>
> When doing a full fsync, if we have prealloc extents beyond (or at) eof,
> and the leaves that contain them were not modified in the current
> transaction, we end up not logging them. This results in losing those
> extents when we replay the log after a power failure, since the inode is
> truncated to the current value of the logged i_size.
>
> Just like for the fast fsync path, we need to always log all prealloc
> extents starting at or beyond i_size. The fast fsync case was fixed in
> commit 471d557afed155 ("Btrfs: fix loss of prealloc extents past i_size
> after fsync log replay") but it missed the full fsync path. The problem
> exists since the very early days, when the log tree was added by
> commit e02119d5a7b439 ("Btrfs: Add a write ahead tree log to optimize
> synchronous operations").
>
> Example reproducer:
>
> $ mkfs.btrfs -f /dev/sdc
> $ mount /dev/sdc /mnt
>
> # Create our test file with many file extent items, so that they span
> # several leaves of metadata, even if the node/page size is 64K. Use
> # direct IO and not fsync/O_SYNC because it's both faster and it avoids
> # clearing the full sync flag from the inode - we want the fsync below
> # to trigger the slow full sync code path.
> $ xfs_io -f -d -c "pwrite -b 4K 0 16M" /mnt/foo
>
> # Now add two preallocated extents to our file without extending the
> # file's size. One right at i_size, and another further beyond, leaving
> # a gap between the two prealloc extents.
> $ xfs_io -c "falloc -k 16M 1M" /mnt/foo
> $ xfs_io -c "falloc -k 20M 1M" /mnt/foo
>
> # Make sure everything is durably persisted and the transaction is
> # committed. This makes all created extents to have a generation lower
> # than the generation of the transaction used by the next write and
> # fsync.
> sync
>
> # Now overwrite only the first extent, which will result in modifying
> # only the first leaf of metadata for our inode. Then fsync it. This
> # fsync will use the slow code path (inode full sync bit is set) because
> # it's the first fsync since the inode was created/loaded.
> $ xfs_io -c "pwrite 0 4K" -c "fsync" /mnt/foo
>
> # Extent list before power failure.
> $ xfs_io -c "fiemap -v" /mnt/foo
> /mnt/foo:
> EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
> 0: [0..7]: 2178048..2178055 8 0x0
> 1: [8..16383]: 26632..43007 16376 0x0
> 2: [16384..32767]: 2156544..2172927 16384 0x0
> 3: [32768..34815]: 2172928..2174975 2048 0x800
> 4: [34816..40959]: hole 6144
> 5: [40960..43007]: 2174976..2177023 2048 0x801
>
> <power fail>
>
> # Mount fs again, trigger log replay.
> $ mount /dev/sdc /mnt
>
> # Extent list after power failure and log replay.
> $ xfs_io -c "fiemap -v" /mnt/foo
> /mnt/foo:
> EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
> 0: [0..7]: 2178048..2178055 8 0x0
> 1: [8..16383]: 26632..43007 16376 0x0
> 2: [16384..32767]: 2156544..2172927 16384 0x1
>
> # The prealloc extents at file offsets 16M and 20M are missing.
>
> So fix this by calling btrfs_log_prealloc_extents() when we are doing a
> full fsync, so that we always log all prealloc extents beyond eof.
>
> A test case for fstests will follow soon.
>
> CC: stable@vger.kernel.org # 4.19+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> fs/btrfs/tree-log.c | 43 +++++++++++++++++++++++++++++++------------
> 1 file changed, 31 insertions(+), 12 deletions(-)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index a483337e8f41..71a5a961fef7 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -4732,7 +4732,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
>
> /*
> * Log all prealloc extents beyond the inode's i_size to make sure we do not
> - * lose them after doing a fast fsync and replaying the log. We scan the
> + * lose them after doing a full/fast fsync and replaying the log. We scan the
> * subvolume's root instead of iterating the inode's extent map tree because
> * otherwise we can log incorrect extent items based on extent map conversion.
> * That can happen due to the fact that extent maps are merged when they
> @@ -5510,6 +5510,7 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> struct btrfs_log_ctx *ctx,
> bool *need_log_inode_item)
> {
> + const u64 i_size = i_size_read(&inode->vfs_inode);
> struct btrfs_root *root = inode->root;
> int ins_start_slot = 0;
> int ins_nr = 0;
> @@ -5530,13 +5531,21 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> if (min_key->type > max_key->type)
> break;
>
> - if (min_key->type == BTRFS_INODE_ITEM_KEY)
> + if (min_key->type == BTRFS_INODE_ITEM_KEY) {
> *need_log_inode_item = false;
> -
> - if ((min_key->type == BTRFS_INODE_REF_KEY ||
> - min_key->type == BTRFS_INODE_EXTREF_KEY) &&
> - inode->generation == trans->transid &&
> - !recursive_logging) {
> + } else if (min_key->type == BTRFS_EXTENT_DATA_KEY &&
> + min_key->offset >= i_size) {
> + /*
> + * Extents at and beyond eof are logged with
> + * btrfs_log_prealloc_extents().
> + * Only regular files have BTRFS_EXTENT_DATA_KEY keys,
> + * and no keys greater than that, so bail out.
> + */
> + break;
> + } else if ((min_key->type == BTRFS_INODE_REF_KEY ||
> + min_key->type == BTRFS_INODE_EXTREF_KEY) &&
> + inode->generation == trans->transid &&
> + !recursive_logging) {
> u64 other_ino = 0;
> u64 other_parent = 0;
>
> @@ -5567,10 +5576,8 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> btrfs_release_path(path);
> goto next_key;
> }
> - }
> -
> - /* Skip xattrs, we log them later with btrfs_log_all_xattrs() */
> - if (min_key->type == BTRFS_XATTR_ITEM_KEY) {
> + } else if (min_key->type == BTRFS_XATTR_ITEM_KEY) {
> + /* Skip xattrs, logged later with btrfs_log_all_xattrs() */
> if (ins_nr == 0)
> goto next_slot;
> ret = copy_items(trans, inode, dst_path, path,
> @@ -5623,9 +5630,21 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> break;
> }
> }
> - if (ins_nr)
> + if (ins_nr) {
> ret = copy_items(trans, inode, dst_path, path, ins_start_slot,
> ins_nr, inode_only, logged_isize);
> + if (ret)
> + return ret;
> + }
> +
> + if (inode_only == LOG_INODE_ALL && S_ISREG(inode->vfs_inode.i_mode)) {
> + /*
> + * Release the path because otherwise we might attempt to double
> + * lock the same leaf with btrfs_log_prealloc_extents() below.
> + */
> + btrfs_release_path(path);
> + ret = btrfs_log_prealloc_extents(trans, inode, dst_path);
> + }
>
> return ret;
> }
> --
> 2.33.0
next prev parent reply other threads:[~2022-02-21 10:49 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-02-17 12:12 [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path fdmanana
2022-02-17 12:12 ` [PATCH 1/7] btrfs: fix lost prealloc extents beyond eof after full fsync fdmanana
2022-02-21 10:11 ` Wang Yugui [this message]
2022-02-21 10:41 ` Filipe Manana
2022-02-21 23:51 ` Wang Yugui
2022-02-22 10:29 ` Filipe Manana
2022-02-17 12:12 ` [PATCH 2/7] btrfs: stop copying old file extents when doing a " fdmanana
2022-02-17 12:12 ` [PATCH 3/7] btrfs: hold on to less memory when logging checksums during " fdmanana
2022-02-17 12:12 ` [PATCH 4/7] btrfs: voluntarily relinquish cpu when doing a " fdmanana
2022-02-17 12:12 ` [PATCH 5/7] btrfs: reset last_reflink_trans after fsyncing inode fdmanana
2022-02-17 12:12 ` [PATCH 6/7] btrfs: fix unexpected error path when reflinking an inline extent fdmanana
2022-02-17 12:12 ` [PATCH 7/7] btrfs: deal with unexpected extent type during reflinking fdmanana
2022-02-21 16:15 ` [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path David Sterba
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220221181141.DD77.409509F4@e16-tech.com \
--to=wangyugui@e16-tech.com \
--cc=fdmanana@kernel.org \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox