From: Wang Yugui <wangyugui@e16-tech.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH 1/7] btrfs: fix lost prealloc extents beyond eof after full fsync
Date: Tue, 22 Feb 2022 07:51:55 +0800 [thread overview]
Message-ID: <20220222075154.1347.409509F4@e16-tech.com> (raw)
In-Reply-To: <CAL3q7H6NZMSAMxEidQsZWibyrmyEFMM4tGX_RDiRbVs1Gm8XgQ@mail.gmail.com>
Hi,
> >
> > xfstest btrfs/261 PASSED 40 times without this patch.
> >
> > so there is some problem in xfstest btrfs/261?
>
> Not that I can see, it always fails for me on an unpatched kernel:
xfstest btrfs/261 always fails on unpatched kernel 5.17.0-rc4 here too.
Thanks a lot.
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/02/22
>
> $ ./check btrfs/261
> FSTYP -- btrfs
> PLATFORM -- Linux/x86_64 debian9 5.17.0-rc4-btrfs-next-112 #1 SMP
> PREEMPT Wed Feb 16 11:20:50 WET 2022
> MKFS_OPTIONS -- /dev/sdc
> MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
>
> btrfs/261 8s ... - output mismatch (see
> /home/fdmanana/git/hub/xfstests/results//btrfs/261.out.bad)
> --- tests/btrfs/261.out 2022-02-15 11:53:34.273201027 +0000
> +++ /home/fdmanana/git/hub/xfstests/results//btrfs/261.out.bad
> 2022-02-21 10:34:59.781769394 +0000
> @@ -5,6 +5,3 @@
> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> List of extents after power failure:
> 0: [0..32767]: data
> -1: [32768..34815]: unwritten
> -2: [34816..40959]: hole
> -3: [40960..43007]: unwritten
> ...
> (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/261.out
> /home/fdmanana/git/hub/xfstests/results//btrfs/261.out.bad' to see
> the entire diff)
> Ran: btrfs/261
> Failures: btrfs/261
> Failed 1 of 1 tests
>
> And should fail regardless of the page size, which is the only factor
> I think could make a difference.
>
> To figure it out, change the test like this:
>
> diff --git a/tests/btrfs/261 b/tests/btrfs/261
> index 8275e6a5..540a2de2 100755
> --- a/tests/btrfs/261
> +++ b/tests/btrfs/261
> @@ -65,7 +65,15 @@ $XFS_IO_PROG -c "pwrite 0 4K" -c "fsync"
> $SCRATCH_MNT/foo | _filter_xfs_io
>
> # Simulate a power failure and then mount again the filesystem to
> replay the log
> # tree.
> -_flakey_drop_and_remount
> +#_flakey_drop_and_remount
> +_load_flakey_table $FLAKEY_DROP_WRITES
> +_unmount_flakey
> +
> +$BTRFS_UTIL_PROG inspect-internal dump-tree $SCRATCH_DEV >>$seqres.full 2>&1
> +
> +_load_flakey_table $FLAKEY_ALLOW_WRITES
> +_mount_flakey
> +
>
> # After the power failure we expect that the preallocated extents, beyond the
> # inode's i_size, still exist.
>
> Run it and then provide the contents of the file results/btrfs/261.full
>
> >
> > Best Regards
> > Wang Yugui (wangyugui@e16-tech.com)
> > 2022/02/21
> >
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > When doing a full fsync, if we have prealloc extents beyond (or at) eof,
> > > and the leaves that contain them were not modified in the current
> > > transaction, we end up not logging them. This results in losing those
> > > extents when we replay the log after a power failure, since the inode is
> > > truncated to the current value of the logged i_size.
> > >
> > > Just like for the fast fsync path, we need to always log all prealloc
> > > extents starting at or beyond i_size. The fast fsync case was fixed in
> > > commit 471d557afed155 ("Btrfs: fix loss of prealloc extents past i_size
> > > after fsync log replay") but it missed the full fsync path. The problem
> > > exists since the very early days, when the log tree was added by
> > > commit e02119d5a7b439 ("Btrfs: Add a write ahead tree log to optimize
> > > synchronous operations").
> > >
> > > Example reproducer:
> > >
> > > $ mkfs.btrfs -f /dev/sdc
> > > $ mount /dev/sdc /mnt
> > >
> > > # Create our test file with many file extent items, so that they span
> > > # several leaves of metadata, even if the node/page size is 64K. Use
> > > # direct IO and not fsync/O_SYNC because it's both faster and it avoids
> > > # clearing the full sync flag from the inode - we want the fsync below
> > > # to trigger the slow full sync code path.
> > > $ xfs_io -f -d -c "pwrite -b 4K 0 16M" /mnt/foo
> > >
> > > # Now add two preallocated extents to our file without extending the
> > > # file's size. One right at i_size, and another further beyond, leaving
> > > # a gap between the two prealloc extents.
> > > $ xfs_io -c "falloc -k 16M 1M" /mnt/foo
> > > $ xfs_io -c "falloc -k 20M 1M" /mnt/foo
> > >
> > > # Make sure everything is durably persisted and the transaction is
> > > # committed. This makes all created extents to have a generation lower
> > > # than the generation of the transaction used by the next write and
> > > # fsync.
> > > sync
> > >
> > > # Now overwrite only the first extent, which will result in modifying
> > > # only the first leaf of metadata for our inode. Then fsync it. This
> > > # fsync will use the slow code path (inode full sync bit is set) because
> > > # it's the first fsync since the inode was created/loaded.
> > > $ xfs_io -c "pwrite 0 4K" -c "fsync" /mnt/foo
> > >
> > > # Extent list before power failure.
> > > $ xfs_io -c "fiemap -v" /mnt/foo
> > > /mnt/foo:
> > > EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
> > > 0: [0..7]: 2178048..2178055 8 0x0
> > > 1: [8..16383]: 26632..43007 16376 0x0
> > > 2: [16384..32767]: 2156544..2172927 16384 0x0
> > > 3: [32768..34815]: 2172928..2174975 2048 0x800
> > > 4: [34816..40959]: hole 6144
> > > 5: [40960..43007]: 2174976..2177023 2048 0x801
> > >
> > > <power fail>
> > >
> > > # Mount fs again, trigger log replay.
> > > $ mount /dev/sdc /mnt
> > >
> > > # Extent list after power failure and log replay.
> > > $ xfs_io -c "fiemap -v" /mnt/foo
> > > /mnt/foo:
> > > EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
> > > 0: [0..7]: 2178048..2178055 8 0x0
> > > 1: [8..16383]: 26632..43007 16376 0x0
> > > 2: [16384..32767]: 2156544..2172927 16384 0x1
> > >
> > > # The prealloc extents at file offsets 16M and 20M are missing.
> > >
> > > So fix this by calling btrfs_log_prealloc_extents() when we are doing a
> > > full fsync, so that we always log all prealloc extents beyond eof.
> > >
> > > A test case for fstests will follow soon.
> > >
> > > CC: stable@vger.kernel.org # 4.19+
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > ---
> > > fs/btrfs/tree-log.c | 43 +++++++++++++++++++++++++++++++------------
> > > 1 file changed, 31 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > > index a483337e8f41..71a5a961fef7 100644
> > > --- a/fs/btrfs/tree-log.c
> > > +++ b/fs/btrfs/tree-log.c
> > > @@ -4732,7 +4732,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
> > >
> > > /*
> > > * Log all prealloc extents beyond the inode's i_size to make sure we do not
> > > - * lose them after doing a fast fsync and replaying the log. We scan the
> > > + * lose them after doing a full/fast fsync and replaying the log. We scan the
> > > * subvolume's root instead of iterating the inode's extent map tree because
> > > * otherwise we can log incorrect extent items based on extent map conversion.
> > > * That can happen due to the fact that extent maps are merged when they
> > > @@ -5510,6 +5510,7 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> > > struct btrfs_log_ctx *ctx,
> > > bool *need_log_inode_item)
> > > {
> > > + const u64 i_size = i_size_read(&inode->vfs_inode);
> > > struct btrfs_root *root = inode->root;
> > > int ins_start_slot = 0;
> > > int ins_nr = 0;
> > > @@ -5530,13 +5531,21 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> > > if (min_key->type > max_key->type)
> > > break;
> > >
> > > - if (min_key->type == BTRFS_INODE_ITEM_KEY)
> > > + if (min_key->type == BTRFS_INODE_ITEM_KEY) {
> > > *need_log_inode_item = false;
> > > -
> > > - if ((min_key->type == BTRFS_INODE_REF_KEY ||
> > > - min_key->type == BTRFS_INODE_EXTREF_KEY) &&
> > > - inode->generation == trans->transid &&
> > > - !recursive_logging) {
> > > + } else if (min_key->type == BTRFS_EXTENT_DATA_KEY &&
> > > + min_key->offset >= i_size) {
> > > + /*
> > > + * Extents at and beyond eof are logged with
> > > + * btrfs_log_prealloc_extents().
> > > + * Only regular files have BTRFS_EXTENT_DATA_KEY keys,
> > > + * and no keys greater than that, so bail out.
> > > + */
> > > + break;
> > > + } else if ((min_key->type == BTRFS_INODE_REF_KEY ||
> > > + min_key->type == BTRFS_INODE_EXTREF_KEY) &&
> > > + inode->generation == trans->transid &&
> > > + !recursive_logging) {
> > > u64 other_ino = 0;
> > > u64 other_parent = 0;
> > >
> > > @@ -5567,10 +5576,8 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> > > btrfs_release_path(path);
> > > goto next_key;
> > > }
> > > - }
> > > -
> > > - /* Skip xattrs, we log them later with btrfs_log_all_xattrs() */
> > > - if (min_key->type == BTRFS_XATTR_ITEM_KEY) {
> > > + } else if (min_key->type == BTRFS_XATTR_ITEM_KEY) {
> > > + /* Skip xattrs, logged later with btrfs_log_all_xattrs() */
> > > if (ins_nr == 0)
> > > goto next_slot;
> > > ret = copy_items(trans, inode, dst_path, path,
> > > @@ -5623,9 +5630,21 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
> > > break;
> > > }
> > > }
> > > - if (ins_nr)
> > > + if (ins_nr) {
> > > ret = copy_items(trans, inode, dst_path, path, ins_start_slot,
> > > ins_nr, inode_only, logged_isize);
> > > + if (ret)
> > > + return ret;
> > > + }
> > > +
> > > + if (inode_only == LOG_INODE_ALL && S_ISREG(inode->vfs_inode.i_mode)) {
> > > + /*
> > > + * Release the path because otherwise we might attempt to double
> > > + * lock the same leaf with btrfs_log_prealloc_extents() below.
> > > + */
> > > + btrfs_release_path(path);
> > > + ret = btrfs_log_prealloc_extents(trans, inode, dst_path);
> > > + }
> > >
> > > return ret;
> > > }
> > > --
> > > 2.33.0
> >
> >
next prev parent reply other threads:[~2022-02-21 23:51 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-02-17 12:12 [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path fdmanana
2022-02-17 12:12 ` [PATCH 1/7] btrfs: fix lost prealloc extents beyond eof after full fsync fdmanana
2022-02-21 10:11 ` Wang Yugui
2022-02-21 10:41 ` Filipe Manana
2022-02-21 23:51 ` Wang Yugui [this message]
2022-02-22 10:29 ` Filipe Manana
2022-02-17 12:12 ` [PATCH 2/7] btrfs: stop copying old file extents when doing a " fdmanana
2022-02-17 12:12 ` [PATCH 3/7] btrfs: hold on to less memory when logging checksums during " fdmanana
2022-02-17 12:12 ` [PATCH 4/7] btrfs: voluntarily relinquish cpu when doing a " fdmanana
2022-02-17 12:12 ` [PATCH 5/7] btrfs: reset last_reflink_trans after fsyncing inode fdmanana
2022-02-17 12:12 ` [PATCH 6/7] btrfs: fix unexpected error path when reflinking an inline extent fdmanana
2022-02-17 12:12 ` [PATCH 7/7] btrfs: deal with unexpected extent type during reflinking fdmanana
2022-02-21 16:15 ` [PATCH 0/7] btrfs: a fix for fsync and a few improvements to the full fsync path David Sterba
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220222075154.1347.409509F4@e16-tech.com \
--to=wangyugui@e16-tech.com \
--cc=fdmanana@kernel.org \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.