linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: <fdmanana@kernel.org>, <linux-btrfs@vger.kernel.org>
Cc: Filipe Manana <fdmanana@suse.com>
Subject: Re: [PATCH] Btrfs: fix order by which delayed references are run
Date: Mon, 27 Jul 2015 14:53:14 +0800	[thread overview]
Message-ID: <55B5D55A.8060806@cn.fujitsu.com> (raw)
In-Reply-To: <1436453447-27466-1-git-send-email-fdmanana@kernel.org>

Hi Filipe,

Sorry for the late reply after it is already merged,
but I'm a little concerned about the extra loop to find the first inc 
delayed ref.

It may take some extra time when there are a lot of delayed refs.

What about allowing deleting the extent item at dec delayed ref time and 
then add it back for later inc delayed refs?

Thanks,
Qu

  wrote on 2015/07/09 15:50 +0100:
> From: Filipe Manana <fdmanana@suse.com>
>
> When we have an extent that got N references removed and N new references
> added in the same transaction, we must run the insertion of the references
> first because otherwise the last removed reference will remove the extent
> item from the extent tree, resulting in a failure for the insertions.
>
> This is a regression introduced in the 4.2-rc1 release and this fix just
> brings back the behaviour of selecting reference additions before any
> reference removals.
>
> The following test case for fstests reproduces the issue:
>
>    seq=`basename $0`
>    seqres=$RESULT_DIR/$seq
>    echo "QA output created by $seq"
>    tmp=/tmp/$$
>    status=1	# failure is the default!
>    trap "_cleanup; exit \$status" 0 1 2 3 15
>
>    _cleanup()
>    {
>        _cleanup_flakey
>        rm -f $tmp.*
>    }
>
>    # get standard environment, filters and checks
>    . ./common/rc
>    . ./common/filter
>    . ./common/dmflakey
>
>    # real QA test starts here
>    _need_to_be_root
>    _supported_fs btrfs
>    _supported_os Linux
>    _require_scratch
>    _require_dm_flakey
>    _require_cloner
>    _require_metadata_journaling $SCRATCH_DEV
>
>    rm -f $seqres.full
>
>    _scratch_mkfs >>$seqres.full 2>&1
>    _init_flakey
>    _mount_flakey
>
>    # Create prealloc extent covering range [160K, 620K[
>    $XFS_IO_PROG -f -c "falloc 160K 460K" $SCRATCH_MNT/foo
>
>    # Now write to the last 80K of the prealloc extent plus 40K to the unallocated
>    # space that immediately follows it. This creates a new extent of 40K that spans
>    # the range [620K, 660K[.
>    $XFS_IO_PROG -c "pwrite -S 0xaa 540K 120K" $SCRATCH_MNT/foo | _filter_xfs_io
>
>    # At this point, there are now 2 back references to the prealloc extent in our
>    # extent tree. Both are for our file offset 160K and one relates to a file
>    # extent item with a data offset of 0 and a length of 380K, while the other
>    # relates to a file extent item with a data offset of 380K and a length of 80K.
>
>    # Make sure everything done so far is durably persisted (all back references are
>    # in the extent tree, etc).
>    sync
>
>    # Now clone all extents of our file that cover the offset 160K up to its eof
>    # (660K at this point) into itself at offset 2M. This leaves a hole in the file
>    # covering the range [660K, 2M[. The prealloc extent will now be referenced by
>    # the file twice, once for offset 160K and once for offset 2M. The 40K extent
>    # that follows the prealloc extent will also be referenced twice by our file,
>    # once for offset 620K and once for offset 2M + 460K.
>    $CLONER_PROG -s $((160 * 1024)) -d $((2 * 1024 * 1024)) -l 0 $SCRATCH_MNT/foo \
> 	$SCRATCH_MNT/foo
>
>    # Now create one new extent in our file with a size of 100Kb. It will span the
>    # range [3M, 3M + 100K[. It also will cause creation of a hole spanning the
>    # range [2M + 460K, 3M[. Our new file size is 3M + 100K.
>    $XFS_IO_PROG -c "pwrite -S 0xbb 3M 100K" $SCRATCH_MNT/foo | _filter_xfs_io
>
>    # At this point, there are now (in memory) 4 back references to the prealloc
>    # extent.
>    #
>    # Two of them are for file offset 160K, related to file extent items
>    # matching the file offsets 160K and 540K respectively, with data offsets of
>    # 0 and 380K respectively, and with lengths of 380K and 80K respectively.
>    #
>    # The other two references are for file offset 2M, related to file extent items
>    # matching the file offsets 2M and 2M + 380K respectively, with data offsets of
>    # 0 and 380K respectively, and with lengths of 389K and 80K respectively.
>    #
>    # The 40K extent has 2 back references, one for file offset 620K and the other
>    # for file offset 2M + 460K.
>    #
>    # The 100K extent has a single back reference and it relates to file offset 3M.
>
>    # Now clone our 100K extent into offset 600K. That offset covers the last 20K
>    # of the prealloc extent, the whole 40K extent and 40K of the hole starting at
>    # offset 660K.
>    $CLONER_PROG -s $((3 * 1024 * 1024)) -d $((600 * 1024)) -l $((100 * 1024)) \
>        $SCRATCH_MNT/foo $SCRATCH_MNT/foo
>
>    # At this point there's only one reference to the 40K extent, at file offset
>    # 2M + 460K, we have 4 references for the prealloc extent (2 for file offset
>    # 160K and 2 for file offset 2M) and 2 references for the 100K extent (1 for
>    # file offset 3M and a new one for file offset 600K).
>
>    # Now fsync our file to make all its new data and metadata updates are durably
>    # persisted and present if a power failure/crash happens after a successful
>    # fsync and before the next transaction commit.
>    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
>
>    echo "File digest before power failure:"
>    md5sum $SCRATCH_MNT/foo | _filter_scratch
>
>    # Silently drop all writes and ummount to simulate a crash/power failure.
>    _load_flakey_table $FLAKEY_DROP_WRITES
>    _unmount_flakey
>
>    # Allow writes again, mount to trigger log replay and validate file contents.
>    # During log replay, the btrfs delayed references implementation used to run the
>    # deletion of back references before the addition of new back references, which
>    # made the addition fail as it didn't find the key in the extent tree that it
>    # was looking for. The failure triggered by this test was related to the 40K
>    # extent, which got 1 reference dropped and 1 reference added during the fsync
>    # log replay - when running the delayed references at transaction commit time,
>    # btrfs was applying the deletion before the insertion, resulting in a failure
>    # of the insertion that ended up turning the fs into read-only mode.
>    _load_flakey_table $FLAKEY_ALLOW_WRITES
>    _mount_flakey
>
>    echo "File digest after log replay:"
>    md5sum $SCRATCH_MNT/foo | _filter_scratch
>
>    _unmount_flakey
>
>    status=0
>    exit
>
> This issue turned the filesystem into read-only mode (current transaction
> aborted) and produced the following traces:
>
>    [ 8247.578385] ------------[ cut here ]------------
>    [ 8247.579947] WARNING: CPU: 0 PID: 11341 at fs/btrfs/extent-tree.c:1547 lookup_inline_extent_backref+0x17d/0x45d [btrfs]()
>    (...)
>    [ 8247.601697] Call Trace:
>    [ 8247.602222]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
>    [ 8247.604320]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
>    [ 8247.605488]  [<ffffffffa0506c8d>] ? lookup_inline_extent_backref+0x17d/0x45d [btrfs]
>    [ 8247.608226]  [<ffffffffa0506c8d>] lookup_inline_extent_backref+0x17d/0x45d [btrfs]
>    [ 8247.617061]  [<ffffffffa0507957>] insert_inline_extent_backref+0x41/0xb2 [btrfs]
>    [ 8247.621856]  [<ffffffffa0507c4f>] __btrfs_inc_extent_ref+0x8c/0x20a [btrfs]
>    [ 8247.624366]  [<ffffffffa050ee60>] __btrfs_run_delayed_refs+0xb0c/0xd49 [btrfs]
>    [ 8247.626176]  [<ffffffffa0510dcd>] btrfs_run_delayed_refs+0x6d/0x1d4 [btrfs]
>    [ 8247.627435]  [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
>    [ 8247.628531]  [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs]
>    (...)
>    [ 8247.648430] ---[ end trace 2461e55f92c2ac2d ]---
>
>    [ 8247.727263] WARNING: CPU: 3 PID: 11341 at fs/btrfs/extent-tree.c:2771 btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]()
>    [ 8247.728954] BTRFS: Transaction aborted (error -5)
>    (...)
>    [ 8247.760866] Call Trace:
>    [ 8247.761534]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
>    [ 8247.764271]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
>    [ 8247.767582]  [<ffffffffa0510e04>] ? btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]
>    [ 8247.769373]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
>    [ 8247.770836]  [<ffffffffa0510e04>] btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]
>    [ 8247.772532]  [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
>    [ 8247.773664]  [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs]
>    [ 8247.775047]  [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
>    [ 8247.776176]  [<ffffffff81155dd5>] ? kmem_cache_free+0x12b/0x189
>    [ 8247.777427]  [<ffffffffa055a920>] btrfs_recover_log_trees+0x2da/0x33d [btrfs]
>    [ 8247.778575]  [<ffffffffa055898e>] ? replay_one_extent+0x4fc/0x4fc [btrfs]
>    [ 8247.779838]  [<ffffffffa051e265>] open_ctree+0x1cc0/0x201a [btrfs]
>    [ 8247.781020]  [<ffffffff81120f48>] ? register_shrinker+0x56/0x81
>    [ 8247.782285]  [<ffffffffa04fb12c>] btrfs_mount+0x5f0/0x734 [btrfs]
>    (...)
>    [ 8247.793394] ---[ end trace 2461e55f92c2ac2e ]---
>    [ 8247.794276] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2771: errno=-5 IO failure
>    [ 8247.797335] BTRFS: error (device dm-0) in btrfs_replay_log:2375: errno=-5 IO failure (Failed to recover log tree)
>
> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/extent-tree.c | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 1c2bd17..bccceea5 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2296,9 +2296,22 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
>   static inline struct btrfs_delayed_ref_node *
>   select_delayed_ref(struct btrfs_delayed_ref_head *head)
>   {
> +	struct btrfs_delayed_ref_node *ref;
> +
>   	if (list_empty(&head->ref_list))
>   		return NULL;
>
> +	/*
> +	 * Select a delayed ref of type BTRFS_ADD_DELAYED_REF first.
> +	 * This is to prevent a ref count from going down to zero, which deletes
> +	 * the extent item from the extent tree, when here still are references
> +	 * to add, which would fail because they would not find the extent item.
> +	 */
> +	list_for_each_entry(ref, &head->ref_list, list) {
> +		if (ref->action == BTRFS_ADD_DELAYED_REF)
> +			return ref;
> +	}
> +
>   	return list_entry(head->ref_list.next, struct btrfs_delayed_ref_node,
>   			  list);
>   }
>

  parent reply	other threads:[~2015-07-27  6:53 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-07-09 14:50 [PATCH] Btrfs: fix order by which delayed references are run fdmanana
2015-07-10  0:28 ` Qu Wenruo
2015-07-27  6:53 ` Qu Wenruo [this message]
2015-07-27  8:26   ` Filipe Manana
2015-07-27  9:22     ` Qu Wenruo
2015-07-27 13:50       ` Chris Mason

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=55B5D55A.8060806@cn.fujitsu.com \
    --to=quwenruo@cn.fujitsu.com \
    --cc=fdmanana@kernel.org \
    --cc=fdmanana@suse.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).