From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
Filipe Manana <fdmanana@suse.com>
Subject: Re: [PATCH] Btrfs: fix order by which delayed references are run
Date: Mon, 27 Jul 2015 17:22:28 +0800 [thread overview]
Message-ID: <55B5F854.10303@cn.fujitsu.com> (raw)
In-Reply-To: <CAL3q7H6Ki25cOyV1-Xj716mpiQD3qU5Eq3CLubh80P7wRou57A@mail.gmail.com>
Filipe Manana wrote on 2015/07/27 09:26 +0100:
> On Mon, Jul 27, 2015 at 7:53 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>> Hi Filipe,
>
> Hi Qu,
>
>>
>> Sorry for the late reply after it is already merged,
>> but I'm a little concerned about the extra loop to find the first inc
>> delayed ref.
>>
>> It may take some extra time when there are a lot of delayed refs.
>>
>> What about allowing deleting the extent item at dec delayed ref time and
>> then add it back for later inc delayed refs?
>
> So, the reason I did it this way is simplicity - it's pretty much what
> the pre 4.2-rc1 [1] code did, but slightly more efficient because it
> iterates a linked list rather than a red black tree using rb_next().
> So it seems somewhat odd that you're worrying about this after a
> functional fix when we pretty much always had this behavior in
> place...
>
> Given the complexity of what you propose, I would prefer if we have a
> benchmark that indeed shows this is a performance critical area (I
> don't think this list can get that huge, but I might be wrong).
I agree with you that this may not be a performance hotspot.
And in fact, before the qgroup patchset, we tested the list implement of
delayed ref for performance regression, and the difference is so little
that can be easily ignored.
So I think even we skip the loop, the performance will only change little.
One of the main purpose to change the delayed ref to list, is to make it
a little more straightforward, without the hard to understand/review rb
tree compare.
So I hope there is another more simple and straightforward one.
But as you mentioned, your fix is already simple enough, and after a
short code search, my idea to fix may not be as simple as I thought.
So sorry for the disturb, and thanks for your fix again.
Thanks,
Qu
>
> I also like the current approach of triggering a warning/bug_on/assert
> if the extent item isn't found in the extent tree - it helps detecting
> more quickly if a logic bug elsewhere exists.
>
> But anyway, I'm ok if you are willing to implement such approach if
> there's indeed a test/benchmark to justify more complex/optimized
> code.
>
> thanks
>
> [1] - https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/extent-tree.c?id=refs/tags/v4.1.3#n2327
>
>>
>> Thanks,
>> Qu
>>
>> wrote on 2015/07/09 15:50 +0100:
>>>
>>> From: Filipe Manana <fdmanana@suse.com>
>>>
>>>
>>> When we have an extent that got N references removed and N new references
>>> added in the same transaction, we must run the insertion of the references
>>> first because otherwise the last removed reference will remove the extent
>>> item from the extent tree, resulting in a failure for the insertions.
>>>
>>> This is a regression introduced in the 4.2-rc1 release and this fix just
>>> brings back the behaviour of selecting reference additions before any
>>> reference removals.
>>>
>>> The following test case for fstests reproduces the issue:
>>>
>>> seq=`basename $0`
>>> seqres=$RESULT_DIR/$seq
>>> echo "QA output created by $seq"
>>> tmp=/tmp/$$
>>> status=1 # failure is the default!
>>> trap "_cleanup; exit \$status" 0 1 2 3 15
>>>
>>> _cleanup()
>>> {
>>> _cleanup_flakey
>>> rm -f $tmp.*
>>> }
>>>
>>> # get standard environment, filters and checks
>>> . ./common/rc
>>> . ./common/filter
>>> . ./common/dmflakey
>>>
>>> # real QA test starts here
>>> _need_to_be_root
>>> _supported_fs btrfs
>>> _supported_os Linux
>>> _require_scratch
>>> _require_dm_flakey
>>> _require_cloner
>>> _require_metadata_journaling $SCRATCH_DEV
>>>
>>> rm -f $seqres.full
>>>
>>> _scratch_mkfs >>$seqres.full 2>&1
>>> _init_flakey
>>> _mount_flakey
>>>
>>> # Create prealloc extent covering range [160K, 620K[
>>> $XFS_IO_PROG -f -c "falloc 160K 460K" $SCRATCH_MNT/foo
>>>
>>> # Now write to the last 80K of the prealloc extent plus 40K to the
>>> unallocated
>>> # space that immediately follows it. This creates a new extent of 40K
>>> that spans
>>> # the range [620K, 660K[.
>>> $XFS_IO_PROG -c "pwrite -S 0xaa 540K 120K" $SCRATCH_MNT/foo |
>>> _filter_xfs_io
>>>
>>> # At this point, there are now 2 back references to the prealloc extent
>>> in our
>>> # extent tree. Both are for our file offset 160K and one relates to a
>>> file
>>> # extent item with a data offset of 0 and a length of 380K, while the
>>> other
>>> # relates to a file extent item with a data offset of 380K and a length
>>> of 80K.
>>>
>>> # Make sure everything done so far is durably persisted (all back
>>> references are
>>> # in the extent tree, etc).
>>> sync
>>>
>>> # Now clone all extents of our file that cover the offset 160K up to
>>> its eof
>>> # (660K at this point) into itself at offset 2M. This leaves a hole in
>>> the file
>>> # covering the range [660K, 2M[. The prealloc extent will now be
>>> referenced by
>>> # the file twice, once for offset 160K and once for offset 2M. The 40K
>>> extent
>>> # that follows the prealloc extent will also be referenced twice by our
>>> file,
>>> # once for offset 620K and once for offset 2M + 460K.
>>> $CLONER_PROG -s $((160 * 1024)) -d $((2 * 1024 * 1024)) -l 0
>>> $SCRATCH_MNT/foo \
>>> $SCRATCH_MNT/foo
>>>
>>> # Now create one new extent in our file with a size of 100Kb. It will
>>> span the
>>> # range [3M, 3M + 100K[. It also will cause creation of a hole spanning
>>> the
>>> # range [2M + 460K, 3M[. Our new file size is 3M + 100K.
>>> $XFS_IO_PROG -c "pwrite -S 0xbb 3M 100K" $SCRATCH_MNT/foo |
>>> _filter_xfs_io
>>>
>>> # At this point, there are now (in memory) 4 back references to the
>>> prealloc
>>> # extent.
>>> #
>>> # Two of them are for file offset 160K, related to file extent items
>>> # matching the file offsets 160K and 540K respectively, with data
>>> offsets of
>>> # 0 and 380K respectively, and with lengths of 380K and 80K
>>> respectively.
>>> #
>>> # The other two references are for file offset 2M, related to file
>>> extent items
>>> # matching the file offsets 2M and 2M + 380K respectively, with data
>>> offsets of
>>> # 0 and 380K respectively, and with lengths of 389K and 80K
>>> respectively.
>>> #
>>> # The 40K extent has 2 back references, one for file offset 620K and
>>> the other
>>> # for file offset 2M + 460K.
>>> #
>>> # The 100K extent has a single back reference and it relates to file
>>> offset 3M.
>>>
>>> # Now clone our 100K extent into offset 600K. That offset covers the
>>> last 20K
>>> # of the prealloc extent, the whole 40K extent and 40K of the hole
>>> starting at
>>> # offset 660K.
>>> $CLONER_PROG -s $((3 * 1024 * 1024)) -d $((600 * 1024)) -l $((100 *
>>> 1024)) \
>>> $SCRATCH_MNT/foo $SCRATCH_MNT/foo
>>>
>>> # At this point there's only one reference to the 40K extent, at file
>>> offset
>>> # 2M + 460K, we have 4 references for the prealloc extent (2 for file
>>> offset
>>> # 160K and 2 for file offset 2M) and 2 references for the 100K extent
>>> (1 for
>>> # file offset 3M and a new one for file offset 600K).
>>>
>>> # Now fsync our file to make all its new data and metadata updates are
>>> durably
>>> # persisted and present if a power failure/crash happens after a
>>> successful
>>> # fsync and before the next transaction commit.
>>> $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
>>>
>>> echo "File digest before power failure:"
>>> md5sum $SCRATCH_MNT/foo | _filter_scratch
>>>
>>> # Silently drop all writes and ummount to simulate a crash/power
>>> failure.
>>> _load_flakey_table $FLAKEY_DROP_WRITES
>>> _unmount_flakey
>>>
>>> # Allow writes again, mount to trigger log replay and validate file
>>> contents.
>>> # During log replay, the btrfs delayed references implementation used
>>> to run the
>>> # deletion of back references before the addition of new back
>>> references, which
>>> # made the addition fail as it didn't find the key in the extent tree
>>> that it
>>> # was looking for. The failure triggered by this test was related to
>>> the 40K
>>> # extent, which got 1 reference dropped and 1 reference added during
>>> the fsync
>>> # log replay - when running the delayed references at transaction
>>> commit time,
>>> # btrfs was applying the deletion before the insertion, resulting in a
>>> failure
>>> # of the insertion that ended up turning the fs into read-only mode.
>>> _load_flakey_table $FLAKEY_ALLOW_WRITES
>>> _mount_flakey
>>>
>>> echo "File digest after log replay:"
>>> md5sum $SCRATCH_MNT/foo | _filter_scratch
>>>
>>> _unmount_flakey
>>>
>>> status=0
>>> exit
>>>
>>> This issue turned the filesystem into read-only mode (current transaction
>>> aborted) and produced the following traces:
>>>
>>> [ 8247.578385] ------------[ cut here ]------------
>>> [ 8247.579947] WARNING: CPU: 0 PID: 11341 at
>>> fs/btrfs/extent-tree.c:1547 lookup_inline_extent_backref+0x17d/0x45d
>>> [btrfs]()
>>> (...)
>>> [ 8247.601697] Call Trace:
>>> [ 8247.602222] [<ffffffff8145f077>] dump_stack+0x4f/0x7b
>>> [ 8247.604320] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
>>> [ 8247.605488] [<ffffffffa0506c8d>] ?
>>> lookup_inline_extent_backref+0x17d/0x45d [btrfs]
>>> [ 8247.608226] [<ffffffffa0506c8d>]
>>> lookup_inline_extent_backref+0x17d/0x45d [btrfs]
>>> [ 8247.617061] [<ffffffffa0507957>]
>>> insert_inline_extent_backref+0x41/0xb2 [btrfs]
>>> [ 8247.621856] [<ffffffffa0507c4f>] __btrfs_inc_extent_ref+0x8c/0x20a
>>> [btrfs]
>>> [ 8247.624366] [<ffffffffa050ee60>]
>>> __btrfs_run_delayed_refs+0xb0c/0xd49 [btrfs]
>>> [ 8247.626176] [<ffffffffa0510dcd>] btrfs_run_delayed_refs+0x6d/0x1d4
>>> [btrfs]
>>> [ 8247.627435] [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
>>> [ 8247.628531] [<ffffffffa0520482>]
>>> btrfs_commit_transaction+0x4c/0xa20 [btrfs]
>>> (...)
>>> [ 8247.648430] ---[ end trace 2461e55f92c2ac2d ]---
>>>
>>> [ 8247.727263] WARNING: CPU: 3 PID: 11341 at
>>> fs/btrfs/extent-tree.c:2771 btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]()
>>> [ 8247.728954] BTRFS: Transaction aborted (error -5)
>>> (...)
>>> [ 8247.760866] Call Trace:
>>> [ 8247.761534] [<ffffffff8145f077>] dump_stack+0x4f/0x7b
>>> [ 8247.764271] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
>>> [ 8247.767582] [<ffffffffa0510e04>] ?
>>> btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]
>>> [ 8247.769373] [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
>>> [ 8247.770836] [<ffffffffa0510e04>] btrfs_run_delayed_refs+0xa4/0x1d4
>>> [btrfs]
>>> [ 8247.772532] [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
>>> [ 8247.773664] [<ffffffffa0520482>]
>>> btrfs_commit_transaction+0x4c/0xa20 [btrfs]
>>> [ 8247.775047] [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
>>> [ 8247.776176] [<ffffffff81155dd5>] ? kmem_cache_free+0x12b/0x189
>>> [ 8247.777427] [<ffffffffa055a920>]
>>> btrfs_recover_log_trees+0x2da/0x33d [btrfs]
>>> [ 8247.778575] [<ffffffffa055898e>] ? replay_one_extent+0x4fc/0x4fc
>>> [btrfs]
>>> [ 8247.779838] [<ffffffffa051e265>] open_ctree+0x1cc0/0x201a [btrfs]
>>> [ 8247.781020] [<ffffffff81120f48>] ? register_shrinker+0x56/0x81
>>> [ 8247.782285] [<ffffffffa04fb12c>] btrfs_mount+0x5f0/0x734 [btrfs]
>>> (...)
>>> [ 8247.793394] ---[ end trace 2461e55f92c2ac2e ]---
>>> [ 8247.794276] BTRFS: error (device dm-0) in
>>> btrfs_run_delayed_refs:2771: errno=-5 IO failure
>>> [ 8247.797335] BTRFS: error (device dm-0) in btrfs_replay_log:2375:
>>> errno=-5 IO failure (Failed to recover log tree)
>>>
>>> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root
>>> in ref_head.")
>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>>
>>> ---
>>> fs/btrfs/extent-tree.c | 13 +++++++++++++
>>> 1 file changed, 13 insertions(+)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 1c2bd17..bccceea5 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -2296,9 +2296,22 @@ static int run_one_delayed_ref(struct
>>> btrfs_trans_handle *trans,
>>> static inline struct btrfs_delayed_ref_node *
>>> select_delayed_ref(struct btrfs_delayed_ref_head *head)
>>> {
>>> + struct btrfs_delayed_ref_node *ref;
>>> +
>>> if (list_empty(&head->ref_list))
>>> return NULL;
>>>
>>> + /*
>>> + * Select a delayed ref of type BTRFS_ADD_DELAYED_REF first.
>>> + * This is to prevent a ref count from going down to zero, which
>>> deletes
>>> + * the extent item from the extent tree, when here still are
>>> references
>>> + * to add, which would fail because they would not find the extent
>>> item.
>>> + */
>>> + list_for_each_entry(ref, &head->ref_list, list) {
>>> + if (ref->action == BTRFS_ADD_DELAYED_REF)
>>> + return ref;
>>> + }
>>> +
>>> return list_entry(head->ref_list.next, struct
>>> btrfs_delayed_ref_node,
>>> list);
>>> }
>>>
>>
next prev parent reply other threads:[~2015-07-27 9:22 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-07-09 14:50 [PATCH] Btrfs: fix order by which delayed references are run fdmanana
2015-07-10 0:28 ` Qu Wenruo
2015-07-27 6:53 ` Qu Wenruo
2015-07-27 8:26 ` Filipe Manana
2015-07-27 9:22 ` Qu Wenruo [this message]
2015-07-27 13:50 ` Chris Mason
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=55B5F854.10303@cn.fujitsu.com \
--to=quwenruo@cn.fujitsu.com \
--cc=fdmanana@kernel.org \
--cc=fdmanana@suse.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).