From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from cn.fujitsu.com ([59.151.112.132]:43148 "EHLO heian.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S1751316AbbGJA3A (ORCPT ); Thu, 9 Jul 2015 20:29:00 -0400 Subject: Re: [PATCH] Btrfs: fix order by which delayed references are run To: , References: <1436453447-27466-1-git-send-email-fdmanana@kernel.org> CC: Filipe Manana From: Qu Wenruo Message-ID: <559F11C8.8090604@cn.fujitsu.com> Date: Fri, 10 Jul 2015 08:28:56 +0800 MIME-Version: 1.0 In-Reply-To: <1436453447-27466-1-git-send-email-fdmanana@kernel.org> Content-Type: text/plain; charset="utf-8"; format=flowed Sender: linux-btrfs-owner@vger.kernel.org List-ID: wrote on 2015/07/09 15:50 +0100: > From: Filipe Manana > > When we have an extent that got N references removed and N new references > added in the same transaction, we must run the insertion of the references > first because otherwise the last removed reference will remove the extent > item from the extent tree, resulting in a failure for the insertions. > > This is a regression introduced in the 4.2-rc1 release and this fix just > brings back the behaviour of selecting reference additions before any > reference removals. > Thanks, Filipe, that's right, it's my fault to forgot such case. Acked-by: Qu Wenruo Thanks, Qu > The following test case for fstests reproduces the issue: > > seq=`basename $0` > seqres=$RESULT_DIR/$seq > echo "QA output created by $seq" > tmp=/tmp/$$ > status=1 # failure is the default! > trap "_cleanup; exit \$status" 0 1 2 3 15 > > _cleanup() > { > _cleanup_flakey > rm -f $tmp.* > } > > # get standard environment, filters and checks > . ./common/rc > . ./common/filter > . ./common/dmflakey > > # real QA test starts here > _need_to_be_root > _supported_fs btrfs > _supported_os Linux > _require_scratch > _require_dm_flakey > _require_cloner > _require_metadata_journaling $SCRATCH_DEV > > rm -f $seqres.full > > _scratch_mkfs >>$seqres.full 2>&1 > _init_flakey > _mount_flakey > > # Create prealloc extent covering range [160K, 620K[ > $XFS_IO_PROG -f -c "falloc 160K 460K" $SCRATCH_MNT/foo > > # Now write to the last 80K of the prealloc extent plus 40K to the unallocated > # space that immediately follows it. This creates a new extent of 40K that spans > # the range [620K, 660K[. > $XFS_IO_PROG -c "pwrite -S 0xaa 540K 120K" $SCRATCH_MNT/foo | _filter_xfs_io > > # At this point, there are now 2 back references to the prealloc extent in our > # extent tree. Both are for our file offset 160K and one relates to a file > # extent item with a data offset of 0 and a length of 380K, while the other > # relates to a file extent item with a data offset of 380K and a length of 80K. > > # Make sure everything done so far is durably persisted (all back references are > # in the extent tree, etc). > sync > > # Now clone all extents of our file that cover the offset 160K up to its eof > # (660K at this point) into itself at offset 2M. This leaves a hole in the file > # covering the range [660K, 2M[. The prealloc extent will now be referenced by > # the file twice, once for offset 160K and once for offset 2M. The 40K extent > # that follows the prealloc extent will also be referenced twice by our file, > # once for offset 620K and once for offset 2M + 460K. > $CLONER_PROG -s $((160 * 1024)) -d $((2 * 1024 * 1024)) -l 0 $SCRATCH_MNT/foo \ > $SCRATCH_MNT/foo > > # Now create one new extent in our file with a size of 100Kb. It will span the > # range [3M, 3M + 100K[. It also will cause creation of a hole spanning the > # range [2M + 460K, 3M[. Our new file size is 3M + 100K. > $XFS_IO_PROG -c "pwrite -S 0xbb 3M 100K" $SCRATCH_MNT/foo | _filter_xfs_io > > # At this point, there are now (in memory) 4 back references to the prealloc > # extent. > # > # Two of them are for file offset 160K, related to file extent items > # matching the file offsets 160K and 540K respectively, with data offsets of > # 0 and 380K respectively, and with lengths of 380K and 80K respectively. > # > # The other two references are for file offset 2M, related to file extent items > # matching the file offsets 2M and 2M + 380K respectively, with data offsets of > # 0 and 380K respectively, and with lengths of 389K and 80K respectively. > # > # The 40K extent has 2 back references, one for file offset 620K and the other > # for file offset 2M + 460K. > # > # The 100K extent has a single back reference and it relates to file offset 3M. > > # Now clone our 100K extent into offset 600K. That offset covers the last 20K > # of the prealloc extent, the whole 40K extent and 40K of the hole starting at > # offset 660K. > $CLONER_PROG -s $((3 * 1024 * 1024)) -d $((600 * 1024)) -l $((100 * 1024)) \ > $SCRATCH_MNT/foo $SCRATCH_MNT/foo > > # At this point there's only one reference to the 40K extent, at file offset > # 2M + 460K, we have 4 references for the prealloc extent (2 for file offset > # 160K and 2 for file offset 2M) and 2 references for the 100K extent (1 for > # file offset 3M and a new one for file offset 600K). > > # Now fsync our file to make all its new data and metadata updates are durably > # persisted and present if a power failure/crash happens after a successful > # fsync and before the next transaction commit. > $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo > > echo "File digest before power failure:" > md5sum $SCRATCH_MNT/foo | _filter_scratch > > # Silently drop all writes and ummount to simulate a crash/power failure. > _load_flakey_table $FLAKEY_DROP_WRITES > _unmount_flakey > > # Allow writes again, mount to trigger log replay and validate file contents. > # During log replay, the btrfs delayed references implementation used to run the > # deletion of back references before the addition of new back references, which > # made the addition fail as it didn't find the key in the extent tree that it > # was looking for. The failure triggered by this test was related to the 40K > # extent, which got 1 reference dropped and 1 reference added during the fsync > # log replay - when running the delayed references at transaction commit time, > # btrfs was applying the deletion before the insertion, resulting in a failure > # of the insertion that ended up turning the fs into read-only mode. > _load_flakey_table $FLAKEY_ALLOW_WRITES > _mount_flakey > > echo "File digest after log replay:" > md5sum $SCRATCH_MNT/foo | _filter_scratch > > _unmount_flakey > > status=0 > exit > > This issue turned the filesystem into read-only mode (current transaction > aborted) and produced the following traces: > > [ 8247.578385] ------------[ cut here ]------------ > [ 8247.579947] WARNING: CPU: 0 PID: 11341 at fs/btrfs/extent-tree.c:1547 lookup_inline_extent_backref+0x17d/0x45d [btrfs]() > (...) > [ 8247.601697] Call Trace: > [ 8247.602222] [] dump_stack+0x4f/0x7b > [ 8247.604320] [] warn_slowpath_common+0xa1/0xbb > [ 8247.605488] [] ? lookup_inline_extent_backref+0x17d/0x45d [btrfs] > [ 8247.608226] [] lookup_inline_extent_backref+0x17d/0x45d [btrfs] > [ 8247.617061] [] insert_inline_extent_backref+0x41/0xb2 [btrfs] > [ 8247.621856] [] __btrfs_inc_extent_ref+0x8c/0x20a [btrfs] > [ 8247.624366] [] __btrfs_run_delayed_refs+0xb0c/0xd49 [btrfs] > [ 8247.626176] [] btrfs_run_delayed_refs+0x6d/0x1d4 [btrfs] > [ 8247.627435] [] ? __cache_free+0x4a7/0x4b6 > [ 8247.628531] [] btrfs_commit_transaction+0x4c/0xa20 [btrfs] > (...) > [ 8247.648430] ---[ end trace 2461e55f92c2ac2d ]--- > > [ 8247.727263] WARNING: CPU: 3 PID: 11341 at fs/btrfs/extent-tree.c:2771 btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]() > [ 8247.728954] BTRFS: Transaction aborted (error -5) > (...) > [ 8247.760866] Call Trace: > [ 8247.761534] [] dump_stack+0x4f/0x7b > [ 8247.764271] [] warn_slowpath_common+0xa1/0xbb > [ 8247.767582] [] ? btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs] > [ 8247.769373] [] warn_slowpath_fmt+0x46/0x48 > [ 8247.770836] [] btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs] > [ 8247.772532] [] ? __cache_free+0x4a7/0x4b6 > [ 8247.773664] [] btrfs_commit_transaction+0x4c/0xa20 [btrfs] > [ 8247.775047] [] ? trace_hardirqs_on+0xd/0xf > [ 8247.776176] [] ? kmem_cache_free+0x12b/0x189 > [ 8247.777427] [] btrfs_recover_log_trees+0x2da/0x33d [btrfs] > [ 8247.778575] [] ? replay_one_extent+0x4fc/0x4fc [btrfs] > [ 8247.779838] [] open_ctree+0x1cc0/0x201a [btrfs] > [ 8247.781020] [] ? register_shrinker+0x56/0x81 > [ 8247.782285] [] btrfs_mount+0x5f0/0x734 [btrfs] > (...) > [ 8247.793394] ---[ end trace 2461e55f92c2ac2e ]--- > [ 8247.794276] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2771: errno=-5 IO failure > [ 8247.797335] BTRFS: error (device dm-0) in btrfs_replay_log:2375: errno=-5 IO failure (Failed to recover log tree) > > Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.") > Signed-off-by: Filipe Manana > --- > fs/btrfs/extent-tree.c | 13 +++++++++++++ > 1 file changed, 13 insertions(+) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 1c2bd17..bccceea5 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -2296,9 +2296,22 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans, > static inline struct btrfs_delayed_ref_node * > select_delayed_ref(struct btrfs_delayed_ref_head *head) > { > + struct btrfs_delayed_ref_node *ref; > + > if (list_empty(&head->ref_list)) > return NULL; > > + /* > + * Select a delayed ref of type BTRFS_ADD_DELAYED_REF first. > + * This is to prevent a ref count from going down to zero, which deletes > + * the extent item from the extent tree, when here still are references > + * to add, which would fail because they would not find the extent item. > + */ > + list_for_each_entry(ref, &head->ref_list, list) { > + if (ref->action == BTRFS_ADD_DELAYED_REF) > + return ref; > + } > + > return list_entry(head->ref_list.next, struct btrfs_delayed_ref_node, > list); > } >